Relcache refactoring

Started by Heikki Linnakangasalmost 2 years ago6 messageshackers
Jump to latest
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

While looking at the recent bug report from Alexander Lakhin [1]/messages/by-id/e56be7d9-14b1-664d-0bfc-00ce9772721c@gmail.com, I got
annoyed by the relcache code, and RelationClearRelation in particular. I
propose to refactor it for clarity.

[1]: /messages/by-id/e56be7d9-14b1-664d-0bfc-00ce9772721c@gmail.com
/messages/by-id/e56be7d9-14b1-664d-0bfc-00ce9772721c@gmail.com

## Patch 1

This is just a narrow fix for the reported bug [1]/messages/by-id/e56be7d9-14b1-664d-0bfc-00ce9772721c@gmail.com, same as I posted on
that thread. Included here because I wrote the refactorings on top of
this patch and didn't commit it yet.

## Patch 2: Simplify call to rebuild relcache entry for indexes

To rebuild a relcache entry that's been marked as invalid,
RelationIdGetRelation() calls RelationReloadIndexInfo() for indexes and
RelationClearRelation(rebuild == true) for other relations. However,
RelationClearRelation calls RelationReloadIndexInfo() for indexes
anyway, so RelationIdGetRelation() can just always call
RelationClearRelation() and let RelationClearRelation() do the right
thing to rebuild the relation, whether it's an index or something else.
That seems more straightforward.

Also add comments explaining how the rebuild works at index creation.
It's a bit special, see the comments.

## Patch 3: Split RelationClearRelation into three different functions

RelationClearRelation() is complicated. Depending on the 'rebuild'
argument and the circumstances, like if it's called in a transaction and
whether the relation is an index, a nailed relation, a regular table, or
a relation dropped in the same xact, it does different things:

- Remove the relation completely from the cache (rebuild == false),
- Mark the entry as invalid (rebuild == true, but not in xact), or
- Rebuild the entry (rebuild == true).

The callers have expectations on what they want it to do. Mostly the
callers with 'rebuild == false' expect the entry to be removed, and
callers with 'rebuild == true' expect it to be rebuilt or invalidated,
but there are exceptions. RelationForgetRelation() for example sets
rd_droppedSubid and expects RelationClearRelation() to then merely
invalidate it, and the call from RelationIdGetRelation() expects it to
rebuild, not merely invalidate it.

I propose to split RelationClearRelation() into three functions:

RelationInvalidateRelation: mark the relcache entry as invalid, so that
it it is rebuilt on next access.
RelationRebuildRelation: rebuild the relcache entry in-place.
RelationClearRelation: Remove the entry from the relcache.

This moves the responsibility of deciding the right action to the
callers. Which they were mostly already doing. Each of those actions
have different preconditions, e.g. RelationRebuildRelation() can only be
called in a valid transaction, and RelationClearRelation() can only be
called if the reference count is zero. Splitting them makes those
preconditions more clear, we can have assertions to document them in each.

## RelationInitPhysicalAddr() call in RelationReloadNailed()

One question or little doubt I have: Before these patches,
RelationReloadNailed() calls RelationInitPhysicalAddr() even when it
leaves the relation as invalidated because we're not in a transaction or
if the relation isn't currently in use. That's a bit surprising, I'd
expect it to be done when the entry is reloaded, not when it's
invalidated. That's how it's done for non-nailed relations. And in fact,
for a nailed relation, RelationInitPhysicalAddr() is called *again* when
it's reloaded.

Is it important? Before commit a54e1f1587, nailed non-index relations
were not reloaded at all, except for the call to
RelationInitPhysicalAddr(), which seemed consistent. I think this was
unintentional in commit a54e1f1587, or perhaps just overly defensive, as
there's no harm in some extra RelationInitPhysicalAddr() calls.

This patch removes that extra call from the invalidation path, but if it
turns out to be wrong, we can easily add it to RelationInvalidateRelation.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Make-RelationFlushRelation-work-without-ResourceOwne.patchtext/x-patch; charset=UTF-8; name=0001-Make-RelationFlushRelation-work-without-ResourceOwne.patchDownload+122-21
0002-Simplify-call-to-rebuild-relcache-entry-for-indexes.patchtext/x-patch; charset=UTF-8; name=0002-Simplify-call-to-rebuild-relcache-entry-for-indexes.patchDownload+17-25
0003-Split-RelationClearRelation-into-three-different-fun.patchtext/x-patch; charset=UTF-8; name=0003-Split-RelationClearRelation-into-three-different-fun.patchDownload+141-164
#2jian he
jian.universality@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Relcache refactoring

On Wed, Jun 5, 2024 at 9:56 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

## Patch 3: Split RelationClearRelation into three different functions

RelationClearRelation() is complicated. Depending on the 'rebuild'
argument and the circumstances, like if it's called in a transaction and
whether the relation is an index, a nailed relation, a regular table, or
a relation dropped in the same xact, it does different things:

- Remove the relation completely from the cache (rebuild == false),
- Mark the entry as invalid (rebuild == true, but not in xact), or
- Rebuild the entry (rebuild == true).

The callers have expectations on what they want it to do. Mostly the
callers with 'rebuild == false' expect the entry to be removed, and
callers with 'rebuild == true' expect it to be rebuilt or invalidated,
but there are exceptions. RelationForgetRelation() for example sets
rd_droppedSubid and expects RelationClearRelation() to then merely
invalidate it, and the call from RelationIdGetRelation() expects it to
rebuild, not merely invalidate it.

I propose to split RelationClearRelation() into three functions:

RelationInvalidateRelation: mark the relcache entry as invalid, so that
it it is rebuilt on next access.
RelationRebuildRelation: rebuild the relcache entry in-place.
RelationClearRelation: Remove the entry from the relcache.

This moves the responsibility of deciding the right action to the
callers. Which they were mostly already doing. Each of those actions
have different preconditions, e.g. RelationRebuildRelation() can only be
called in a valid transaction, and RelationClearRelation() can only be
called if the reference count is zero. Splitting them makes those
preconditions more clear, we can have assertions to document them in each.

one minor issue.

static void
RelationClearRelation(Relation relation)
{
Assert(RelationHasReferenceCountZero(relation));
Assert(!relation->rd_isnailed);

/*
* Relations created in the same transaction must never be removed, see
* RelationFlushRelation.
*/
Assert(relation->rd_createSubid == InvalidSubTransactionId);
Assert(relation->rd_firstRelfilelocatorSubid == InvalidSubTransactionId);
Assert(relation->rd_droppedSubid == InvalidSubTransactionId);

/* Ensure it's closed at smgr level */
RelationCloseSmgr(relation);

/* Free AM cached data, if any */
if (relation->rd_amcache)
pfree(relation->rd_amcache);
relation->rd_amcache = NULL;

/* Mark it as invalid (just pro forma since we will free it below) */
relation->rd_isvalid = false;

/* Remove it from the hash table */
RelationCacheDelete(relation);

/* And release storage */
RelationDestroyRelation(relation, false);
}

can be simplified as

static void
RelationClearRelation(Relation relation)
{
---bunch of Asserts

/* first mark it as invalid */
RelationInvalidateRelation(relation);

/* Remove it from the hash table */
RelationCacheDelete(relation);

/* And release storage */
RelationDestroyRelation(relation, false);
}
?

in RelationRebuildRelation
we can also use RelationInvalidateRelation?

* We assume that at the time we are called, we have at least AccessShareLock
* on the target index. (Note: in the calls from RelationClearRelation,
* this is legitimate because we know the rel has positive refcount.)

calling RelationClearRelation, rel->rd_refcnt == 0
seems conflicted with the above comments in RelationReloadIndexInfo.
so i am confused with the above comments.

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: jian he (#2)
Re: Relcache refactoring

On 23/09/2024 04:39, jian he wrote:

static void
RelationClearRelation(Relation relation)
{
Assert(RelationHasReferenceCountZero(relation));
Assert(!relation->rd_isnailed);

/*
* Relations created in the same transaction must never be removed, see
* RelationFlushRelation.
*/
Assert(relation->rd_createSubid == InvalidSubTransactionId);
Assert(relation->rd_firstRelfilelocatorSubid == InvalidSubTransactionId);
Assert(relation->rd_droppedSubid == InvalidSubTransactionId);

/* Ensure it's closed at smgr level */
RelationCloseSmgr(relation);

/* Free AM cached data, if any */
if (relation->rd_amcache)
pfree(relation->rd_amcache);
relation->rd_amcache = NULL;

/* Mark it as invalid (just pro forma since we will free it below) */
relation->rd_isvalid = false;

/* Remove it from the hash table */
RelationCacheDelete(relation);

/* And release storage */
RelationDestroyRelation(relation, false);
}

can be simplified as

static void
RelationClearRelation(Relation relation)
{
---bunch of Asserts

/* first mark it as invalid */
RelationInvalidateRelation(relation);

/* Remove it from the hash table */
RelationCacheDelete(relation);

/* And release storage */
RelationDestroyRelation(relation, false);
}
?

in RelationRebuildRelation
we can also use RelationInvalidateRelation?

Yep, that makes sense, changed them that way.

* We assume that at the time we are called, we have at least AccessShareLock
* on the target index. (Note: in the calls from RelationClearRelation,
* this is legitimate because we know the rel has positive refcount.)

calling RelationClearRelation, rel->rd_refcnt == 0
seems conflicted with the above comments in RelationReloadIndexInfo.
so i am confused with the above comments.

That comment is outdated, because after these patches, there's only once
caller, in RelationRebuildRelation. Fixed that, thanks for noticing!

Hmm, in case of a nailed index, this will get called with refcnt == 1,
even though we have no AccessShareLock on the index. But I guess that's
OK for a nailed index.

Committed with those fixes, thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#3)
Re: Relcache refactoring

On 31/10/2024 10:14, Heikki Linnakangas wrote:

Committed with those fixes, thanks for the review!

There is a failure in the buildfarm animal 'prion', which builds with
-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options. I'll look
into that later today. (I did test with -DRELCACHE_FORCE_RELEASE on my
laptop, but must've missed something).

--
Heikki Linnakangas
Neon (https://neon.tech)

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#4)
Re: Relcache refactoring

On 31/10/2024 12:06, Heikki Linnakangas wrote:

On 31/10/2024 10:14, Heikki Linnakangas wrote:

Committed with those fixes, thanks for the review!

There is a failure in the buildfarm animal 'prion', which builds with
-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options. I'll look
into that later today. (I did test with -DRELCACHE_FORCE_RELEASE on my
laptop, but must've missed something).

I was finally able to reproduce this, by running the failing pg_regress
tests concurrently with repeated "vacuum full pg_database" calls. It's
curious that 'prion' failed on the first run after the commit, I was not
able to reproduce it by just running the unmodified regression tests
with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE.

Committed a fix. There was one codepath that was missing a call to
RelationInitPhysicalAddr(relation) after the patch. I alluded to this
earlier in this thread:

## RelationInitPhysicalAddr() call in RelationReloadNailed()

One question or little doubt I have: Before these patches,
RelationReloadNailed() calls RelationInitPhysicalAddr() even when it
leaves the relation as invalidated because we're not in a transaction or
if the relation isn't currently in use. That's a bit surprising, I'd
expect it to be done when the entry is reloaded, not when it's
invalidated. That's how it's done for non-nailed relations. And in fact,
for a nailed relation, RelationInitPhysicalAddr() is called *again* when
it's reloaded.

Is it important? Before commit a54e1f1587, nailed non-index relations
were not reloaded at all, except for the call to
RelationInitPhysicalAddr(), which seemed consistent. I think this was
unintentional in commit a54e1f1587, or perhaps just overly defensive, as
there's no harm in some extra RelationInitPhysicalAddr() calls.

This patch removes that extra call from the invalidation path, but if it
turns out to be wrong, we can easily add it to RelationInvalidateRelation.

Now this wasn't exactly that, but related.

--
Heikki Linnakangas
Neon (https://neon.tech)

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#5)
Re: Relcache refactoring

Heikki Linnakangas <hlinnaka@iki.fi> writes:

I was finally able to reproduce this, by running the failing pg_regress
tests concurrently with repeated "vacuum full pg_database" calls. It's
curious that 'prion' failed on the first run after the commit, I was not
able to reproduce it by just running the unmodified regression tests
with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE.

Committed a fix. There was one codepath that was missing a call to
RelationInitPhysicalAddr(relation) after the patch.

Just as a note, BF members broadbill and mamushi showed similar
symptoms [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=broadbill&amp;dt=2024-10-31%2013%3A34%3A08[2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamushi&amp;dt=2024-10-31%2016%3A08%3A43. Neither of them use FORCE_RELEASE options AFAICS.
I'm guessing appearance of the failure is simply a timing issue.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=broadbill&amp;dt=2024-10-31%2013%3A34%3A08
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamushi&amp;dt=2024-10-31%2016%3A08%3A43