Combine Prune and Freeze records emitted by vacuum

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#1)

Re: Combine Prune and Freeze records emitted by vacuum

On 25/01/2024 00:49, Melanie Plageman wrote:

Generates 30% fewer WAL records and 12% fewer WAL bytes -- which,
depending on what else is happening on the system, can lead to vacuum
spending substantially less time on WAL writing and syncing (often 15%
less time on WAL writes and 10% less time on syncing WAL in my
testing).

Nice!

The attached patch set is broken up into many separate commits for
ease of review. Each patch does a single thing which can be explained
plainly in the commit message. Every commit passes tests and works on
its own.

About this very first change:

--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1526,8 +1526,7 @@ lazy_scan_prune(LVRelState *vacrel,
* that everyone sees it as committed?
*/
xmin = HeapTupleHeaderGetXmin(htup);
-					if (!TransactionIdPrecedes(xmin,
-											   vacrel->cutoffs.OldestXmin))
+					if (!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin))
{
all_visible = false;
break;

Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

I read through all the patches in order, and aside from the above they
all look correct to me. Some comments on the set as whole:

I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type
anymore. By removing that, you also get rid of the freeze-only codepath
near the end of heap_page_prune(), and the
heap_freeze_execute_prepared() function.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.

FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use
pagefrz->cutoffs now.

HeapPageFreeze has two "trackers", for the "freeze" and "no freeze"
cases. heap_page_prune() needs to track both, until it decides whether
to freeze or not. But it doesn't make much sense that the caller
(lazy_scan_prune()) has to initialize both, and has to choose which of
the values to use after the call depending on whether heap_page_prune()
froze or not. The two trackers should be just heap_page_prune()'s
internal business.

HeapPageFreeze is a bit confusing in general, as it's both an input and
an output to heap_page_prune(). Not sure what exactly to do there, but I
feel that we should make heap_page_prune()'s interface more clear.
Perhaps move the output fields to PruneResult.

Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as
freezing is now an integral part of the function. And mention it in the
function comment, too.

In heap_prune_chain:

* Tuple visibility information is provided in presult->htsv.

Not this patch's fault directly, but it's not immediate clear what "is
provided" means here. Does the caller provide it, or does the function
set it, ie. is it an input or output argument? Looking at the code, it's
an input, but now it looks a bit weird that an input argument is called
'presult'.

--
Heikki Linnakangas
Neon (https://neon.tech)

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#2)

Re: Combine Prune and Freeze records emitted by vacuum

Thanks so much for the review!

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 25/01/2024 00:49, Melanie Plageman wrote:

The attached patch set is broken up into many separate commits for
ease of review. Each patch does a single thing which can be explained
plainly in the commit message. Every commit passes tests and works on
its own.

About this very first change:
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1526,8 +1526,7 @@ lazy_scan_prune(LVRelState *vacrel,
* that everyone sees it as committed?
*/
xmin = HeapTupleHeaderGetXmin(htup);
-                                     if (!TransactionIdPrecedes(xmin,
-                                                                                        vacrel->cutoffs.OldestXmin))
+                                     if (!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin))
{
all_visible = false;
break;
Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

Okay, so I thought a lot about this, and I don't understand how
GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
correctly.

vacrel->cutoffs.OldestXmin is computed initially from
GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
GlobalVisState is updated by ComputeXidHorizons() (when it is
updated).

I do see that the comment above GlobalVisTestIsRemovableXid() says

* It is crucial that this only gets called for xids from a source that
* protects against xid wraparounds (e.g. from a table and thus protected by
* relfrozenxid).

and then in

* Convert 32 bit argument to FullTransactionId. We can do so safely
* because we know the xid has to, at the very least, be between
* [oldestXid, nextXid), i.e. within 2 billion of xid.

I'm not sure what oldestXid is here.
It's true that I don't see GlobalVisTestIsRemovableXid() being called
anywhere else with an xmin as an input. I think that hints that it is
not safe with FrozenTransactionId. But I don't see what could go
wrong.

Maybe it has something to do with converting it to a FullTransactionId?

FullTransactionIdFromU64(U64FromFullTransactionId(rel) + (int32)
(xid - rel_xid));

Sorry, I couldn't quite figure it out :(

I read through all the patches in order, and aside from the above they
all look correct to me. Some comments on the set as whole:

I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type
anymore. By removing that, you also get rid of the freeze-only codepath
near the end of heap_page_prune(), and the
heap_freeze_execute_prepared() function.

That makes sense to me.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.

I'm happy to change up xl_heap_prune format. In its current form,
according to pahole, it has no holes and just 3 bytes of padding at
the end.

One way we could make it smaller is by moving the isCatalogRel member
to directly after snapshotConflictHorizon and then adding a flags
field and defining flags to indicate whether or not other members
exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
nredirected, nunused, nplans, ndead and just put the number of
redirected/unused/etc at the beginning of the arrays containing the
offsets. Then I could write various macros for accessing them. That
would make it smaller, but it definitely wouldn't make it less complex
(IMO).

FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use
pagefrz->cutoffs now.

Yep, I forgot to add a commit to do this. Thanks!

HeapPageFreeze has two "trackers", for the "freeze" and "no freeze"
cases. heap_page_prune() needs to track both, until it decides whether
to freeze or not. But it doesn't make much sense that the caller
(lazy_scan_prune()) has to initialize both, and has to choose which of
the values to use after the call depending on whether heap_page_prune()
froze or not. The two trackers should be just heap_page_prune()'s
internal business.

HeapPageFreeze is a bit confusing in general, as it's both an input and
an output to heap_page_prune(). Not sure what exactly to do there, but I
feel that we should make heap_page_prune()'s interface more clear.
Perhaps move the output fields to PruneResult.

Great point. Perhaps I just add NewRelfrozenXid and NewRelminMxid to
PruneResult (and call it PruneFreezeResult) and then make
VacuumCutoffs an optional argument to heap_page_prune() (used by
vacuum and not on-access pruning). Then I eliminate HeapPageFreeze as
a parameter altogether.

Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as
freezing is now an integral part of the function. And mention it in the
function comment, too.

Agreed. Will do in the next version. I want to get some consensus on
what to do with xl_heap_prune before going on my rebase journey with
these 15 patches.

In heap_prune_chain:

* Tuple visibility information is provided in presult->htsv.

Not this patch's fault directly, but it's not immediate clear what "is
provided" means here. Does the caller provide it, or does the function
set it, ie. is it an input or output argument? Looking at the code, it's
an input, but now it looks a bit weird that an input argument is called
'presult'.

So, htsv is a member of PruneResult on master because
heap_page_prune() populates PruneResult->htsv for use in
lazy_scan_prune(). heap_prune_chain() doesn't have access to
PruneResult on master. Once I move PruneResult to being populated both
by heap_page_prune() and heap_prune_chain(), it gets more confusing.
htsv is always populated in heap_page_prune(), but it is not until
later patches in the set that I stop accessing it in
lazy_scan_prune(). Once I do so, I move htsv from PruneResult into
PruneState -- which fixes the heap_prune_chain() confusion.

So, only intermediate commits in the set have PruneResult->htsv used
in heap_prune_chain(). The end state is that heap_prune_chain()
accesses PruneState->htsv. However, I can document how it is used more
clearly in the function comment in the intermediate commits. Or, I can
simply leave htsv as a separate input argument to heap_prune_chain()
in the intermediate commits.

- Melanie

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#3)

Re: Combine Prune and Freeze records emitted by vacuum

On 09/03/2024 22:41, Melanie Plageman wrote:

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

Okay, so I thought a lot about this, and I don't understand how
GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
correctly.

vacrel->cutoffs.OldestXmin is computed initially from
GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
GlobalVisState is updated by ComputeXidHorizons() (when it is
updated).

I do see that the comment above GlobalVisTestIsRemovableXid() says

* It is crucial that this only gets called for xids from a source that
* protects against xid wraparounds (e.g. from a table and thus protected by
* relfrozenxid).

and then in

* Convert 32 bit argument to FullTransactionId. We can do so safely
* because we know the xid has to, at the very least, be between
* [oldestXid, nextXid), i.e. within 2 billion of xid.

I'm not sure what oldestXid is here.
It's true that I don't see GlobalVisTestIsRemovableXid() being called
anywhere else with an xmin as an input. I think that hints that it is
not safe with FrozenTransactionId. But I don't see what could go
wrong.

Maybe it has something to do with converting it to a FullTransactionId?

FullTransactionIdFromU64(U64FromFullTransactionId(rel) + (int32)
(xid - rel_xid));

Sorry, I couldn't quite figure it out :(

I just tested it. No, GlobalVisTestIsRemovableXid does not work for
FrozenTransactionId. I just tested it with state->definitely_needed ==
{0, 4000000000} and xid == FrozenTransactionid, and it incorrectly
returned 'false'. It treats FrozenTransactionId as if was a regular xid '2'.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.

I'm happy to change up xl_heap_prune format. In its current form,
according to pahole, it has no holes and just 3 bytes of padding at
the end.

One way we could make it smaller is by moving the isCatalogRel member
to directly after snapshotConflictHorizon and then adding a flags
field and defining flags to indicate whether or not other members
exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
nredirected, nunused, nplans, ndead and just put the number of
redirected/unused/etc at the beginning of the arrays containing the
offsets.

Sounds good.

Then I could write various macros for accessing them. That
would make it smaller, but it definitely wouldn't make it less complex
(IMO).

I don't know, it might turn out not that complex. If you define the
formats of each of those "sub-record types" as explicit structs, per
attached sketch, you won't need so many macros. Some care is still
needed with alignment though.

--
Heikki Linnakangas
Neon (https://neon.tech)

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#4)

Re: Combine Prune and Freeze records emitted by vacuum

On Mon, Mar 11, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 09/03/2024 22:41, Melanie Plageman wrote:

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

Okay, so I thought a lot about this, and I don't understand how
GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
correctly.

vacrel->cutoffs.OldestXmin is computed initially from
GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
GlobalVisState is updated by ComputeXidHorizons() (when it is
updated).

I do see that the comment above GlobalVisTestIsRemovableXid() says

* It is crucial that this only gets called for xids from a source that
* protects against xid wraparounds (e.g. from a table and thus protected by
* relfrozenxid).

and then in

* Convert 32 bit argument to FullTransactionId. We can do so safely
* because we know the xid has to, at the very least, be between
* [oldestXid, nextXid), i.e. within 2 billion of xid.

I'm not sure what oldestXid is here.
It's true that I don't see GlobalVisTestIsRemovableXid() being called
anywhere else with an xmin as an input. I think that hints that it is
not safe with FrozenTransactionId. But I don't see what could go
wrong.

Maybe it has something to do with converting it to a FullTransactionId?

FullTransactionIdFromU64(U64FromFullTransactionId(rel) + (int32)
(xid - rel_xid));

Sorry, I couldn't quite figure it out :(

I just tested it. No, GlobalVisTestIsRemovableXid does not work for
FrozenTransactionId. I just tested it with state->definitely_needed ==
{0, 4000000000} and xid == FrozenTransactionid, and it incorrectly
returned 'false'. It treats FrozenTransactionId as if was a regular xid '2'.

I see. Looking at the original code:
if (!TransactionIdPrecedes(xmin,
vacrel->cutoffs.OldestXmin))

And the source code for TransactionIdPrecedes:

if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
return (id1 < id2);

	diff = (int32) (id1 - id2);
	return (diff < 0);

In your example, It seems like you mean GlobalVisState->maybe_needed is
0 and GlobalVisState->definitely_needed = 4000000000. In this example,
if vacrel->cutoffs.OldestXmin was 0, we would get a wrong answer also.

But, I do see that the comparison done by TransactionIdPrecedes() is is
quite different than that done by FullTransactionIdPrecedes() because of
the modulo 2**32 arithmetic.

Solving the handling of FrozenTransactionId specifically by
GlobalVisTestIsRemovableXid() seems like it would be done easily in our
case by wrapping it in a function which checks if
TransactionIdIsNormal() and returns true if it is not normal. But, I'm
not sure if I am missing the larger problem.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.

I'm happy to change up xl_heap_prune format. In its current form,
according to pahole, it has no holes and just 3 bytes of padding at
the end.

One way we could make it smaller is by moving the isCatalogRel member
to directly after snapshotConflictHorizon and then adding a flags
field and defining flags to indicate whether or not other members
exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
nredirected, nunused, nplans, ndead and just put the number of
redirected/unused/etc at the beginning of the arrays containing the
offsets.

Sounds good.

Then I could write various macros for accessing them. That
would make it smaller, but it definitely wouldn't make it less complex
(IMO).

I don't know, it might turn out not that complex. If you define the
formats of each of those "sub-record types" as explicit structs, per
attached sketch, you won't need so many macros. Some care is still
needed with alignment though.

In the attached v2, I've done as you suggested and made all members
except flags and snapshotConflictHorizon optional in the xl_heap_prune
struct and obsoleted the xl_heap_freeze struct. I've kept the actual
xl_heap_freeze_page struct and heap_xlog_freeze_page() function so that
we can replay previously made XLOG_HEAP2_FREEZE_PAGE records.

Because we may set line pointers unused during vacuum's first pass, I
couldn't use the presence of nowunused without redirected or dead items
to indicate that this was a record emitted by vacuum's second pass. As
such, I haven't obsoleted the xl_heap_vacuum struct. I was thinking I
could add a flag that indicates the record was emitted by vacuum's
second pass? We would want to distinguish this so that we could set the
items unused without calling heap_page_prune_execute() -- because that
calls PageRepairFragmentation() which requires a full cleanup lock.

I introduced a few sub-record types similar to what you suggested --
they help a bit with alignment, so I think they are worth keeping. There
are comments around them, but perhaps a larger diagram of the layout of
a the new XLOG_HEAP2_PRUNE record would be helpful.

There is a bit of duplicated code between heap_xlog_prune() and
heap2_desc() since they both need to deserialize the record. Before the
code to do this was small and it didn't matter, but it might be worth
refactoring it that way now.

Note that I've made all of the changes related to obsoleting the
XLOG_HEAP2_FREEZE_PAGE record in separate commits on top of the rest of
the set for ease of review. However, I've rebased the other review
feedback into the relevant commits.

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type
anymore. By removing that, you also get rid of the freeze-only codepath
near the end of heap_page_prune(), and the
heap_freeze_execute_prepared() function.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.

On the point of removing the freeze-only code path from
heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
I realized that heap_pre_freeze_checks() was not being called in the
case that we decide to freeze because we emitted an FPI while setting
the hint bit. I've fixed that, however, I've done so by moving
heap_pre_freeze_checks() into the critical section. I think that is not
okay? I could move it earlier and not do call it when the hint bit FPI
leads us to freeze tuples. But, I think that would lead to us doing a
lot less validation of tuples being frozen when checksums are enabled.
Or, I could make two critical sections?

FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use
pagefrz->cutoffs now.

Fixed this.

HeapPageFreeze has two "trackers", for the "freeze" and "no freeze"
cases. heap_page_prune() needs to track both, until it decides whether
to freeze or not. But it doesn't make much sense that the caller
(lazy_scan_prune()) has to initialize both, and has to choose which of
the values to use after the call depending on whether heap_page_prune()
froze or not. The two trackers should be just heap_page_prune()'s
internal business.

I've added new_relminmxid and new_relfrozenxid to PruneFreezeResult and
set them appropriately in heap_page_prune_and_freeze().

It's a bit sad because if it wasn't for vacrel->skippedallvis,
vacrel->NewRelfrozenXid and vacrel->NewRelminMxid would be
vacrel->cutoffs.OldestXmin and vacrel->cutoffs.OldestMxact respectively
and we could avoid having lazy_scan_prune() initializing the
HeapPageFreeze altogether and just pass VacuumCutoffs (and
heap_page_prune_opt() could pass NULL) to heap_page_prune_and_freeze().

I think it is probably worse to add both of them as additional optional
arguments, so I've just left lazy_scan_prune() with the job of
initializing them.

In heap_page_prune_and_freeze(), I initialize new_relminmxid and
new_relfrozenxid to InvalidMultiXactId and InvalidTransactionId
respectively because on-access pruning doesn't have a value to set them
to. But I wasn't sure if this was okay -- since I don't see validation
that TransactionIdIsValid() in vac_update_relstats(). It will work now
-- just worried about future issues. I could add an assert there?

HeapPageFreeze is a bit confusing in general, as it's both an input and
an output to heap_page_prune(). Not sure what exactly to do there, but I
feel that we should make heap_page_prune()'s interface more clear.
Perhaps move the output fields to PruneResult.

HeapPageFrz is now only an input argument to
heap_page_prune_and_freeze() as of the commit in which
heap_page_prune_and_freeze() becomes responsible for executing freezing.
It is still an in/out param to earlier commits because we still need
info from it to execute freezing in lazy_scan_prune().

Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as
freezing is now an integral part of the function. And mention it in the
function comment, too.

I've done this.

In heap_prune_chain:

* Tuple visibility information is provided in presult->htsv.

Not this patch's fault directly, but it's not immediate clear what "is
provided" means here. Does the caller provide it, or does the function
set it, ie. is it an input or output argument? Looking at the code, it's
an input, but now it looks a bit weird that an input argument is called
'presult'.

I haven't updated the comments about this in the intermediate commits
since it ends up in the PruneState as an input.

- Melanie

melanieplageman@gmail.com

about 2 years ago

In reply to: Melanie Plageman (#5)

Re: Combine Prune and Freeze records emitted by vacuum

On Wed, Mar 13, 2024 at 07:25:56PM -0400, Melanie Plageman wrote:

On Mon, Mar 11, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 09/03/2024 22:41, Melanie Plageman wrote:

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

Okay, so I thought a lot about this, and I don't understand how
GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
correctly.

vacrel->cutoffs.OldestXmin is computed initially from
GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
GlobalVisState is updated by ComputeXidHorizons() (when it is
updated).

I do see that the comment above GlobalVisTestIsRemovableXid() says

* It is crucial that this only gets called for xids from a source that
* protects against xid wraparounds (e.g. from a table and thus protected by
* relfrozenxid).

and then in

* Convert 32 bit argument to FullTransactionId. We can do so safely
* because we know the xid has to, at the very least, be between
* [oldestXid, nextXid), i.e. within 2 billion of xid.

I'm not sure what oldestXid is here.
It's true that I don't see GlobalVisTestIsRemovableXid() being called
anywhere else with an xmin as an input. I think that hints that it is
not safe with FrozenTransactionId. But I don't see what could go
wrong.

Maybe it has something to do with converting it to a FullTransactionId?

FullTransactionIdFromU64(U64FromFullTransactionId(rel) + (int32)
(xid - rel_xid));

Sorry, I couldn't quite figure it out :(

I just tested it. No, GlobalVisTestIsRemovableXid does not work for
FrozenTransactionId. I just tested it with state->definitely_needed ==
{0, 4000000000} and xid == FrozenTransactionid, and it incorrectly
returned 'false'. It treats FrozenTransactionId as if was a regular xid '2'.

I see. Looking at the original code:
if (!TransactionIdPrecedes(xmin,
vacrel->cutoffs.OldestXmin))

And the source code for TransactionIdPrecedes:

if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
return (id1 < id2);
diff = (int32) (id1 - id2);
return (diff < 0);
In your example, It seems like you mean GlobalVisState->maybe_needed is
0 and GlobalVisState->definitely_needed = 4000000000. In this example,
if vacrel->cutoffs.OldestXmin was 0, we would get a wrong answer also.

But, I do see that the comparison done by TransactionIdPrecedes() is is
quite different than that done by FullTransactionIdPrecedes() because of
the modulo 2**32 arithmetic.

Solving the handling of FrozenTransactionId specifically by
GlobalVisTestIsRemovableXid() seems like it would be done easily in our
case by wrapping it in a function which checks if
TransactionIdIsNormal() and returns true if it is not normal. But, I'm
not sure if I am missing the larger problem.

I've made such a wrapper in attached v3.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.

I'm happy to change up xl_heap_prune format. In its current form,
according to pahole, it has no holes and just 3 bytes of padding at
the end.

One way we could make it smaller is by moving the isCatalogRel member
to directly after snapshotConflictHorizon and then adding a flags
field and defining flags to indicate whether or not other members
exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
nredirected, nunused, nplans, ndead and just put the number of
redirected/unused/etc at the beginning of the arrays containing the
offsets.

Sounds good.

Then I could write various macros for accessing them. That
would make it smaller, but it definitely wouldn't make it less complex
(IMO).

I don't know, it might turn out not that complex. If you define the
formats of each of those "sub-record types" as explicit structs, per
attached sketch, you won't need so many macros. Some care is still
needed with alignment though.

In the attached v2, I've done as you suggested and made all members
except flags and snapshotConflictHorizon optional in the xl_heap_prune
struct and obsoleted the xl_heap_freeze struct. I've kept the actual
xl_heap_freeze_page struct and heap_xlog_freeze_page() function so that
we can replay previously made XLOG_HEAP2_FREEZE_PAGE records.

Because we may set line pointers unused during vacuum's first pass, I
couldn't use the presence of nowunused without redirected or dead items
to indicate that this was a record emitted by vacuum's second pass. As
such, I haven't obsoleted the xl_heap_vacuum struct. I was thinking I
could add a flag that indicates the record was emitted by vacuum's
second pass? We would want to distinguish this so that we could set the
items unused without calling heap_page_prune_execute() -- because that
calls PageRepairFragmentation() which requires a full cleanup lock.

Okay, so I was going to start using xl_heap_prune for vacuum here too,
but I realized it would be bigger because of the
snapshotConflictHorizon. Do you think there is a non-terrible way to
make the snapshotConflictHorizon optional? Like with a flag?

I introduced a few sub-record types similar to what you suggested --
they help a bit with alignment, so I think they are worth keeping. There
are comments around them, but perhaps a larger diagram of the layout of
a the new XLOG_HEAP2_PRUNE record would be helpful.

I started doing this, but I can't find a way of laying out the diagram
that pgindent doesn't destroy. I thought I remember other diagrams in
the source code showing the layout of something (something with pages
somewhere?) that don't get messed up by pgindent, but I couldn't find
them.

There is a bit of duplicated code between heap_xlog_prune() and
heap2_desc() since they both need to deserialize the record. Before the
code to do this was small and it didn't matter, but it might be worth
refactoring it that way now.

I have added a helper function to do the deserialization,
heap_xlog_deserialize_prune_and_freeze(). But I didn't start using it in
heap2_desc() because of the way the pg_waldump build file works. Do you
think the helper belongs in any of waldump's existing sources?

pg_waldump_sources = files(
'compat.c',
'pg_waldump.c',
'rmgrdesc.c',
)

pg_waldump_sources += rmgr_desc_sources
pg_waldump_sources += xlogreader_sources
pg_waldump_sources += files('../../backend/access/transam/xlogstats.c')

Otherwise, I assume I am suppose to avoid adding some big new include to
waldump.

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type
anymore. By removing that, you also get rid of the freeze-only codepath
near the end of heap_page_prune(), and the
heap_freeze_execute_prepared() function.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.

On the point of removing the freeze-only code path from
heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
I realized that heap_pre_freeze_checks() was not being called in the
case that we decide to freeze because we emitted an FPI while setting
the hint bit. I've fixed that, however, I've done so by moving
heap_pre_freeze_checks() into the critical section. I think that is not
okay? I could move it earlier and not do call it when the hint bit FPI
leads us to freeze tuples. But, I think that would lead to us doing a
lot less validation of tuples being frozen when checksums are enabled.
Or, I could make two critical sections?

I found another approach and just do the pre-freeze checks if we are
considering freezing except for the FPI criteria.

HeapPageFreeze has two "trackers", for the "freeze" and "no freeze"
cases. heap_page_prune() needs to track both, until it decides whether
to freeze or not. But it doesn't make much sense that the caller
(lazy_scan_prune()) has to initialize both, and has to choose which of
the values to use after the call depending on whether heap_page_prune()
froze or not. The two trackers should be just heap_page_prune()'s
internal business.

I've added new_relminmxid and new_relfrozenxid to PruneFreezeResult and
set them appropriately in heap_page_prune_and_freeze().

It's a bit sad because if it wasn't for vacrel->skippedallvis,
vacrel->NewRelfrozenXid and vacrel->NewRelminMxid would be
vacrel->cutoffs.OldestXmin and vacrel->cutoffs.OldestMxact respectively
and we could avoid having lazy_scan_prune() initializing the
HeapPageFreeze altogether and just pass VacuumCutoffs (and
heap_page_prune_opt() could pass NULL) to heap_page_prune_and_freeze().

I think it is probably worse to add both of them as additional optional
arguments, so I've just left lazy_scan_prune() with the job of
initializing them.

In heap_page_prune_and_freeze(), I initialize new_relminmxid and
new_relfrozenxid to InvalidMultiXactId and InvalidTransactionId
respectively because on-access pruning doesn't have a value to set them
to. But I wasn't sure if this was okay -- since I don't see validation
that TransactionIdIsValid() in vac_update_relstats(). It will work now
-- just worried about future issues. I could add an assert there?

I looked more closely at vac_update_relstats() and it won't update
relfrozenxid unless the proposed value is smaller than the existing one.
And for MultiXactIds, it checks if it is valid. So, this is not an
issue.

I've also optimized the member ordering of PruneFreezeResult. pahole
identified some avoidable holes. I went through each commit and
optimized the PruneResult data structure as members are being added and
removed.

- Melanie

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#6)

Re: Combine Prune and Freeze records emitted by vacuum

On 15/03/2024 02:56, Melanie Plageman wrote:

Okay, so I was going to start using xl_heap_prune for vacuum here too,
but I realized it would be bigger because of the
snapshotConflictHorizon. Do you think there is a non-terrible way to
make the snapshotConflictHorizon optional? Like with a flag?

Yeah, another flag would do the trick.

I introduced a few sub-record types similar to what you suggested --
they help a bit with alignment, so I think they are worth keeping. There
are comments around them, but perhaps a larger diagram of the layout of
a the new XLOG_HEAP2_PRUNE record would be helpful.

I started doing this, but I can't find a way of laying out the diagram
that pgindent doesn't destroy. I thought I remember other diagrams in
the source code showing the layout of something (something with pages
somewhere?) that don't get messed up by pgindent, but I couldn't find
them.

See src/tools/pgindent/README, section "Cleaning up in case of failure
or ugly output":

/*----------
* Text here will not be touched by pgindent.
*/

There is a bit of duplicated code between heap_xlog_prune() and
heap2_desc() since they both need to deserialize the record. Before the
code to do this was small and it didn't matter, but it might be worth
refactoring it that way now.

I have added a helper function to do the deserialization,
heap_xlog_deserialize_prune_and_freeze(). But I didn't start using it in
heap2_desc() because of the way the pg_waldump build file works. Do you
think the helper belongs in any of waldump's existing sources?

pg_waldump_sources = files(
'compat.c',
'pg_waldump.c',
'rmgrdesc.c',
)

pg_waldump_sources += rmgr_desc_sources
pg_waldump_sources += xlogreader_sources
pg_waldump_sources += files('../../backend/access/transam/xlogstats.c')

Otherwise, I assume I am suppose to avoid adding some big new include to
waldump.

Didn't look closely at that, but there's some precedent with
commit/prepare/abort records. See ParseCommitRecord, xl_xact_commit,
xl_parsed_commit et al.

Note that we don't provide WAL compatibility across major versions. You
can fully remove the old xl_heap_freeze_page format. (We should bump
XLOG_PAGE_MAGIC when this is committed though)

On the point of removing the freeze-only code path from
heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
I realized that heap_pre_freeze_checks() was not being called in the
case that we decide to freeze because we emitted an FPI while setting
the hint bit. I've fixed that, however, I've done so by moving
heap_pre_freeze_checks() into the critical section. I think that is not
okay? I could move it earlier and not do call it when the hint bit FPI
leads us to freeze tuples. But, I think that would lead to us doing a
lot less validation of tuples being frozen when checksums are enabled.
Or, I could make two critical sections?

I found another approach and just do the pre-freeze checks if we are
considering freezing except for the FPI criteria.

Hmm, I think you can make this simpler if you use
XLogCheckBufferNeedsBackup() to check if the hint-bit update will
generate a full-page-image before entering the critical section. Like
you did to check if pruning will generate a full-page-image. I included
that change in the attached patches.

I don't fully understand this:

/*
* If we will freeze tuples on the page or they were all already frozen
* on the page, if we will set the page all-frozen in the visibility map,
* we can advance relfrozenxid and relminmxid to the values in
* pagefrz->FreezePageRelfrozenXid and pagefrz->FreezePageRelminMxid.
*/
if (presult->all_frozen || presult->nfrozen > 0)
{
presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
}
else
{
presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
}

Firstly, the comment is outdated, because we have already done the
freezing at this point. But more importantly, I don't understand the
logic here. Could we check just presult->nfrozen > 0 here and get the
same result?

I think it is probably worse to add both of them as additional optional
arguments, so I've just left lazy_scan_prune() with the job of
initializing them.

Ok.

Here are some patches on top of your patches for some further
refactorings. Some notable changes in heap_page_prune_and_freeze():

- I moved the heap_prepare_freeze_tuple() work from the 2nd loop to the
1st one. It seems more clear and efficient that way.

- I extracted the code to generate the WAL record to a separate function.

- Refactored the "will setting hint bit generate FPI" check as discussed
above

These patches are in a very rough WIP state, but I wanted to share
early. I haven't done much testing, and I'm not wedded to these changes,
but I think they make it more readable. Please include / squash in the
patch set if you agree with them.

Please also take a look at the comments I marked with HEIKKI or FIXME,
in the patches and commit messages.

I'll wait for a new version from you before reviewing more.

--
Heikki Linnakangas
Neon (https://neon.tech)

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#7)

Re: Combine Prune and Freeze records emitted by vacuum

On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote:

On 15/03/2024 02:56, Melanie Plageman wrote:

Okay, so I was going to start using xl_heap_prune for vacuum here too,
but I realized it would be bigger because of the
snapshotConflictHorizon. Do you think there is a non-terrible way to
make the snapshotConflictHorizon optional? Like with a flag?

Yeah, another flag would do the trick.

Okay, I've done this in attached v4 (including removing
XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the
"main chunk" of data available at replay regardless of whether or not
the record ended up including an FPI.

I made it its own sub-record (xlhp_conflict_horizon) less to help with
alignment (though we can use all the help we can get there) and more to
keep it from getting lost. When you look at heapam_xlog.h, you can see
what a XLOG_HEAP2_PRUNE record will contain starting with the
xl_heap_prune struct and then all the sub-record types.

I introduced a few sub-record types similar to what you suggested --
they help a bit with alignment, so I think they are worth keeping. There
are comments around them, but perhaps a larger diagram of the layout of
a the new XLOG_HEAP2_PRUNE record would be helpful.

I started doing this, but I can't find a way of laying out the diagram
that pgindent doesn't destroy. I thought I remember other diagrams in
the source code showing the layout of something (something with pages
somewhere?) that don't get messed up by pgindent, but I couldn't find
them.

See src/tools/pgindent/README, section "Cleaning up in case of failure or
ugly output":

/*----------
* Text here will not be touched by pgindent.
*/

Cool. This version doesn't include the spiffy drawing I promised yet.

Note that we don't provide WAL compatibility across major versions. You can
fully remove the old xl_heap_freeze_page format. (We should bump
XLOG_PAGE_MAGIC when this is committed though)

I've removed the xl_heap_freeze (and xl_heap_prune). I didn't bump
XLOG_PAGE_MAGIC.

On the point of removing the freeze-only code path from
heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
I realized that heap_pre_freeze_checks() was not being called in the
case that we decide to freeze because we emitted an FPI while setting
the hint bit. I've fixed that, however, I've done so by moving
heap_pre_freeze_checks() into the critical section. I think that is not
okay? I could move it earlier and not do call it when the hint bit FPI
leads us to freeze tuples. But, I think that would lead to us doing a
lot less validation of tuples being frozen when checksums are enabled.
Or, I could make two critical sections?

I found another approach and just do the pre-freeze checks if we are
considering freezing except for the FPI criteria.

Hmm, I think you can make this simpler if you use
XLogCheckBufferNeedsBackup() to check if the hint-bit update will generate a
full-page-image before entering the critical section. Like you did to check
if pruning will generate a full-page-image. I included that change in the
attached patches.

I used your proposed structure. You had XLogCheckBufferNeedsBackup()
twice in your proposed version a few lines apart. I don't think there is
any point in checking it twice. If we are going to rely on
XLogCheckBufferNeedsBackup() to tell us whether or not setting the hint
bit is *likely* to emit an FPI, then we might as well just call
XLogCheckBufferNeedsBackup() once.

I don't fully understand this:

/*
* If we will freeze tuples on the page or they were all already frozen
* on the page, if we will set the page all-frozen in the visibility map,
* we can advance relfrozenxid and relminmxid to the values in
* pagefrz->FreezePageRelfrozenXid and pagefrz->FreezePageRelminMxid.
*/
if (presult->all_frozen || presult->nfrozen > 0)
{
presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
}
else
{
presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
}

Firstly, the comment is outdated, because we have already done the freezing
at this point. But more importantly, I don't understand the logic here.
Could we check just presult->nfrozen > 0 here and get the same result?

I've updated the comment. The reason I had
if (presult->all_frozen || presult->nfrozen > 0) was because of this
comment in heapam.h in the HeapPageFreeze struct

* "Freeze" NewRelfrozenXid/NewRelminMxid trackers.
*
* Trackers used when heap_freeze_execute_prepared freezes, or when there
* are zero freeze plans for a page. It is always valid for vacuumlazy.c
* to freeze any page, by definition. This even includes pages that have
* no tuples with storage to consider in the first place. That way the
* 'totally_frozen' results from heap_prepare_freeze_tuple can always be
* used in the same way, even when no freeze plans need to be executed to
* "freeze the page". Only the "freeze" path needs to consider the need
* to set pages all-frozen in the visibility map under this scheme.
*
* When we freeze a page, we generally freeze all XIDs < OldestXmin, only
* leaving behind XIDs that are ineligible for freezing, if any. And so
* you might wonder why these trackers are necessary at all; why should
* _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
* ratchet back the top-level NewRelfrozenXid/NewRelminMxid trackers?
*
* It is useful to use a definition of "freeze the page" that does not
* overspecify how MultiXacts are affected. heap_prepare_freeze_tuple
* generally prefers to remove Multis eagerly, but lazy processing is used
* in cases where laziness allows VACUUM to avoid allocating a new Multi.
* The "freeze the page" trackers enable this flexibility.
*/

So, I don't really know if it is right to just check presult->nfrozen >
0 when updating relminmxid. I have changed it to the way you suggested.
But we can change it back.

Here are some patches on top of your patches for some further refactorings.
Some notable changes in heap_page_prune_and_freeze():

- I moved the heap_prepare_freeze_tuple() work from the 2nd loop to the 1st
one. It seems more clear and efficient that way.

cool. I kept this.

- I extracted the code to generate the WAL record to a separate function.

cool. kept this.

These patches are in a very rough WIP state, but I wanted to share early. I
haven't done much testing, and I'm not wedded to these changes, but I think
they make it more readable. Please include / squash in the patch set if you
agree with them.

I've squashed the changes into and across my nineteen patches :)

I cleaned up your sugestions a bit and made a few stylistic choices.

In this version, I also broke up the last couple commits which
streamlined the WAL record and eliminated XLOG_HEAP2_FREEZE/VACUUM and
redistributed those changes in a way that I thought made sense.

Now, the progression is that in one commit we merge the prune and freeze
record, eliminating the XLOG_HEAP2_FREEZE record. Then, in another
commit, we eliminate the XLOG_HEAP2_VACUUM record. Then a later commit
streamlines the new mega xl_heap_prune struct into the variable size
structure based on which modifications it includes.

From 622620a7875ae8c1626e9cd118156e0c734d44ed Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Sun, 17 Mar 2024 22:52:28 +0200
Subject: [PATCH v3heikki 1/4] Inline heap_frz_conflict_horizon() to the
caller.

FIXME: This frz_conflict_horizon business looks fishy to me. We have:
- local frz_conflict_horizon variable,
- presult->frz_conflict_horizon, and
- prstate.snapshotConflictHorizon

Yea, this is a mistake I made when I was rebasing some changes in. The
local variable is gone now.

should we really have all three, and what are the differences?

We do need both the prstate.snapshotConflictHorizon and the
presult->frz_conflict_horizon because the youngest freezable xmin will
often be different than the oldest removable xmax, so we have to track
both and take the most conservative one if we are both freezing and
pruning.

The third (local variable) one was an oops.

From 0219842487931f899abcf183c863c43270c098f0 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Sun, 17 Mar 2024 23:05:31 +0200
Subject: [PATCH v3heikki 2/4] Misc cleanup
---
src/backend/access/heap/pruneheap.c  | 16 +++++++---------
src/backend/access/heap/vacuumlazy.c |  7 ++++++-
2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8d3723faf3a..d3df7029667 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -433,7 +433,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* heap_page_prune_and_freeze(). We expect vistest will always make
* heap_page_prune_and_freeze() remove any deleted tuple whose xmax is <
* OldestXmin.  lazy_scan_prune must never become confused about whether a
-	 * tuple should be frozen or removed.  (In the future we might want to
+	 * tuple should be frozen or removed.
+	 * HEIKKI: Is such confusion possible anymore?
+	 * (In the future we might want to
* teach lazy_scan_prune to recompute vistest from time to time, to
* increase the number of dead tuples it can prune away.)

TBH, I don't really know what this comment is even saying. But
lazy_scan_prune() doesn't do any freezing anymore, so I removed this
sentence.

*/
@@ -1387,6 +1389,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* so we could deal with tuples that were DEAD to VACUUM, but nevertheless were
* left with storage after pruning.
*
+ * HEIKKI: does the above paragraph still make sense? We don't call
+ * HeapTupleSatisfiesVacuum() here anymore
+ *

Yea this whole comment definitely doesn't belong here anymore. Even
though we are calling HeapTupleSatisfiesVacuum() (from inside
heap_prune_satisifes_vacuum()) inside heap_page_prune_and_freeze(), the
comment really doesn't fit anywhere in there either. The comment is
describing a situation that is no longer possible. So describing a
situation that is no longer possible in a part of the code that it never
could have been possible does not make sense to me. I've removed the
comment.

* As of Postgres 17, we circumvent this problem altogether by reusing the
* result of heap_page_prune_and_freeze()'s visibility check. Without the
* second call to HeapTupleSatisfiesVacuum(), there is no new HTSV_Result and
--
2.39.2

From d72cebf13f9866112309883f72a382fc2cb57e17 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Sun, 17 Mar 2024 23:08:42 +0200
Subject: [PATCH v3heikki 3/4] Move work to the first loop
src/backend/access/heap/pruneheap.c | 141 ++++++++++------------------
1 file changed, 52 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index b3573bb628b..3541628799a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -78,9 +78,6 @@ static int	heap_prune_chain(Buffer buffer,
PruneState *prstate, PruneFreezeResult *presult);

static inline HTSV_Result htsv_get_valid_status(int status);
-static void prune_prepare_freeze_tuple(Page page, OffsetNumber offnum, PruneState *prstate,
-									   HeapPageFreeze *pagefrz, HeapTupleFreeze *frozen,
-									   PruneFreezeResult *presult);
static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
static void heap_prune_record_redirect(PruneState *prstate,
OffsetNumber offnum, OffsetNumber rdoffnum,
@@ -322,6 +319,16 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
/* for recovery conflicts */
presult->frz_conflict_horizon = InvalidTransactionId;

+	/*
+	 * We will update the VM after pruning, collecting LP_DEAD items, and
+	 * freezing tuples. Keep track of whether or not the page is all_visible
+	 * and all_frozen and use this information to update the VM. all_visible
+	 * implies lpdead_items == 0, but don't trust all_frozen result unless
+	 * all_visible is also set to true.
+	 */
+	/* HEIKKI: the caller updates the VM? not us */

I've updated this comment.

Other questions and notes on v4:

xl_heap_prune->flags is a uint8, but we are already using 7 of the bits.
Should we make it a unit16?

Eventually, I would like to avoid emitting a separate XLOG_HEAP2_VISIBLE
record for vacuum's first and second passes and just include the VM
update flags in the xl_heap_prune record. xl_heap_visible->flags is a
uint8. If we made xl_heap_prune->flags uint16, we could probably combine
them (though maybe we want other bits available). Also vacuum's second
pass doesn't set a snapshotConflictHorizon, so if we combined
xl_heap_visible and xl_heap_prune for vacuum we would end up saving even
more space (since vacuum sets xl_heap_visible->snapshotConflictHorizon
to InvalidXLogRecPtr).

A note on sub-record naming: I kept xl_heap_freeze_plan's name but
prefixed the other sub-records with xlhp. Do you think it is worth
renaming it (to xlhp_freeze_plan)? Also, should I change xlhp_freeze to
xlhp_freeze_page?

- Melanie

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#8)

Re: Combine Prune and Freeze records emitted by vacuum

On 20/03/2024 03:36, Melanie Plageman wrote:

On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote:

On 15/03/2024 02:56, Melanie Plageman wrote:

Okay, so I was going to start using xl_heap_prune for vacuum here too,
but I realized it would be bigger because of the
snapshotConflictHorizon. Do you think there is a non-terrible way to
make the snapshotConflictHorizon optional? Like with a flag?

Yeah, another flag would do the trick.

Okay, I've done this in attached v4 (including removing
XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the
"main chunk" of data available at replay regardless of whether or not
the record ended up including an FPI.

I made it its own sub-record (xlhp_conflict_horizon) less to help with
alignment (though we can use all the help we can get there) and more to
keep it from getting lost. When you look at heapam_xlog.h, you can see
what a XLOG_HEAP2_PRUNE record will contain starting with the
xl_heap_prune struct and then all the sub-record types.

Ok, now that I look at this, I wonder if we're being overly cautious
about the WAL size. We probably could just always include the snapshot
field, and set it to InvalidTransactionId and waste 4 bytes when it's
not needed. For the sake of simplicity. I don't feel strongly either way
though, the flag is pretty simple too.

xl_heap_prune->flags is a uint8, but we are already using 7 of the bits.
Should we make it a unit16?

It doesn't matter much either way. We can also make it larger when we
need more bits, there's no need make room for them them beforehand.

Eventually, I would like to avoid emitting a separate XLOG_HEAP2_VISIBLE
record for vacuum's first and second passes and just include the VM
update flags in the xl_heap_prune record. xl_heap_visible->flags is a
uint8. If we made xl_heap_prune->flags uint16, we could probably combine
them (though maybe we want other bits available). Also vacuum's second
pass doesn't set a snapshotConflictHorizon, so if we combined
xl_heap_visible and xl_heap_prune for vacuum we would end up saving even
more space (since vacuum sets xl_heap_visible->snapshotConflictHorizon
to InvalidXLogRecPtr).

Makes sense.

A note on sub-record naming: I kept xl_heap_freeze_plan's name but
prefixed the other sub-records with xlhp. Do you think it is worth
renaming it (to xlhp_freeze_plan)?

Yeah, perhaps.

Also, should I change xlhp_freeze to xlhp_freeze_page?

I renamed it to xlhp_freeze_plans, for some consistency with
xlhp_prune_items.

I realized that the WAL record format changes are pretty independent
from the rest of the patches. They could be applied before the rest.
Without the rest of the changes, we'll still write two WAL records per
page in vacuum, one to prune and another one to freeze, but it's another
meaningful incremental step. So I reshuffled the patches, so that the
WAL format is changed first, before the rest of the changes.

0001-0008: These are the WAL format changes. There's some comment
cleanup needed, but as far as the code goes, I think these are pretty
much ready to be squashed & committed.

0009-: The rest of the v4 patches, rebased over the WAL format changes.
I also added a few small commits for little cleanups that caught my eye,
let me know if you disagree with those.

--
Heikki Linnakangas
Neon (https://neon.tech)

#10

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#9)

Re: Combine Prune and Freeze records emitted by vacuum

On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:

On 20/03/2024 03:36, Melanie Plageman wrote:

On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote:

On 15/03/2024 02:56, Melanie Plageman wrote:

Okay, so I was going to start using xl_heap_prune for vacuum here too,
but I realized it would be bigger because of the
snapshotConflictHorizon. Do you think there is a non-terrible way to
make the snapshotConflictHorizon optional? Like with a flag?

Yeah, another flag would do the trick.

Okay, I've done this in attached v4 (including removing
XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the
"main chunk" of data available at replay regardless of whether or not
the record ended up including an FPI.

I made it its own sub-record (xlhp_conflict_horizon) less to help with
alignment (though we can use all the help we can get there) and more to
keep it from getting lost. When you look at heapam_xlog.h, you can see
what a XLOG_HEAP2_PRUNE record will contain starting with the
xl_heap_prune struct and then all the sub-record types.

Ok, now that I look at this, I wonder if we're being overly cautious about
the WAL size. We probably could just always include the snapshot field, and
set it to InvalidTransactionId and waste 4 bytes when it's not needed. For
the sake of simplicity. I don't feel strongly either way though, the flag is
pretty simple too.

That will mean that all vacuum records are at least 3 bytes bigger than
before -- which makes it somewhat less defensible to get rid of
xl_heap_vacuum.

That being said, I ended up doing an unaligned access when I
packed it and made it optional, so maybe it is less user-friendly.

But I also think that making it optional is more clear for vacuum which
will never use it.

I realized that the WAL record format changes are pretty independent from
the rest of the patches. They could be applied before the rest. Without the
rest of the changes, we'll still write two WAL records per page in vacuum,
one to prune and another one to freeze, but it's another meaningful
incremental step. So I reshuffled the patches, so that the WAL format is
changed first, before the rest of the changes.

Ah, great idea! That eliminates the issue of preliminary commits having
larger WAL records that then get streamlined.

0001-0008: These are the WAL format changes. There's some comment cleanup
needed, but as far as the code goes, I think these are pretty much ready to
be squashed & committed.

My review in this email is *only* for 0001-0008. I have not looked at
the rest yet.

From 06d5ff5349a8aa95cbfd06a8043fe503b7b1bf7b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 14:50:14 +0200
Subject: [PATCH v5 01/26] Merge prune, freeze and vacuum WAL record formats

The new combined WAL record is now used for pruning, freezing and 2nd
pass of vacuum. This is in preparation of changing vacuuming to write
a combined prune+freeze record per page, instead of separate two
records. The new WAL record format now supports that, but the code
still always writes separate records for pruning and freezing.

Attached patch changes-for-0001.patch has a bunch of updated comments --
especially for heapam_xlog.h (including my promised diagram). And a few
suggestions (mostly changes that I should have made before).

XXX I tried to lift-and-shift the code from v4 patch set as unchanged
as possible, for easier review, but some noteworthy changes:

In the final commit message, I think it is worth calling out that all of
those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and
shifted from one file to another. When I am reviewing a big diff, it is
always helpful to know where I need to review line-by-line.

- Instead of passing PruneState and PageFreezeResult to
log_heap_prune_and_freeze(), pass the arrays of frozen, redirected
et al offsets directly. That way, it can be called from other places.

good idea.

- moved heap_xlog_deserialize_prune_and_freeze() from xactdesc.c to
heapdesc.c. (Because that's clearly where it belongs)

From cd6cdaebb362b014733e99ecd868896caf0fb3aa Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 13:45:01 +0200
Subject: [PATCH v5 02/26] Keep the original numbers for existing WAL records

Doesn't matter much because the WAL format is not compatible across
major versions anyway. But still seems nice to keep the identifiers
unchanged when we can. (There's some precedence for this if you search
the git history for "is free, was").

sounds good.

From d3207bb557aa1d2868a50d357a06318a6c0cb5cd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 13:48:29 +0200
Subject: [PATCH v5 03/26] Rename record to XLOG_HEAP2_PRUNE_FREEZE

To clarify that it also freezes now, and to make it clear that it's
significantly different from the old XLOG_HEAP2_PRUNE format.

From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 13:49:59 +0200
Subject: [PATCH v5 04/26] 'nplans' is a pointer

I'm surprised the compiler didn't warn about this

oops. while looking at this, I noticed that the asserts I added that
nredirected > 0, ndead > 0, and nunused > 0 have the same problem.

---
src/backend/access/rmgrdesc/heapdesc.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 8b94c869faf..9ef8a745982 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -155,8 +155,7 @@ heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags,
cursor += sizeof(OffsetNumber) * *nunused;
}

- if (nplans > 0)

From 59f3f80f82ed7a63d86c991d0cb025e4cde2caec Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 13:36:41 +0200
Subject: [PATCH v5 06/26] Fix logging snapshot conflict horizon.

- it was accessed without proper alignment, which won't work on
architectures that are strict about alignment. Use memcpy.

wow, oops. thanks for fixing this!

- in heap_xlog_prune_freeze, the code tried to access the xid with
"(xlhp_conflict_horizon *) (xlrec + SizeOfHeapPrune);" But 'xlrec'
was "xl_heap_prune *" rather than "char *". That happened to work,
because sizeof(xl_heap_prune) == 1, but make it more robust by
adding a cast to char *.

good catch.

- remove xlhp_conflict_horizon and store a TransactionId directly. A
separate struct would make sense if we needed to store anything else
there, but for now it just seems like more code.

makes sense. I just want to make sure heapam_xlog.h makes it extra clear
that this is happening. I see your comment addition. If you combine it
with my comment additions in the attached patch for 0001, hopefully that
makes it clear enough.

From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 14:03:06 +0200
Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze().

XXX: This should be rewritten, but I tried to at least list some
important points.

Are you thinking that it needs to mention more things or that the things
it mentions need more detail?

From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 14:53:31 +0200
Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()

Mostly to make local variables more tightly-scoped.

So, I don't think you can move those sub-records into the tighter scope.
If you run tests with this applied, you'll see it crashes and a lot of
the data in the record is wrong. If you move the sub-record declarations
out to a wider scope, the tests pass.

The WAL record data isn't actually copied into the buffer until

recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);

after registering everything.
So all of those sub-records you made are out of scope the time it tries
to copy from them.

I spent the better part of a day last week trying to figure out what was
happening after I did the exact same thing. I must say that I found the
xloginsert API incredibly unhelpful on this point.

I would like to review the rest of the suggested changes in this patch
after we fix the issue I just mentioned.

- Melanie

#11

pg@bowt.ie

about 2 years ago

In reply to: Heikki Linnakangas (#9)

Re: Combine Prune and Freeze records emitted by vacuum

On Wed, Mar 20, 2024 at 9:15 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I made it its own sub-record (xlhp_conflict_horizon) less to help with
alignment (though we can use all the help we can get there) and more to
keep it from getting lost. When you look at heapam_xlog.h, you can see
what a XLOG_HEAP2_PRUNE record will contain starting with the
xl_heap_prune struct and then all the sub-record types.

Ok, now that I look at this, I wonder if we're being overly cautious
about the WAL size. We probably could just always include the snapshot
field, and set it to InvalidTransactionId and waste 4 bytes when it's
not needed. For the sake of simplicity. I don't feel strongly either way
though, the flag is pretty simple too.

What about the issue of cleanup locks, which aren't needed and aren't
taken with the current heapam VACUUM record type? Will you preserve
that aspect of the existing design?

--
Peter Geoghegan

#12

melanieplageman@gmail.com

about 2 years ago

In reply to: Peter Geoghegan (#11)

Re: Combine Prune and Freeze records emitted by vacuum

On Wed, Mar 20, 2024 at 4:04 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Mar 20, 2024 at 9:15 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I made it its own sub-record (xlhp_conflict_horizon) less to help with
alignment (though we can use all the help we can get there) and more to
keep it from getting lost. When you look at heapam_xlog.h, you can see
what a XLOG_HEAP2_PRUNE record will contain starting with the
xl_heap_prune struct and then all the sub-record types.

Ok, now that I look at this, I wonder if we're being overly cautious
about the WAL size. We probably could just always include the snapshot
field, and set it to InvalidTransactionId and waste 4 bytes when it's
not needed. For the sake of simplicity. I don't feel strongly either way
though, the flag is pretty simple too.

What about the issue of cleanup locks, which aren't needed and aren't
taken with the current heapam VACUUM record type? Will you preserve
that aspect of the existing design?

Yep, we have a flag to indicate whether or not a cleanup lock is needed.

- Melanie

#13

pg@bowt.ie

about 2 years ago

In reply to: Melanie Plageman (#12)

Re: Combine Prune and Freeze records emitted by vacuum

On Wed, Mar 20, 2024 at 4:06 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

What about the issue of cleanup locks, which aren't needed and aren't
taken with the current heapam VACUUM record type? Will you preserve
that aspect of the existing design?

Yep, we have a flag to indicate whether or not a cleanup lock is needed.

Thanks for confirming.

I realize that this is fairly obvious, but thought it better to ask.

--
Peter Geoghegan

#14

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#9)

Re: Combine Prune and Freeze records emitted by vacuum

On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:

0009-: The rest of the v4 patches, rebased over the WAL format changes. I
also added a few small commits for little cleanups that caught my eye, let
me know if you disagree with those.

This review is just of the patches containing changes you made in
0009-0026.

From d36138b5bf0a93557273b5e47f8cd5ea089057c7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 11:47:42 +0200
Subject: [PATCH v5 13/26] still use a local 'cutoffs' variable

Given how often 'cutoffs' is used in the function, I think it still
makes sense to have a local variable for it, just to keep the source
lines shorter.

Works for me.

From 913617ed98cfddd678a6f620db7dee68d1d61c89 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 10:51:13 +0200
Subject: [PATCH v5 21/26] move whole_page_freezable to tighter scope

Works for me.

From e2b50f9b64f7e4255f4f764e2a348e1b446573dc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 11:43:31 +0200
Subject: [PATCH v5 22/26] make 'all_visible_except_removable' local

The caller doesn't need it, so it doesn't belong in PruneFreezeResult

Makes sense to me.

From e993e0d98cd0ef1ecbd229f6ddbe23d59a427e3a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 11:40:34 +0200
Subject: [PATCH v5 26/26] reorder PruneFreezeResult fields

---
src/include/access/heapam.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ee0eca8ae02..b2015f5a1ac 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -202,14 +202,17 @@ typedef struct PruneFreezeResult
int			recently_dead_tuples;
int			ndeleted;		/* Number of tuples deleted from the page */
int			nnewlpdead;		/* Number of newly LP_DEAD items */
+	int			nfrozen;

Let's add a comment after int nfrozen like
/* Number of newly frozen tuples */

+
bool all_visible; /* Whether or not the page is all visible */
bool hastup; /* Does page make rel truncation unsafe */

+ /* The following fields are only set if freezing */

So, all_frozen will be set correctly if we are even considering freezing
(if pagefrz is passed). all_frozen will be true even if we didn't freeze
anything if the page is all-frozen and can be set as such in the VM. And
it will be false if the page is not all-frozen. So, maybe we say
"considering freezing".

But, I'm glad you thought to call out which of these fields will make
sense to the caller.

Also, maybe we should just name the members to which this applies. It is
a bit confusing that I can't tell if the comment also refers to the
other members following it (lpdead_items and deadoffsets) -- which it
doesn't.

+
/* Whether or not the page can be set all frozen in the VM */
bool all_frozen;

/* Number of newly frozen tuples */
- int nfrozen;
TransactionId frz_conflict_horizon; /* Newest xmin on the page */

/* New value of relfrozenxid found by heap_page_prune_and_freeze() */
--
2.39.2

- Melanie

#15

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#14)

Re: Combine Prune and Freeze records emitted by vacuum

On 20/03/2024 23:03, Melanie Plageman wrote:

On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ee0eca8ae02..b2015f5a1ac 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -202,14 +202,17 @@ typedef struct PruneFreezeResult
int			recently_dead_tuples;
int			ndeleted;		/* Number of tuples deleted from the page */
int			nnewlpdead;		/* Number of newly LP_DEAD items */
+	int			nfrozen;
Let's add a comment after int nfrozen like
/* Number of newly frozen tuples */

+
bool all_visible; /* Whether or not the page is all visible */
bool hastup; /* Does page make rel truncation unsafe */

+ /* The following fields are only set if freezing */

So, all_frozen will be set correctly if we are even considering freezing
(if pagefrz is passed). all_frozen will be true even if we didn't freeze
anything if the page is all-frozen and can be set as such in the VM. And
it will be false if the page is not all-frozen. So, maybe we say
"considering freezing".

But, I'm glad you thought to call out which of these fields will make
sense to the caller.

Also, maybe we should just name the members to which this applies. It is
a bit confusing that I can't tell if the comment also refers to the
other members following it (lpdead_items and deadoffsets) -- which it
doesn't.

Right, sorry, I spotted the general issue that it's not clear which
fields are valid when. I added that comment to remind about that, but I
then forgot about it.

In heap_page_prune_and_freeze(), we now do some extra work on each live
tuple, to set the all_visible_except_removable correctly. And also to
update live_tuples, recently_dead_tuples and hastup. When we're not
freezing, that's a waste of cycles, the caller doesn't care. I hope it's
enough that it doesn't matter, but is it?

The first commit (after the WAL format changes) changes the all-visible
check to use GlobalVisTestIsRemovableXid. The commit message says that
it's because we don't have 'cutoffs' available, but we only care about
that when we're freezing, and when we're freezing, we actually do have
'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems
sensible anyway, because that's what we use in
heap_prune_satisfies_vacuum() too, but just wanted to point that out.

The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
these patches's fault). This at least is wrong, because Max(a, b)
doesn't handle XID wraparound correctly:

if (do_freeze)
conflict_xid = Max(prstate.snapshotConflictHorizon,
presult->frz_conflict_horizon);
else
conflict_xid = prstate.snapshotConflictHorizon;

Then there's this in lazy_scan_prune():

/* Using same cutoff when setting VM is now unnecessary */
if (presult.all_frozen)
presult.frz_conflict_horizon = InvalidTransactionId;

This does the right thing in the end, but if all the tuples are frozen
shouldn't frz_conflict_horizon already be InvalidTransactionId? The
comment says it's "newest xmin on the page", and if everything was
frozen, all xmins are FrozenTransactionId. In other words, that should
be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
caller. Also, frz_conflict_horizon is only set correctly if
'all_frozen==true', would be good to mention that in the comments too.

--
Heikki Linnakangas
Neon (https://neon.tech)

#16

[1]: /messages/by-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com
/messages/by-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#10)

Re: Combine Prune and Freeze records emitted by vacuum

On 20/03/2024 21:17, Melanie Plageman wrote:

Attached patch changes-for-0001.patch has a bunch of updated comments --
especially for heapam_xlog.h (including my promised diagram). And a few
suggestions (mostly changes that I should have made before).

New version of these WAL format changes attached. Squashed to one patch now.

+	// TODO: should we avoid this if we only froze? heap_xlog_freeze() doesn't
+	// do it

Ah yes, that makes sense, did that.

In the final commit message, I think it is worth calling out that all of
those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and
shifted from one file to another. When I am reviewing a big diff, it is
always helpful to know where I need to review line-by-line.

Done.

From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 13:49:59 +0200
Subject: [PATCH v5 04/26] 'nplans' is a pointer

I'm surprised the compiler didn't warn about this

oops. while looking at this, I noticed that the asserts I added that
nredirected > 0, ndead > 0, and nunused > 0 have the same problem.

Good catch, fixed.

- remove xlhp_conflict_horizon and store a TransactionId directly. A
separate struct would make sense if we needed to store anything else
there, but for now it just seems like more code.

makes sense. I just want to make sure heapam_xlog.h makes it extra clear
that this is happening. I see your comment addition. If you combine it
with my comment additions in the attached patch for 0001, hopefully that
makes it clear enough.

Thanks. I spent more time on the comments throughout the patch. And one
notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with
XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a
cleanup lock to replay the record. It must always be set when
XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying
those always needs a cleanup lock. That felt easier to document and
understand than XLHP_LP_TRUNCATE_ONLY.

From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 14:03:06 +0200
Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze().

XXX: This should be rewritten, but I tried to at least list some
important points.

Are you thinking that it needs to mention more things or that the things
it mentions need more detail?

I previously just quickly jotted down things that seemed worth
mentioning in the comment. It was not so bad actually, but I reworded it
a little.

From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 14:53:31 +0200
Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()

Mostly to make local variables more tightly-scoped.

So, I don't think you can move those sub-records into the tighter scope.
If you run tests with this applied, you'll see it crashes and a lot of
the data in the record is wrong. If you move the sub-record declarations
out to a wider scope, the tests pass.

The WAL record data isn't actually copied into the buffer until

recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);

after registering everything.
So all of those sub-records you made are out of scope the time it tries
to copy from them.

I spent the better part of a day last week trying to figure out what was
happening after I did the exact same thing. I must say that I found the
xloginsert API incredibly unhelpful on this point.

Oops. I had that in mind and that was actually why I moved the
XLogRegisterData() call to the end of the function, because I found it
confusing to register the struct before filling it in completely, even
though it works perfectly fine. But then I missed it anyway when I moved
the local variables. I added a brief comment on that.

I would like to review the rest of the suggested changes in this patch
after we fix the issue I just mentioned.

Thanks, review is appreciated. I feel this is ready now, so barring any
big new issues, I plan to commit this early next week.

There is another patch in the commitfest that touches this area:
"Recording whether Heap2/PRUNE records are from VACUUM or from
opportunistic pruning" [1]/messages/by-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com. That actually goes in the opposite direction
than this patch. That patch wants to add more information, to show
whether a record was emitted by VACUUM or on-access pruning, while this
patch makes the freezing and VACUUM's 2nd phase records also look the
same. We could easily add more flags to xl_heap_prune to distinguish
them. Or assign different xl_info code for them, like that other patch
proposed. But I don't think that needs to block this patch, that can be
added as a separate patch.

--
Heikki Linnakangas
Neon (https://neon.tech)

#17

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#15)

Re: Combine Prune and Freeze records emitted by vacuum

On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

In heap_page_prune_and_freeze(), we now do some extra work on each live
tuple, to set the all_visible_except_removable correctly. And also to
update live_tuples, recently_dead_tuples and hastup. When we're not
freezing, that's a waste of cycles, the caller doesn't care. I hope it's
enough that it doesn't matter, but is it?

Last year on an early version of the patch set I did some pgbench
tpcb-like benchmarks -- since there is a lot of on-access pruning in
that workload -- and I don't remember it being a showstopper. The code
has changed a fair bit since then. However, I think it might be safer
to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
the rest of the loop after heap_prune_satisifies_vacuum() when
on-access pruning invokes it. I had avoided that because it felt ugly
and error-prone, however it addresses a few other of your points as
well.

For example, I think we should set a bit in the prune/freeze WAL
record's flags to indicate if pruning was done by vacuum or on access
(mentioned in another of your recent emails).

The first commit (after the WAL format changes) changes the all-visible
check to use GlobalVisTestIsRemovableXid. The commit message says that
it's because we don't have 'cutoffs' available, but we only care about
that when we're freezing, and when we're freezing, we actually do have
'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems
sensible anyway, because that's what we use in
heap_prune_satisfies_vacuum() too, but just wanted to point that out.

Yes, this is true. If we skip this part of the loop when on-access
pruning invokes it, we can go back to using the OldestXmin. I have
done that as well as some other changes in the attached patch,
conflict_horizon_updates.diff. Note that this patch may not apply on
your latest patch as I wrote it on top of an older version. Switching
back to using OldestXmin for page visibility determination makes this
patch set more similar to master as well. We could keep the
alternative check (with GlobalVisState) to maintain the illusion that
callers passing by_vacuum as True can pass NULL for pagefrz, but I was
worried about the extra overhead.

It would be nice to pick a single reasonable visibility horizon (the
oldest running xid we compare things to) at the beginning of
heap_page_prune_and_freeze() and use it for determining if we can
remove tuples, if we can freeze tuples, and if the page is all
visible. It makes it confusing that we use OldestXmin for freezing and
setting the page visibility in the VM and GlobalVisState for
determining prunability. I think using GlobalVisState for freezing has
been discussed before and ruled out for a variety of reasons, and I
definitely don't suggest making that change in this patch set.

The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
these patches's fault). This at least is wrong, because Max(a, b)
doesn't handle XID wraparound correctly:

if (do_freeze)
conflict_xid = Max(prstate.snapshotConflictHorizon,
presult->frz_conflict_horizon);
else
conflict_xid = prstate.snapshotConflictHorizon;

Then there's this in lazy_scan_prune():

/* Using same cutoff when setting VM is now unnecessary */
if (presult.all_frozen)
presult.frz_conflict_horizon = InvalidTransactionId;

This does the right thing in the end, but if all the tuples are frozen
shouldn't frz_conflict_horizon already be InvalidTransactionId? The
comment says it's "newest xmin on the page", and if everything was
frozen, all xmins are FrozenTransactionId. In other words, that should
be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
caller. Also, frz_conflict_horizon is only set correctly if
'all_frozen==true', would be good to mention that in the comments too.

Yes, this is a good point. I've spent some time swapping all of this
back into my head. I think we should change the names of all these
conflict horizon variables and introduce some local variables again.
In the attached patch, I've updated the name of the variable in
PruneFreezeResult to vm_conflict_horizon, as it is only used for
emitting a VM update record. Now, I don't set it until the end of
heap_page_prune_and_freeze(). It is only updated from
InvalidTransactionId if the page is not all frozen. As you say, if the
page is all frozen, there can be no conflict.

I've also changed PruneState->snapshotConflictHorizon to
PruneState->latest_xid_removed.

And I introduced the local variables visibility_cutoff_xid and
frz_conflict_horizon. I think it is important we distinguish between
the latest xid pruned, the latest xmin of tuples frozen, and the
latest xid of all live tuples on the page.

Though we end up using visibility_cutoff_xid as the freeze conflict
horizon if the page is all frozen, I think it is more clear to have
the three variables and name them what they are. Then, we calculate
the correct one for the combined WAL record before emitting it. I've
done that in the attached. I've tried to reduce the scope of the
variables as much as possible to keep it as clear as I could.

I think I've also fixed the issue with using Max() to compare
TransactionIds by using TransactionIdFollows() instead.

Note that I still don't think we have a resolution on what to
correctly update new_relfrozenxid and new_relminmxid to at the end
when presult->nfrozen == 0 and presult->all_frozen is true.

if (presult->nfrozen > 0)
{
presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
}
else
{
presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
}

- Melanie

#18

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#16)

Re: Combine Prune and Freeze records emitted by vacuum

On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote:

On 20/03/2024 21:17, Melanie Plageman wrote:

Attached patch changes-for-0001.patch has a bunch of updated comments --
especially for heapam_xlog.h (including my promised diagram). And a few
suggestions (mostly changes that I should have made before).

New version of these WAL format changes attached. Squashed to one patch now.
I spent more time on the comments throughout the patch. And one
notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with
XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a
cleanup lock to replay the record. It must always be set when
XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying those
always needs a cleanup lock. That felt easier to document and understand
than XLHP_LP_TRUNCATE_ONLY.

Makes sense to me.

From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2024 14:53:31 +0200
Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()

Mostly to make local variables more tightly-scoped.

So, I don't think you can move those sub-records into the tighter scope.
If you run tests with this applied, you'll see it crashes and a lot of
the data in the record is wrong. If you move the sub-record declarations
out to a wider scope, the tests pass.

The WAL record data isn't actually copied into the buffer until

recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);

after registering everything.
So all of those sub-records you made are out of scope the time it tries
to copy from them.

I spent the better part of a day last week trying to figure out what was
happening after I did the exact same thing. I must say that I found the
xloginsert API incredibly unhelpful on this point.

Oops. I had that in mind and that was actually why I moved the
XLogRegisterData() call to the end of the function, because I found it
confusing to register the struct before filling it in completely, even
though it works perfectly fine. But then I missed it anyway when I moved the
local variables. I added a brief comment on that.

Comment was a good idea.

There is another patch in the commitfest that touches this area: "Recording
whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning"
[1]. That actually goes in the opposite direction than this patch. That
patch wants to add more information, to show whether a record was emitted by
VACUUM or on-access pruning, while this patch makes the freezing and
VACUUM's 2nd phase records also look the same. We could easily add more
flags to xl_heap_prune to distinguish them. Or assign different xl_info code
for them, like that other patch proposed. But I don't think that needs to
block this patch, that can be added as a separate patch.

[1] /messages/by-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com

I think it would be very helpful to distinguish amongst vacuum pass 1,
2, and on-access pruning. I often want to know what did most of the
pruning -- and I could see also wanting to know if the first or second
vacuum pass was responsible for removing the tuples. I agree it could be
done separately, but it would be very helpful to have as soon as
possible now that the record type will be the same for all three.

From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 22 Mar 2024 23:10:22 +0200
Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats
/*
- * Handles XLOG_HEAP2_PRUNE record type.
- *
- * Acquires a full cleanup lock.
+ * Replay XLOG_HEAP2_PRUNE_FREEZE record.
*/
static void
-heap_xlog_prune(XLogReaderState *record)
+heap_xlog_prune_freeze(XLogReaderState *record)
{
XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
+	char	   *ptr;
+	xl_heap_prune *xlrec;
Buffer		buffer;
RelFileLocator rlocator;
BlockNumber blkno;
XLogRedoAction action;
XLogRecGetBlockTag(record, 0, &rlocator, NULL, &blkno);
+ ptr = XLogRecGetData(record);

I don't love having two different pointers that alias each other and we
don't know which one is for what. Perhaps we could memcpy xlrec like in
my attached diff (log_updates.diff). It also might perform slightly
better than accessing flags through a xl_heap_prune
* -- since it wouldn't be doing pointer dereferencing.

+ xlrec = (xl_heap_prune *) ptr;
+ ptr += SizeOfHeapPrune;

/*
-	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
-	 * no queries running for which the removed tuples are still visible.
+	 * We will take an ordinary exclusive lock or a cleanup lock depending on
+	 * whether the XLHP_CLEANUP_LOCK flag is set.  With an ordinary exclusive
+	 * lock, we better not be doing anything that requires moving existing
+	 * tuple data.
*/
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
-											xlrec->isCatalogRel,
+	Assert((xlrec->flags & XLHP_CLEANUP_LOCK) != 0 ||
+		   (xlrec->flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
+
+	/*
+	 * We are about to remove and/or freeze tuples.  In Hot Standby mode,
+	 * ensure that there are no queries running for which the removed tuples
+	 * are still visible or which still consider the frozen xids as running.
+	 * The conflict horizon XID comes after xl_heap_prune.
+	 */
+	if (InHotStandby && (xlrec->flags & XLHP_HAS_CONFLICT_HORIZON) != 0)
+	{

My attached patch has a TODO here for the comment. It sticks out that
the serialization and deserialization conditions are different for the
snapshot conflict horizon. We don't deserialize if InHotStandby is
false. That's still correct now because we don't write anything else to
the main data chunk afterward. But, if someone were to add another
member after snapshot_conflict_horizon, they would want to know to
deserialize snapshot_conflict_horizon first even if InHotStandby is
false.

+		TransactionId snapshot_conflict_horizon;
+

I would throw a comment in about the memcpy being required because the
snapshot_conflict_horizon is in the buffer unaligned.

+		memcpy(&snapshot_conflict_horizon, ptr, sizeof(TransactionId));
+		ResolveRecoveryConflictWithSnapshot(snapshot_conflict_horizon,
+											(xlrec->flags & XLHP_IS_CATALOG_REL) != 0,
rlocator);
+	}

+
+/*
+ * Given a MAXALIGNed buffer returned by XLogRecGetBlockData() and pointed to
+ * by cursor and any xl_heap_prune flags, deserialize the arrays of
+ * OffsetNumbers contained in an XLOG_HEAP2_PRUNE_FREEZE record.
+ *
+ * This is in heapdesc.c so it can be shared between heap2_redo and heap2_desc
+ * code, the latter of which is used in frontend (pg_waldump) code.
+ */
+void
+heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags,
+									   int *nredirected, OffsetNumber **redirected,
+									   int *ndead, OffsetNumber **nowdead,
+									   int *nunused, OffsetNumber **nowunused,
+									   int *nplans, xlhp_freeze_plan **plans,
+									   OffsetNumber **frz_offsets)
+{
+	if (flags & XLHP_HAS_FREEZE_PLANS)
+	{
+		xlhp_freeze_plans *freeze_plans = (xlhp_freeze_plans *) cursor;
+
+		*nplans = freeze_plans->nplans;
+		Assert(*nplans > 0);
+		*plans = freeze_plans->plans;
+
+		cursor += offsetof(xlhp_freeze_plans, plans);
+		cursor += sizeof(xlhp_freeze_plan) * *nplans;
+	}

I noticed you decided to set these in the else statements. Is that to
emphasize that it is important to correctness that they be properly
initialized?

+	else
+	{
+		*nplans = 0;
+		*plans = NULL;
+	}
+

Thanks again for all your work on this set!

- Melanie

#19

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#18)

Re: Combine Prune and Freeze records emitted by vacuum

On 24/03/2024 21:55, Melanie Plageman wrote:

On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote:

On 20/03/2024 21:17, Melanie Plageman wrote:
There is another patch in the commitfest that touches this area: "Recording
whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning"
[1]. That actually goes in the opposite direction than this patch. That
patch wants to add more information, to show whether a record was emitted by
VACUUM or on-access pruning, while this patch makes the freezing and
VACUUM's 2nd phase records also look the same. We could easily add more
flags to xl_heap_prune to distinguish them. Or assign different xl_info code
for them, like that other patch proposed. But I don't think that needs to
block this patch, that can be added as a separate patch.

[1] /messages/by-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com

I think it would be very helpful to distinguish amongst vacuum pass 1,
2, and on-access pruning. I often want to know what did most of the
pruning -- and I could see also wanting to know if the first or second
vacuum pass was responsible for removing the tuples. I agree it could be
done separately, but it would be very helpful to have as soon as
possible now that the record type will be the same for all three.

Ok, I used separate 'info' codes for records generated on on-access
pruning and vacuum's 1st and 2nd pass. Similar to Peter's patch on that
other thread, except that I didn't reserve the whole high bit for this,
but used three different 'info' codes. Freezing uses the same
XLOG_HEAP2_PRUNE_VACUUM_SCAN as the pruning in vacuum's 1st pass. You
can distinguish them based on whether the record has nfrozen > 0, and
with the rest of the patches, they will be merged anyway.

From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 22 Mar 2024 23:10:22 +0200
Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats
/*
- * Handles XLOG_HEAP2_PRUNE record type.
- *
- * Acquires a full cleanup lock.
+ * Replay XLOG_HEAP2_PRUNE_FREEZE record.
*/
static void
-heap_xlog_prune(XLogReaderState *record)
+heap_xlog_prune_freeze(XLogReaderState *record)
{
XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
+	char	   *ptr;
+	xl_heap_prune *xlrec;
Buffer		buffer;
RelFileLocator rlocator;
BlockNumber blkno;
XLogRedoAction action;
XLogRecGetBlockTag(record, 0, &rlocator, NULL, &blkno);
+ ptr = XLogRecGetData(record);
I don't love having two different pointers that alias each other and we
don't know which one is for what. Perhaps we could memcpy xlrec like in
my attached diff (log_updates.diff). It also might perform slightly
better than accessing flags through a xl_heap_prune
* -- since it wouldn't be doing pointer dereferencing.

Ok.

/*
-	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
-	 * no queries running for which the removed tuples are still visible.
+	 * We will take an ordinary exclusive lock or a cleanup lock depending on
+	 * whether the XLHP_CLEANUP_LOCK flag is set.  With an ordinary exclusive
+	 * lock, we better not be doing anything that requires moving existing
+	 * tuple data.
*/
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
-											xlrec->isCatalogRel,
+	Assert((xlrec->flags & XLHP_CLEANUP_LOCK) != 0 ||
+		   (xlrec->flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
+
+	/*
+	 * We are about to remove and/or freeze tuples.  In Hot Standby mode,
+	 * ensure that there are no queries running for which the removed tuples
+	 * are still visible or which still consider the frozen xids as running.
+	 * The conflict horizon XID comes after xl_heap_prune.
+	 */
+	if (InHotStandby && (xlrec->flags & XLHP_HAS_CONFLICT_HORIZON) != 0)
+	{
My attached patch has a TODO here for the comment. It sticks out that
the serialization and deserialization conditions are different for the
snapshot conflict horizon. We don't deserialize if InHotStandby is
false. That's still correct now because we don't write anything else to
the main data chunk afterward. But, if someone were to add another
member after snapshot_conflict_horizon, they would want to know to
deserialize snapshot_conflict_horizon first even if InHotStandby is
false.

Good point. Fixed so that 'snapshot_conflict_horizon' is deserialized
even if !InHotStandby. A memcpy is cheap, and is probably optimized away
by the compiler anyway.

+		TransactionId snapshot_conflict_horizon;
+
I would throw a comment in about the memcpy being required because the
snapshot_conflict_horizon is in the buffer unaligned.

Added.

+/*
+ * Given a MAXALIGNed buffer returned by XLogRecGetBlockData() and pointed to
+ * by cursor and any xl_heap_prune flags, deserialize the arrays of
+ * OffsetNumbers contained in an XLOG_HEAP2_PRUNE_FREEZE record.
+ *
+ * This is in heapdesc.c so it can be shared between heap2_redo and heap2_desc
+ * code, the latter of which is used in frontend (pg_waldump) code.
+ */
+void
+heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags,
+									   int *nredirected, OffsetNumber **redirected,
+									   int *ndead, OffsetNumber **nowdead,
+									   int *nunused, OffsetNumber **nowunused,
+									   int *nplans, xlhp_freeze_plan **plans,
+									   OffsetNumber **frz_offsets)
+{
+	if (flags & XLHP_HAS_FREEZE_PLANS)
+	{
+		xlhp_freeze_plans *freeze_plans = (xlhp_freeze_plans *) cursor;
+
+		*nplans = freeze_plans->nplans;
+		Assert(*nplans > 0);
+		*plans = freeze_plans->plans;
+
+		cursor += offsetof(xlhp_freeze_plans, plans);
+		cursor += sizeof(xlhp_freeze_plan) * *nplans;
+	}

I noticed you decided to set these in the else statements. Is that to
emphasize that it is important to correctness that they be properly
initialized?

The point was to always initialize *nplans et al in the function. You
required the caller to initialize them to zero, but that seems error-prone.

I made one more last minute change: I changed the order of the array
arguments in heap_xlog_deserialize_prune_and_freeze() to match the order
in log_heap_prune_and_freeze().

Committed with the above little changes. Thank you! Now, back to the
rest of the patches :-).

--
Heikki Linnakangas
Neon (https://neon.tech)

#20

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#17)

Re: Combine Prune and Freeze records emitted by vacuum

On 24/03/2024 18:32, Melanie Plageman wrote:

On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

In heap_page_prune_and_freeze(), we now do some extra work on each live
tuple, to set the all_visible_except_removable correctly. And also to
update live_tuples, recently_dead_tuples and hastup. When we're not
freezing, that's a waste of cycles, the caller doesn't care. I hope it's
enough that it doesn't matter, but is it?

Last year on an early version of the patch set I did some pgbench
tpcb-like benchmarks -- since there is a lot of on-access pruning in
that workload -- and I don't remember it being a showstopper. The code
has changed a fair bit since then. However, I think it might be safer
to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
the rest of the loop after heap_prune_satisifies_vacuum() when
on-access pruning invokes it. I had avoided that because it felt ugly
and error-prone, however it addresses a few other of your points as
well.

Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the
argument described what it does, rather than who it's for. For example,
'need_all_visible'. If set to true, the function determines
'all_visible', otherwise it does not.

I started to look closer at the loops in heap_prune_chain() and how they
update all the various flags and counters. There's a lot going on there.
We have:

- live_tuples counter
- recently_dead_tuples counter
- all_visible[_except_removable]
- all_frozen
- visibility_cutoff_xid
- hastup
- prstate.frozen array
- nnewlpdead
- deadoffsets array

And that doesn't even include all the local variables and the final
dead/redirected arrays.

Some of those are set in the first loop that initializes 'htsv' for each
tuple on the page. Others are updated in heap_prune_chain(). Some are
updated in both. It's hard to follow which are set where.

I think recently_dead_tuples is updated incorrectly, for tuples that are
part of a completely dead HOT chain. For example, imagine a hot chain
with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow
the chain, see the DEAD tuple at the end of the chain, and mark both
tuples for pruning. However, we already updated 'recently_dead_tuples'
in the first loop, which is wrong if we remove the tuple.

Maybe that's the only bug like this, but I'm a little scared. Is there
something we could do to make this simpler? Maybe move all the new work
that we added to the first loop, into heap_prune_chain() ? Maybe
introduce a few more helper heap_prune_record_*() functions, to update
the flags and counters also for live and insert/delete-in-progress
tuples and for dead line pointers? Something like
heap_prune_record_live() and heap_prune_record_lp_dead().

The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
these patches's fault). This at least is wrong, because Max(a, b)
doesn't handle XID wraparound correctly:

if (do_freeze)
conflict_xid = Max(prstate.snapshotConflictHorizon,
presult->frz_conflict_horizon);
else
conflict_xid = prstate.snapshotConflictHorizon;

Then there's this in lazy_scan_prune():

/* Using same cutoff when setting VM is now unnecessary */
if (presult.all_frozen)
presult.frz_conflict_horizon = InvalidTransactionId;

This does the right thing in the end, but if all the tuples are frozen
shouldn't frz_conflict_horizon already be InvalidTransactionId? The
comment says it's "newest xmin on the page", and if everything was
frozen, all xmins are FrozenTransactionId. In other words, that should
be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
caller. Also, frz_conflict_horizon is only set correctly if
'all_frozen==true', would be good to mention that in the comments too.

Yes, this is a good point. I've spent some time swapping all of this
back into my head. I think we should change the names of all these
conflict horizon variables and introduce some local variables again.
In the attached patch, I've updated the name of the variable in
PruneFreezeResult to vm_conflict_horizon, as it is only used for
emitting a VM update record. Now, I don't set it until the end of
heap_page_prune_and_freeze(). It is only updated from
InvalidTransactionId if the page is not all frozen. As you say, if the
page is all frozen, there can be no conflict.

Makes sense.

I've also changed PruneState->snapshotConflictHorizon to
PruneState->latest_xid_removed.

And I introduced the local variables visibility_cutoff_xid and
frz_conflict_horizon. I think it is important we distinguish between
the latest xid pruned, the latest xmin of tuples frozen, and the
latest xid of all live tuples on the page.

Though we end up using visibility_cutoff_xid as the freeze conflict
horizon if the page is all frozen, I think it is more clear to have
the three variables and name them what they are.

Agreed.

Note that I still don't think we have a resolution on what to
correctly update new_relfrozenxid and new_relminmxid to at the end
when presult->nfrozen == 0 and presult->all_frozen is true.

if (presult->nfrozen > 0)
{
presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
}
else
{
presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
}

One approach is to take them out of the PageFreezeResult struct again,
and pass them as pointers:

void
heap_page_prune_and_freeze(Relation relation, Buffer buffer,
...
TransactionId *new_relfrozenxid,
MultiXactId *new_relminmxid,
...
)

That would be natural for the caller too, as it wouldn't need to set up
the old values to HeapPageFreeze before each call, nor copy the new
values to 'vacrel' after the call. I'm thinking that we'd move the
responsibility of setting up HeapPageFreeze to
heap_page_prune_and_freeze(), instead of having the caller set it up. So
the caller would look something like this:

heap_page_prune_and_freeze(rel, buf, vacrel->vistest,
&vacrel->cutoffs, &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid,
&presult,
PRUNE_VACUUM_SCAN, flags,
&vacrel->offnum);

In this setup, heap_page_prune_and_freeze() would update
*new_relfrozenxid and *new_relminmxid when it has a new value for them,
and leave them unchanged otherwise.

--
Heikki Linnakangas
Neon (https://neon.tech)

#21

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#20)

#22

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#20)

#23

melanieplageman@gmail.com

about 2 years ago

In reply to: Melanie Plageman (#22)

#24

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#23)

#25

pg@bowt.ie

about 2 years ago

In reply to: Melanie Plageman (#8)

#26

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#24)

#27

melanieplageman@gmail.com

about 2 years ago

In reply to: Melanie Plageman (#26)

#28

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#26)

#29

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#28)

#30

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#29)

#31

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#30)

#32

melanieplageman@gmail.com

about 2 years ago

In reply to: Melanie Plageman (#31)

#33

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#32)

#34

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#33)

#35

melanieplageman@gmail.com

about 2 years ago

In reply to: Melanie Plageman (#34)

#36

Robert Haas

robertmhaas@gmail.com

about 2 years ago

In reply to: Melanie Plageman (#35)

#37

melanieplageman@gmail.com

about 2 years ago

In reply to: Robert Haas (#36)

#38

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#35)

#39

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#38)

#40

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#38)

#41

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#39)

#42

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#41)

#43

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#40)

#44

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#40)

#45

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#44)

#46

melanieplageman@gmail.com

about 2 years ago

In reply to: Melanie Plageman (#45)

#47

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Heikki Linnakangas (#44)

#48

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#47)

#49

heikki.linnakangas@enterprisedb.com

about 2 years ago

In reply to: Melanie Plageman (#48)

#50

melanieplageman@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#49)

#51