Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Started by Melanie Plageman over 1 year ago · 92 messages
#1 Melanie Plageman
melanieplageman@gmail.com

Hi,

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread.

On master, after 1ccc1e05ae removed the retry loop in
lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang
could no longer happen, but we can still attempt to freeze dead tuples
visibly killed before OldestXmin -- resulting in an ERROR.

Pruning may fail to remove dead tuples with xmax before OldestXmin if
the tuple is not considered removable by GlobalVisState.

For vacuum, the GlobalVisState is initially calculated at the
beginning of vacuuming the relation -- at the same time and with the
same value as VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated again when it is accessed if
a new snapshot is taken or if something caused ComputeXidHorizons() to
be called.

This can happen, for example, at the end of a round of index vacuuming
when GetOldestNonRemovableTransactionId() is called.

Normally this may result in GlobalVisState's horizon moving forward --
potentially allowing more dead tuples to be removed.

However, if a disconnected standby with a running transaction older
than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin but before
GlobalVisState is updated, the value of GlobalVisState->maybe_needed
could go backwards.

If this happens in the middle of vacuum's first pruning and freezing
pass, it is possible that pruning/freezing could then encounter a
tuple whose xmax is younger than GlobalVisState->maybe_needed and
older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum()
would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it.
But the heap_pre_freeze_checks() would ERROR out with "cannot freeze
committed xmax". This check is to avoid freezing dead tuples.

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.

If you want to run the repro with meson, you'll have to add
't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:

meson test postgresql:recovery / recovery/099_vacuum_hang

If you use autotools, you can run it with:
make check PROVE_TESTS="t/099_vacuum_hang.pl"

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Then vacuum's first pass will continue with pruning and find our later
inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.

I make sure that the standby reconnects between vacuum_get_cutoffs()
and pruning because I have a cursor on the page keeping VACUUM FREEZE
from getting a cleanup lock.

See the repro for step-by-step explanations of how it works.

I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

- Melanie

[1]: /messages/by-id/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
[2]: /messages/by-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag@mail.gmail.com

Attachments:

v1-0001-Repro-for-vacuum-ERROR-out-trying-to-freeze-old-d.patch (text/x-patch; +267 -1)
v1-0002-Ensure-vacuum-removes-all-dead-tuples-older-than-.patch (text/x-patch; +28 -9)

#2 Peter Geoghegan
pg@bowt.ie
In reply to: Melanie Plageman (#1)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread.

This is a great summary.

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.

I think that this is the right general approach.

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Right. I saw details exactly consistent with this when I used GDB
against a production instance.

I'm glad that you were able to come up with a repro that involves
exactly the same basic elements, including index page deletion.

--
Peter Geoghegan

#3 Alena Rybakina
lena.ribackina@yandex.ru
In reply to: Melanie Plageman (#1)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Hi, Melanie! I'm glad to hear that you have found the root cause of the
problem) Thank you for that!

On 21.06.2024 02:42, Melanie Plageman wrote:

Hi,

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread.

On master, after 1ccc1e05ae removed the retry loop in
lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang
could no longer happen, but we can still attempt to freeze dead tuples
visibly killed before OldestXmin -- resulting in an ERROR.

Pruning may fail to remove dead tuples with xmax before OldestXmin if
the tuple is not considered removable by GlobalVisState.

For vacuum, the GlobalVisState is initially calculated at the
beginning of vacuuming the relation -- at the same time and with the
same value as VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated again when it is accessed if
a new snapshot is taken or if something caused ComputeXidHorizons() to
be called.

This can happen, for example, at the end of a round of index vacuuming
when GetOldestNonRemovableTransactionId() is called.

Normally this may result in GlobalVisState's horizon moving forward --
potentially allowing more dead tuples to be removed.

However, if a disconnected standby with a running transaction older
than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin but before
GlobalVisState is updated, the value of GlobalVisState->maybe_needed
could go backwards.

If this happens in the middle of vacuum's first pruning and freezing
pass, it is possible that pruning/freezing could then encounter a
tuple whose xmax is younger than GlobalVisState->maybe_needed and
older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum()
would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it.
But the heap_pre_freeze_checks() would ERROR out with "cannot freeze
committed xmax". This check is to avoid freezing dead tuples.

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.

This is an interesting and difficult case) I noticed that when
initializing the cluster, in my opinion, we provide excessive freezing.
Initialization takes a long time, which can lead, for example, to longer
test execution. I got rid of this by adding the check that OldestMxact
is not FirstMultiXactId, and it works fine.

if (prstate->cutoffs &&
TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
prstate->cutoffs->OldestMxact != FirstMultiXactId &&
NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    return HEAPTUPLE_DEAD;

Can I keep it?

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.

If you want to run the repro with meson, you'll have to add
't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:

meson test postgresql:recovery / recovery/099_vacuum_hang

If you use autotools, you can run it with:
make check PROVE_TESTS="t/099_vacuum_hang.pl"

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Then vacuum's first pass will continue with pruning and find our later
inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.

I make sure that the standby reconnects between vacuum_get_cutoffs()
and pruning because I have a cursor on the page keeping VACUUM FREEZE
from getting a cleanup lock.

See the repro for step-by-step explanations of how it works.

I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

[1]: /messages/by-id/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
[2]: /messages/by-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag@mail.gmail.com

To be honest, the meson test is new for me, but I see its useful
features. I think I will use it for checking my features)

I couldn't understand why the replica is necessary here. Now I am
digging into why I got similar behavior without a replica when I have
only one instance. I'm still checking this in my test, but I believe
this patch fixes the original problem because the symptoms were the same.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Geoghegan (#2)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On 21/06/2024 03:02, Peter Geoghegan wrote:

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread.

This is a great summary.

+1

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.

I think that this is the right general approach.

+1

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Right. I saw details exactly consistent with this when I used GDB
against a production instance.

I'm glad that you were able to come up with a repro that involves
exactly the same basic elements, including index page deletion.

Would it be possible to make it robust so that we could always run it
with "make check"? This seems like an important corner case to
regression test.

--
Heikki Linnakangas
Neon (https://neon.tech)

#5 Melanie Plageman
melanieplageman@gmail.com
In reply to: Alena Rybakina (#3)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina <lena.ribackina@yandex.ru> wrote:

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.

This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we provide excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid of this by adding the check that OldestMxact is not FirstMultiXactId, and it works fine.

if (prstate->cutoffs &&
    TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
    prstate->cutoffs->OldestMxact != FirstMultiXactId &&
    NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    return HEAPTUPLE_DEAD;

Can I keep it?

This looks like an addition to the new criteria I added to
heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so,
it looks like it would only return HEAPTUPLE_DEAD (and thus only
remove) a subset of the tuples my original criteria would remove. When
vacuum calculates OldestMxact as FirstMultiXactId, it would not remove
those tuples deleted before OldestXmin. It seems like OldestMxact will
equal FirstMultiXactID sometimes right after initdb and after
transaction ID wraparound. I'm not sure I totally understand the
criteria.

One thing I find confusing about this is that this would actually
remove fewer tuples than with my criteria -- which could lead to more
freezing. When vacuum calculates OldestMxact == FirstMultiXactId, we
would not remove tuples deleted before OldestXmin and thus return
HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider
freezing them. So, it seems like we would do more freezing by adding
this criterion.

Could you explain more about how the criteria you are suggesting
works? Are you saying it does less freezing than master or less
freezing than with my patch?

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.

-- snip --

I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

I couldn't understand why the replica is necessary here. Now I am digging why I got the similar behavior without replica when I have only one instance. I'm still checking this in my test, but I believe this patch fixes the original problem because the symptoms were the same.

Did you get similar behavior on master or on back branches? Was the
behavior you observed the infinite loop or the error during
heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move
the horizon on the primary backwards. When a standby reconnects with
an older oldest running transaction ID than any of the running
transactions on the primary and the vacuuming backend recomputes its
RecentXmin, the horizon may move backwards when compared to the
horizon calculated at the beginning of the vacuum. Vacuum does not
recompute cutoffs->OldestXmin during vacuuming a relation but it may
recompute the values in the GlobalVisState it uses for pruning.

We knew of only one other way that the horizon could move backwards
which Matthias describes here [1]. However, this is thought to be its
own concurrency-related bug in the commit-abort path that should be
fixed -- as opposed to the standby reconnecting with an older oldest
running transaction ID which can be expected.

Do you know if you were seeing the effects of the scenario Matthias describes?

- Melanie

[1]: /messages/by-id/CAEze2WjMTh4KS0=QEQB-Jq+tDLPR+0+zVBMfVwSPK5A=WZa95Q@mail.gmail.com

#6 Melanie Plageman
melanieplageman@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 21/06/2024 03:02, Peter Geoghegan wrote:

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Right. I saw details exactly consistent with this when I used GDB
against a production instance.

I'm glad that you were able to come up with a repro that involves
exactly the same basic elements, including index page deletion.

Would it be possible to make it robust so that we could always run it
with "make check"? This seems like an important corner case to
regression test.

I'd have to look into how to ensure I can stabilize some of the parts
that seem prone to flaking. I can probably stabilize the vacuum bit
with a query of pg_stat_activity making sure it is waiting to acquire
the cleanup lock.

I don't, however, see a good way around the large amount of data
required to trigger more than one round of index vacuuming. I could
generate the data more efficiently than I am doing here
(generate_series() in the from clause). Perhaps with a copy? I know it
is too slow now to go in an ongoing test, but I don't have an
intuition around how fast it would have to be to be acceptable. Is
there a set of additional tests that are slower that we don't always
run? I didn't follow how the wraparound test ended up, but that seems
like one that would have been slow.

- Melanie

#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Melanie Plageman (#6)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On 6/24/24 16:53, Melanie Plageman wrote:

On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 21/06/2024 03:02, Peter Geoghegan wrote:

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Right. I saw details exactly consistent with this when I used GDB
against a production instance.

I'm glad that you were able to come up with a repro that involves
exactly the same basic elements, including index page deletion.

Would it be possible to make it robust so that we could always run it
with "make check"? This seems like an important corner case to
regression test.

I'd have to look into how to ensure I can stabilize some of the parts
that seem prone to flaking. I can probably stabilize the vacuum bit
with a query of pg_stat_activity making sure it is waiting to acquire
the cleanup lock.

I don't, however, see a good way around the large amount of data
required to trigger more than one round of index vacuuming. I could
generate the data more efficiently than I am doing here
(generate_series() in the from clause). Perhaps with a copy? I know it
is too slow now to go in an ongoing test, but I don't have an
intuition around how fast it would have to be to be acceptable. Is
there a set of additional tests that are slower that we don't always
run? I didn't follow how the wraparound test ended up, but that seems
like one that would have been slow.

I think it depends on what is the impact on the 'make check' duration.
If it could be added to one of the existing test groups, then it depends
on how long the slowest test in that group is. If the new test needs to
be in a separate group, it probably needs to be very fast.

But I was wondering how much time are we talking about, so I tried

creating a table, filling it with 300k rows => 250ms
creating an index => 180ms
delete 90% data => 200ms
vacuum t => 130ms

which with m_w_m=1MB does two rounds of index cleanup. That's ~760ms.
I'm not sure how much more stuff the test needs to do, but this
would be pretty reasonable, if we could add it to an existing group.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Melanie Plageman (#1)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin.

I don't have a great feeling about this fix. It's not that I think
it's wrong. It's just that the underlying problem here is that we have
heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
of deciding what gets pruned, but that doesn't actually work, because
as I pointed out in
/messages/by-id/CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
fix is to consider both variables, which again may be totally correct,
but wouldn't it be a lot better if we didn't have two variables
fighting for control of the same behavior?

(I'm not trying to be a nuisance here -- I think it's great that
you've done the work to pin this down and perhaps there is no better
fix than what you've proposed.)

--
Robert Haas
EDB: http://www.enterprisedb.com

#9 Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#8)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 11:44 AM Robert Haas <robertmhaas@gmail.com> wrote:

I don't have a great feeling about this fix. It's not that I think
it's wrong. It's just that the underlying problem here is that we have
heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
of deciding what gets pruned, but that doesn't actually work, because
as I pointed out in
/messages/by-id/CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
fix is to consider both variables, which again may be totally correct,
but wouldn't it be a lot better if we didn't have two variables
fighting for control of the same behavior?

Why would it be better? It's to our advantage to have vistest prune
away extra tuples when possible. Andres placed a lot of emphasis on
that during the snapshot scalability work -- vistest can be updated on
the fly.

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

--
Peter Geoghegan

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#9)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

I agree with you that letting the pruning horizon move forward during
vacuum is desirable. I'm just wondering if having the vacuum code need
to know a second horizon is really the best way to address that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11 Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#10)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 1:05 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

If we did things like that then it would still be necessary to write a
patch like the one Melanie came up with, on the grounds that we'd
really need to be paranoid about having missed some subtlety. We might
as well just rely on the mechanism directly. I just don't think that
it makes much difference.

--
Peter Geoghegan

#12 Melanie Plageman
melanieplageman@gmail.com
In reply to: Robert Haas (#10)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 1:05 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

For the purposes of pruning, we are effectively always using the more
conservative of the two with this patch.

Are you more concerned about having a single horizon for pruning or
about having a horizon that does not move backwards after being
established at the beginning of vacuuming the relation?

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

I agree with you that letting the pruning horizon move forward during
vacuum is desirable. I'm just wondering if having the vacuum code need
to know a second horizon is really the best way to address that.

I was thinking about this some more and I realized I don't really get
why we think using GlobalVisState for pruning will let us remove more
tuples in the common case.

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change. But, in writing this repro, it is actually quite
hard to get GlobalVisState to update. Our backend's RecentXmin needs
to have changed. And there aren't very many places where we take a new
snapshot after starting to vacuum a relation. One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

- Melanie

#13 Peter Geoghegan
pg@bowt.ie
In reply to: Melanie Plageman (#12)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change.

I believe that it's more of an aspirational thing at this point. That
it is currently aspirational (it happens to some extent but isn't ever
particularly useful) shouldn't change the analysis about how to fix
this bug.

One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

No, that's definitely not what we were thinking of. It's just an
accident that it's almost the only thing that'll do that.

--
Peter Geoghegan

#14 Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#1)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.
heap_prepare_freeze_tuple() is called on HEAPTUPLE_RECENTLY_DEAD
tuples with normal xmaxes (non-multis) so that it can freeze tuples
from aborted updates. The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update). Freezing dead tuples
replaces their xmax with InvalidTransactionId -- which would make them
look alive. So, it makes sense we don't do this for dead tuples in the
common case. But why is it 1) okay and 2) desirable to freeze xmaxes
of tuples from aborted updates? Won't it make them look alive again?

- Melanie

#15Robert Haas
robertmhaas@gmail.com
In reply to: Melanie Plageman (#12)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Are you more concerned about having a single horizon for pruning or
about having a horizon that does not move backwards after being
established at the beginning of vacuuming the relation?

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons as a way of covering possible discrepancies between
those values.

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

Agreed.

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change. But, in writing this repro, it is actually quite
hard to get GlobalVisState to update. Our backend's RecentXmin needs
to have changed. And there aren't very many places where we take a new
snapshot after starting to vacuum a relation. One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Peter Geoghegan
pg@bowt.ie
In reply to: Melanie Plageman (#14)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.

We don't do that with XID-based xmaxes. Though perhaps we should, since
we'll already prune away the successor tuple, and so might as well go
one tiny step further and clear the xmax for the original tuple via
freezing/setting it to InvalidTransactionId. Instead we just leave the
original tuple largely undisturbed, with its original xmax.

We do something like that with Multi-based xmax fields, though not
with the specific goal of cleaning up after aborts in mind (we can
also remove lockers that are no longer running, regardless of where
they are relative to OldestXmin, stuff like that). The actual goal
with that is to enforce MultiXactCutoff, independently of whether or
not their member XIDs are < FreezeLimit yet.

The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update).

Perhaps I misunderstand, but: we simply don't freeze DEAD (not
RECENTLY_DEAD) tuples in the first place, because we don't have to
(pruning removes them instead). It doesn't matter if they're DEAD due
to being from aborted transactions or DEAD due to being
deleted/updated by a transaction that committed (committed and <
OldestXmin).

The freezing-related code paths in heapam.c don't particularly care
whether a tuple is RECENTLY_DEAD or LIVE according to HTSV + OldestXmin.
Just as long as it's not fully DEAD (then it should have been pruned).

--
Peter Geoghegan

#17Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#15)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons as a way of covering possible discrepancies between
those values.

Your characterization of OldestXmin + vistest as two horizons seems
pretty arbitrary to me. I know what you mean, of course, but it seems
like a distinction without a difference.

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

It definitely can do that. Just not in a way that meaningfully
increases the number of heap tuples that we can recognize as DEAD and
remove. At least not currently.

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

That's my understanding, yes. That is, vistest is approximately the
same thing as OldestXmin anyway. At least for now.

--
Peter Geoghegan

#18Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#16)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:42 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.

We don't do that with XID-based xmaxes. Though perhaps we should, since
we'll already prune away the successor tuple, and so might as well go
one tiny step further and clear the xmax for the original tuple via
freezing/setting it to InvalidTransactionId. Instead we just leave the
original tuple largely undisturbed, with its original xmax.

I thought that was the case too, but we call
heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples and then

else if (TransactionIdIsNormal(xid))
{
    /* Raw xmax is normal XID */
    freeze_xmax = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
}

And then later we

if (freeze_xmax)
    frz->xmax = InvalidTransactionId;

and then when we execute freezing the tuple in heap_execute_freeze_tuple()

HeapTupleHeaderSetXmax(tuple, frz->xmax);

Which sets the xmax to InvalidTransactionId. Or am I missing something?

The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update).

Perhaps I misunderstand, but: we simply don't freeze DEAD (not
RECENTLY_DEAD) tuples in the first place, because we don't have to
(pruning removes them instead). It doesn't matter if they're DEAD due
to being from aborted transactions or DEAD due to being
deleted/updated by a transaction that committed (committed and <
OldestXmin).

Right, I'm talking about HEAPTUPLE_RECENTLY_DEAD tuples.
HEAPTUPLE_DEAD tuples are pruned away. But we can't replace the xmax
of a tuple that has been deleted or updated by a transaction that
committed with InvalidTransactionId. And it seems like the code does
that? Why even call heap_prepare_freeze_tuple() on
HEAPTUPLE_RECENTLY_DEAD tuples? Is it mainly to handle MultiXact
freezing?

The freezing-related code paths in heapam.c don't particularly care
whether a tuple is RECENTLY_DEAD or LIVE according to HTSV + OldestXmin.
Just as long as it's not fully DEAD (then it should have been pruned).

But it just seems like we shouldn't freeze RECENTLY_DEAD either.

- Melanie

#19Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#17)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:51 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

It definitely can do that. Just not in a way that meaningfully
increases the number of heap tuples that we can recognize as DEAD and
remove. At least not currently.

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

That's my understanding, yes. That is, vistest is approximately the
same thing as OldestXmin anyway. At least for now.

Exactly. Something has to cause this backend to update its view of the
horizon. At the end of index vacuuming,
GetOldestNonRemovableTransactionId() will explicitly
ComputeXidHorizons() which will update our backend's GlobalVisStates.
Otherwise, if our backend's RecentXmin is updated, by taking a new
snapshot, then we may update our GlobalVisStates. See
GlobalVisTestShouldUpdate() for the conditions under which we would
update our GlobalVisStates during the normal visibility checks
happening during pruning.

Vacuum used to open indexes after calculating horizons before starting
its first pass. This led to a recomputation of the horizon. But, in
master, there aren't many obvious places where such a thing would be
happening.

- Melanie

#20Noah Misch
noah@leadboat.com
In reply to: Peter Geoghegan (#17)
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 04:51:24PM -0400, Peter Geoghegan wrote:

On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons as a way of covering possible discrepancies between
those values.

Your characterization of OldestXmin + vistest as two horizons seems
pretty arbitrary to me. I know what you mean, of course, but it seems
like a distinction without a difference.

"Two horizons" matches how I model it. If the two were _always_ indicating
the same notion of visibility, we wouldn't have this thread.

On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

Agreed, and I think using different sources for pruning and freezing is a
recipe for future bugs. Fundamentally, both are about answering "is
snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
That's not to say this thread shall unify the two, but I suspect that's the
right long-term direction.
