visibility map corruption

Started by Floris Van Neeover 4 years ago31 messages
#1Floris Van Nee
florisvannee@Optiver.com

Hi hackers,

We recently ran into an issue where the visibility map of a relation was corrupt, running Postgres 12.4. The error we'd get when running a SELECT * from this table is:

could not access status of transaction 3704450152
DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.

On the lists I could find several similar reports, but corruption like this could obviously have a very wide range of root causes.. see [1]https://postgrespro.com/list/thread-id/2422376 [2]https://postgrespro.com/list/thread-id/2501800 [3]https://postgrespro.com/list/thread-id/2321949 for example - not all of them have their root cause known.

This particular case was similar to reported cases above, but also has some differences.

The following query returns ~21.000 rows, which indicates something inconsistent between the visibility map and the pd_all_visible flag on the page:

select * from pg_check_frozen('tbl');

Looking at one of the affected pages with pageinspect:

=# SELECT lp,lp_off,lp_flags,lp_len,t_xmin,t_xmax,t_field3,t_ctid,t_infomask2,t_infomask,t_hoff,t_oid FROM heap_page_items(get_raw_page('tbl', 726127));
┌────┬────────┬──────────┬────────┬────────────┬────────────┬──────────┬────────────┬─────────────┬────────────┬────────┬───────┐
│ lp │ lp_off │ lp_flags │ lp_len │ t_xmin │ t_xmax │ t_field3 │ t_ctid │ t_infomask2 │ t_infomask │ t_hoff │ t_oid │
├────┼────────┼──────────┼────────┼────────────┼────────────┼──────────┼────────────┼─────────────┼────────────┼────────┼───────┤
│ 1 │ 6328 │ 1 │ 1864 │ 3704450155 │ 3704450155 │ 1 │ (726127,1) │ 249 │ 8339 │ 56 │ ∅ │
│ 2 │ 4464 │ 1 │ 1864 │ 3704450155 │ 3704450155 │ 1 │ (726127,2) │ 249 │ 8339 │ 56 │ ∅ │
│ 3 │ 2600 │ 1 │ 1864 │ 3704450155 │ 3704450155 │ 1 │ (726127,3) │ 249 │ 8339 │ 56 │ ∅ │
│ 4 │ 680 │ 1 │ 1920 │ 3704450155 │ 3704450155 │ 1 │ (726127,4) │ 249 │ 8339 │ 56 │ ∅ │
└────┴────────┴──────────┴────────┴────────────┴────────────┴──────────┴────────────┴─────────────┴────────────┴────────┴───────┘

t_infomask shows that HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID bits are both unset.
This pg_visibility() call shows the inconsistency between VM and page, with PD_ALL_VISIBLE=false

=# select * from pg_visibility('tbl', 726127);
┌─────────────┬────────────┬────────────────┐
│ all_visible │ all_frozen │ pd_all_visible │
├─────────────┼────────────┼────────────────┤
│ t │ t │ f │
└─────────────┴────────────┴────────────────┘

Looking at other pages show the same information.
What's interesting is that out of the affected tuples returned by pg_check_frozen, over 99% belong to 1 transaction (3704450155 as seen above) and the remaining few are from one other transaction that occurred at roughly the same time.
I find it hard to believe that this is due to some random bit flipping, because many pages are affected, but the "incorrect" ones are in total only from two specific transactions which occurred at roughly the same time. There were also no server crashes or other known failures around the time of this transaction. I'm not ruling out any other kind of failure still, but I also cannot really explain how this could have happened.

The server has PG12.4 with full_page_writes=on, data_checksums=off. It's a large analytics database. The VM inconsistencies also occur on the streaming replicas.

I realize these cases are pretty rare and hard to debug, but I wanted to share the information I found so far here for reference. Maybe someone has an idea what occurred, or maybe someone in the future finds it useful when he finds something similar.

I have no idea how the inconsistency between VM and PD_ALL_VISIBLE started - from looking through the code I can't really find any way how this could occur. However, for it to lead to the problem described here, I believe there should be *no* SELECT that touches that particular page after the insert/update transaction and before the transaction log gets truncated. If the page is read before the transaction log gets truncated, then the hint bit HEAP_XMIN_COMMITTED will get set and future reads will succeed regardless of tx log truncation. One of the replica's had this happen to it: the affected pages are identical to the primary except that the HEAP_XMIN_COMMITTED flag is set (note that the VM inconsistency is still there on the replica though: PD_ALL_VISIBLE=false even though VM shows that all_frozen=all_visible=true). But I can query these rows on the replica without issues, because it doesn't check the tx log when it sees that HEAP_XMIN_COMMITTED is set.

-Floris

[1]: https://postgrespro.com/list/thread-id/2422376
[2]: https://postgrespro.com/list/thread-id/2501800
[3]: https://postgrespro.com/list/thread-id/2321949

In reply to: Floris Van Nee (#1)
Re: visibility map corruption

On Sun, Jul 4, 2021 at 1:44 PM Floris Van Nee <florisvannee@optiver.com> wrote:

We recently ran into an issue where the visibility map of a relation was corrupt, running Postgres 12.4. The error we'd get when running a SELECT * from this table is:

could not access status of transaction 3704450152
DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.

Have you ever used pg_upgrade on this database?

--
Peter Geoghegan

#3Floris Van Nee
florisvannee@Optiver.com
In reply to: Peter Geoghegan (#2)
RE: visibility map corruption

On Sun, Jul 4, 2021 at 1:44 PM Floris Van Nee <florisvannee@optiver.com>
wrote:

We recently ran into an issue where the visibility map of a relation was

corrupt, running Postgres 12.4. The error we'd get when running a SELECT *
from this table is:

could not access status of transaction 3704450152
DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.

Have you ever used pg_upgrade on this database?

Yes. The last time (from v11 to v12) was in October 2020. The transaction id in the tuples (the one PG is trying to check in the tx log) dated from February 2021. I do believe (but am not 100% certain) that the affected table already existed at the time of the last pg_upgrade though.

In reply to: Floris Van Nee (#3)
Re: visibility map corruption

On Sun, Jul 4, 2021 at 2:26 PM Floris Van Nee <florisvannee@optiver.com> wrote:

Have you ever used pg_upgrade on this database?

Yes. The last time (from v11 to v12) was in October 2020. The transaction id in the tuples (the one PG is trying to check in the tx log) dated from February 2021. I do believe (but am not 100% certain) that the affected table already existed at the time of the last pg_upgrade though.

I wonder if it's related to this issue:

/messages/by-id/20210423234256.hwopuftipdmp3okf@alap3.anarazel.de

Have you increased autovacuum_freeze_max_age from its default? This
already sounds like the kind of database where that would make sense.

--
Peter Geoghegan

#5Floris Van Nee
florisvannee@Optiver.com
In reply to: Peter Geoghegan (#4)
RE: visibility map corruption

I wonder if it's related to this issue:

https://www.postgresql.org/message-
id/20210423234256.hwopuftipdmp3okf@alap3.anarazel.de

Have you increased autovacuum_freeze_max_age from its default? This
already sounds like the kind of database where that would make sense.

autovacuum_freeze_max_age is increased in our setup indeed (it is set to 500M). However, we do regularly run manual VACUUM (FREEZE) on individual tables in the database, including this one. A lot of tables in the database follow an INSERT-only pattern and since it's not running v13 yet, this meant that these tables would only rarely be touched by autovacuum. Autovacuum would sometimes kick in on some of these tables at the same time at unfortunate moments. Therefore we have some regular job running that VACUUM (FREEZE)s tables with a xact age higher than a (low, 10M) threshold ourselves.

#6Bruce Momjian
bruce@momjian.us
In reply to: Floris Van Nee (#5)
Re: visibility map corruption

On Sun, Jul 4, 2021 at 10:28:25PM +0000, Floris Van Nee wrote:

I wonder if it's related to this issue:

https://www.postgresql.org/message-
id/20210423234256.hwopuftipdmp3okf@alap3.anarazel.de

Have you increased autovacuum_freeze_max_age from its default? This
already sounds like the kind of database where that would make
sense.

autovacuum_freeze_max_age is increased in our setup indeed (it is
set to 500M). However, we do regularly run manual VACUUM (FREEZE)
on individual tables in the database, including this one. A lot of
tables in the database follow an INSERT-only pattern and since it's
not running v13 yet, this meant that these tables would only rarely
be touched by autovacuum. Autovacuum would sometimes kick in on some
of these tables at the same time at unfortunate moments. Therefore we
have some regular job running that VACUUM (FREEZE)s tables with a xact
age higher than a (low, 10M) threshold ourselves.

OK, this is confirmation that the pg_resetwal bug, and its use by
pg_upgrade, is a serious issue that needs to be addressed. I am
prepared to work on it now.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

In reply to: Bruce Momjian (#6)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 10:27 AM Bruce Momjian <bruce@momjian.us> wrote:

OK, this is confirmation that the pg_resetwal bug, and its use by
pg_upgrade, is a serious issue that needs to be addressed. I am
prepared to work on it now.

To be clear, I'm not 100% sure that this is related to the pg_upgrade
+ "pg_resetwal sets oldestXid to an invented value" issue. I am sure
that that is a serious issue that needs to be addressed sooner rather
than later, though.

--
Peter Geoghegan

#8Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#7)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 10:32:24AM -0700, Peter Geoghegan wrote:

On Tue, Jul 6, 2021 at 10:27 AM Bruce Momjian <bruce@momjian.us> wrote:

OK, this is confirmation that the pg_resetwal bug, and its use by
pg_upgrade, is a serious issue that needs to be addressed. I am
prepared to work on it now.

To be clear, I'm not 100% sure that this is related to the pg_upgrade
+ "pg_resetwal sets oldestXid to an invented value" issue. I am sure
that that is a serious issue that needs to be addressed sooner rather
than later, though.

Well, pg_upgrade corruptions are rare, but so is modifying
autovacuum_freeze_max_age. If we have a corruption and we know
autovacuum_freeze_max_age was modified, odds are that is the cause.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

In reply to: Bruce Momjian (#8)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 10:58 AM Bruce Momjian <bruce@momjian.us> wrote:

Well, pg_upgrade corruptions are rare, but so is modifying
autovacuum_freeze_max_age. If we have a corruption and we know
autovacuum_freeze_max_age was modified, odds are that is the cause.

My point is that there isn't necessarily that much use in trying to
determine what really happened here. It would be nice to know for
sure, but it shouldn't affect what we do about the bug.

In a way the situation with the bug is simple. Clearly Tom wasn't
thinking of pg_upgrade when he wrote the relevant pg_resetwal code
that sets oldestXid, because pg_upgrade didn't exist at the time. He
was thinking of restoring the database to a relatively sane state in
the event of some kind of corruption, with all of the uncertainty that
goes with that. Nobody noticed that pg_upgrade gets this same behavior
until much more recently.

Now that we see the problem laid out, there isn't much to think about
that will affect the response to the issue. At least as far as I can
tell. We know that pg_upgrade uses pg_resetwal's -x flag in a context
where there is no reason at all to think that the database is corrupt
-- Tom can't have anticipated that all those years ago. It's easy to
see that the behavior is wrong for pg_upgrade, and it's very hard to
imagine any way in which it might have accidentally made some sense
all along. We should just carry forward the original oldestXid.

--
Peter Geoghegan

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#9)
Re: visibility map corruption

Peter Geoghegan <pg@bowt.ie> writes:

... We should just carry forward the original oldestXid.

Yup. It's a bit silly that we recognized the need to do that
for oldestMultiXid yet not for oldestXid.

BTW, is it really necessary for copy_xact_xlog_xid to invoke pg_resetwal
so many times? Why can't we pass all of the update-this options in one
call?

Who's going to do the legwork on this?

regards, tom lane

#11Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Tom Lane (#10)
Re: visibility map corruption

Hi,

On 7/6/21 8:57 PM, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

... We should just carry forward the original oldestXid.

Yup. It's a bit silly that we recognized the need to do that
for oldestMultiXid yet not for oldestXid.

FWIW there is a commitfest entry for it:
https://commitfest.postgresql.org/33/3105/

Thanks

Bertrand

In reply to: Tom Lane (#10)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 11:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

... We should just carry forward the original oldestXid.

Yup. It's a bit silly that we recognized the need to do that
for oldestMultiXid yet not for oldestXid.

True. But at the same time it somehow doesn't seem silly at all. IME
some of the most devious bugs evade detection by hiding in plain
sight.

It looks like amcheck's verify_heapam.c functionality almost catches
bugs like this one. Something for Mark (CC'd) to consider. Does it
matter that we usually "ctx.oldest_xid = ctx.relfrozenxid", and so
usually use pg_class.relfrozenxid as our oldest_xid (and not
ShmemVariableCache->oldestXid)? In other words, could we be doing more
to sanitize ShmemVariableCache->oldestXid, especially when the
relation's pg_class.relfrozenxid happens to be set to a real XID?

BTW, is it really necessary for copy_xact_xlog_xid to invoke pg_resetwal
so many times? Why can't we pass all of the update-this options in one
call?

No opinion here.

Who's going to do the legwork on this?

Can Bruce take care of committing the patch for this? Bruce?

This should definitely be in the next point release IMV.

--
Peter Geoghegan

#13Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Geoghegan (#12)
Re: visibility map corruption

On Jul 6, 2021, at 2:27 PM, Peter Geoghegan <pg@bowt.ie> wrote:

It looks like amcheck's verify_heapam.c functionality almost catches
bugs like this one. Something for Mark (CC'd) to consider. Does it
matter that we usually "ctx.oldest_xid = ctx.relfrozenxid", and so
usually use pg_class.relfrozenxid as our oldest_xid (and not
ShmemVariableCache->oldestXid)? In other words, could we be doing more
to sanitize ShmemVariableCache->oldestXid, especially when the
relation's pg_class.relfrozenxid happens to be set to a real XID?

Thanks, Peter, for drawing my attention to this. I had already been following this thread, but had not yet thought about the problem in terms of amcheck.

I will investigate possible solutions in verify_heapam().


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In reply to: Mark Dilger (#13)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 3:12 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Thanks, Peter, for drawing my attention to this. I had already been following this thread, but had not yet thought about the problem in terms of amcheck.

I will investigate possible solutions in verify_heapam().

Thanks! Great that we might be able to make a whole class of bugs
detectable with the new amcheck stuff. Glad that I didn't forget about
amcheck myself -- I almost forgot.

When I was working on the btree amcheck code, I looked for interesting
historical bugs and made sure that I could detect them. That seems
even more important with heapam. I wouldn't be surprised if one or two
important invariants were missed, in part because the heapam design
didn't have invariants as a starting point.

--
Peter Geoghegan

#15Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#12)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 02:27:34PM -0700, Peter Geoghegan wrote:

BTW, is it really necessary for copy_xact_xlog_xid to invoke pg_resetwal
so many times? Why can't we pass all of the update-this options in one
call?

No opinion here.

Who's going to do the legwork on this?

Can Bruce take care of committing the patch for this? Bruce?

This should definitely be in the next point release IMV.

Yes, I can, though it seems like a much bigger issue than pg_upgrade.
I will be glad to dig into it.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

In reply to: Bruce Momjian (#15)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 3:30 PM Bruce Momjian <bruce@momjian.us> wrote:

Yes, I can, though it seems like a much bigger issue than pg_upgrade.
I will be glad to dig into it.

I'm not sure what you mean by that. Technically this would be an issue
for any program that uses "pg_resetwal -x" in the way that pg_upgrade
does, with those same expectations. But isn't pg_upgrade the only
known program that behaves like that?

I don't see any reason why this wouldn't be treated as a pg_upgrade
bug in the release notes, regardless of the exact nature or provenance
of the issue -- the pg_upgrade framing seems useful because this is a
practical problem for pg_upgrade users alone. Have I missed something?

--
Peter Geoghegan

#17Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#15)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 06:30:41PM -0400, Bruce Momjian wrote:

On Tue, Jul 6, 2021 at 02:27:34PM -0700, Peter Geoghegan wrote:

BTW, is it really necessary for copy_xact_xlog_xid to invoke pg_resetwal
so many times? Why can't we pass all of the update-this options in one
call?

No opinion here.

Who's going to do the legwork on this?

Can Bruce take care of committing the patch for this? Bruce?

This should definitely be in the next point release IMV.

Yes, I can, though it seems like a much bigger issue than pg_upgrade.
I will be glad to dig into it.

Bertrand Drouvot provided a patch in the thread, so I will start from
that; CC'ing him too.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#18Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#16)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 03:46:48PM -0700, Peter Geoghegan wrote:

On Tue, Jul 6, 2021 at 3:30 PM Bruce Momjian <bruce@momjian.us> wrote:

Yes, I can, though it seems like a much bigger issue than pg_upgrade.
I will be glad to dig into it.

I'm not sure what you mean by that. Technically this would be an issue
for any program that uses "pg_resetwal -x" in the way that pg_upgrade
does, with those same expectations. But isn't pg_upgrade the only
known program that behaves like that?

I don't see any reason why this wouldn't be treated as a pg_upgrade
bug in the release notes, regardless of the exact nature or provenance
of the issue -- the pg_upgrade framing seems useful because this is a
practical problem for pg_upgrade users alone. Have I missed something?

My point is that there are a lot internals involved here that are not
part of pg_upgrade, though it probably only affects pg_upgrade. Anyway,
Bertrand patch seems to have what I need.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

In reply to: Bruce Momjian (#18)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:

My point is that there are a lot internals involved here that are not
part of pg_upgrade, though it probably only affects pg_upgrade. Anyway,
Bertrand patch seems to have what I need.

I was confused by your remarks because I am kind of looking at it from
the opposite angle. At least now that I've thought about it a bit.

Since the snippet of pg_resetwal code that sets oldestXid wasn't ever
intended to be used by pg_upgrade, but was anyway, what we have is a
something that's clearly totally wrong (at least in the pg_upgrade
case). It's not just wrong for pg_upgrade to do things that way --
it's also wildly unreasonable. We heard a complaint about this from
Reddit only because it worked "as designed", and so made the cluster
immediately have an anti-wraparound autovacuum. But why would anybody
want that behavior, even if it was implemented correctly? It simply
makes no sense.

The consequences of this bug are indeed complicated and subtle and
will probably never be fully understood. But at the same time fixing
the bug now seems kind of simple. (Working backwards to arrive here
was a bit tricky, mind you.)

--
Peter Geoghegan

#20Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#18)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 06:49:10PM -0400, Bruce Momjian wrote:

On Tue, Jul 6, 2021 at 03:46:48PM -0700, Peter Geoghegan wrote:

On Tue, Jul 6, 2021 at 3:30 PM Bruce Momjian <bruce@momjian.us> wrote:

Yes, I can, though it seems like a much bigger issue than pg_upgrade.
I will be glad to dig into it.

I'm not sure what you mean by that. Technically this would be an issue
for any program that uses "pg_resetwal -x" in the way that pg_upgrade
does, with those same expectations. But isn't pg_upgrade the only
known program that behaves like that?

I don't see any reason why this wouldn't be treated as a pg_upgrade
bug in the release notes, regardless of the exact nature or provenance
of the issue -- the pg_upgrade framing seems useful because this is a
practical problem for pg_upgrade users alone. Have I missed something?

My point is that there are a lot internals involved here that are not
part of pg_upgrade, though it probably only affects pg_upgrade. Anyway,
Bertrand patch seems to have what I need.

One question is how do we want to handle cases where -x next_xid is used
but -u oldestXid is not used? Compute a value for oldestXid like we did
previously? Throw an error? Leave oldestXid unchanged? I am thinking
the last option.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#21Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#20)
1 attachment(s)
Re: visibility map corruption

On Tue, Jul 6, 2021 at 08:36:13PM -0400, Bruce Momjian wrote:

On Tue, Jul 6, 2021 at 06:49:10PM -0400, Bruce Momjian wrote:

My point is that there are a lot internals involved here that are not
part of pg_upgrade, though it probably only affects pg_upgrade. Anyway,
Bertrand patch seems to have what I need.

One question is how do we want to handle cases where -x next_xid is used
but -u oldestXid is not used? Compute a value for oldestXid like we did
previously? Throw an error? Leave oldestXid unchanged? I am thinking
the last option.

Here is a modified version of Bertrand's patch, with docs, that does the
last option.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

Attachments:

oldestxid.diff.gzapplication/gzipDownload
���`oldestxid.diff�ZyW�H��|���2������a82~8K�Yv�yz����Hj��3�w����OV6~	�[U��u�����c�D�&NL����j����VCjW�������[�A2~�f��-:#������n�Q���v�����R��:����i������Ns��3�G�r��������x�O��& !�����=�����D�S�tO|2B�pc{*����2��wz�v�%�$��k������E�b`��G��x:s���
��IU��T��L�s-�Z~d�8�9V���\�*��2�/����u�lL�)%B��\TR��d��uF"����pP�������Zd<'.c��?!6�d>�*tL��o0u*�v`�>��8>'����1�����S�hyR]���RH���Xr�r
���>�7vw�
�n�Z�����6���W�B�-�)P�6�"b��W*sJg�EM�L��8I"aXjD�(#Q4�\���J+Sp�
�S����L{e��bwu�0'
N�Hx �r���Y��q�2\�����-��#��������D��'���0q\4�20K�'��\��'�����h3;���TE�n�lF�������t��s�ZEcD����������o�H4>��-�B���T��N�i�v{���%��m:D�N��v�#D^��_��':�e������r\���x<��3A�&#��{)��1�LE}bnD9������aa.q�i�s�E��������{�	F���i��q������t0s
�)b�D����*iGK�,�" 9�����M��f2�sb=�v����������z�!<������&fY�>?��Gi�
�p���_,����o�er����L��oeA�3�9������&gr��4�m��U�!���������k��s���x�  #�}��.���Z:\�&L�l���z�����A�A����7E1��T�.�'��B�2)�����n����7��z�'��Z"�K�D���I	��]r��%��_h����!oS���@��_�Vn�w��Z��!5���/�Q�k�+le]�1K\Ye��� ��c�`��A��AG�~%�� ,���vP����`�N���J��X`�4x�"7�w��%�t����N[�b(@@])��Xy(�����
�\��x�=^�V�b�LR�Z�3'V���o�u��A� �EI��n���K�,���%%����nPP�Y,@�Lx=B�%��+'�b1��"�+.��ro����$���%�����)A85�m�h��n$�������X�����w�[�p����-��?}�}Z[������K_MttO�qXn���j����6��U��Ov2�7�3��w,emS/KE���s$����3�+�{xY1��|��@�s�+������9V�:n_:.[�(����{E�`���&�xW��_�w �j(��a���*yU�\)��fh�������4��������=o��������1����p����x��[�J�J��Q��&F�'�LO�94�hg�1��� �ib���yJa�������D0b��f�|�t�^���[&#.E9\M:>P��AB�")2 b>ts�2��#\�1I+W2f����Pn��"\(�$Ex����@����T��k����/TR�������[�����H�����Ck''/�����������,��!7�w���%m�9�E[��&�xe��*^L���F���������b*����������`������@YB~~
��T*rq6:��=�{��}(��m<"D�4��;�lx�Q�~pQ����}I����2s�9�0�j��c���B�nd�Q�4�x�BF�&�OhI�N[]����.��YQ�i����#�,	D�(��6h�����_7V���Yjq�l�6
���Z� 	,4�<dc>���(�a��=��]_
�/S�<��8�c*�����5����-�C��&#�C��,-}���
�-�G��-�6�T�?����jG�u���I ��CT5�]��T��Z���������
��Ra���P�
#$�z�$����WJ��8��2+[�l����u��|i�J���4����n0��Kz�m��CN��<�Q��/��:3��c�&�$��Uw������|���A��G��=����}��5��G 5�� 5�E!5
e����vF�
�	J���Z@F��A����-��y��-pS��[xS�����Z�=�w��!����7'��Sy�����g.�_>��m�BY8�V�X'r�u��������w�����[�$���E��B�����$�w-������lV*Fk�iv��Z��;�lQ[W�d����P��
��b��it$R�]H_ ��{S|)���y�:������Ba���w��kx�e?i�A�n�^=���~�q>s��_PfR�.3�C=0�H�u�-.(�H�z���7�Q����P�O��5gL��-��a�\�@���0�q�HAQ�R�@0�?e�� E�P�_��6f����op�+��k~����/���P��;�`��'8K��r��G�@�R�$'}R�Guq6�����zde�����C���o�]�=����_n/u}1M���(���{���<���S�W���XY��Z�&aIU��}�%o�-���������a��i�����3�`����YJ`�x"��`�7���5�!
w���U����������h�/C��I��)���������''��3b�'�}r��z4�?;�W�������g����g#��������z�@�_���&>g�&��1���]q���y�����'�������X[�����"�`�X�G����>i�K�8�(�K��Hn�\���b��pQO��o����/m��r������4�J��]��F�������&��
s�����&�0����2Y0�?�1�0��P����#��x(B%y�����b�,i���ePU@�o����X��3f	�G=��=*������I�?���2��>L;)H���5u<"S>�������������T�M��KO���~��S���X<����N�?�i�&GBL���L�K��f��v*k|��4��9����b*~pX?8��{�xP����%fF?�\/q��!�d<���|j�Cy��>2\g����u~����F8�8�?^�0�*
#22Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Bruce Momjian (#21)
Re: visibility map corruption

Hi,

On 7/7/21 3:49 AM, Bruce Momjian wrote:

On Tue, Jul 6, 2021 at 08:36:13PM -0400, Bruce Momjian wrote:

On Tue, Jul 6, 2021 at 06:49:10PM -0400, Bruce Momjian wrote:

My point is that there are a lot internals involved here that are not
part of pg_upgrade, though it probably only affects pg_upgrade. Anyway,
Bertrand patch seems to have what I need.

One question is how do we want to handle cases where -x next_xid is used
but -u oldestXid is not used? Compute a value for oldestXid like we did
previously? Throw an error? Leave oldestXid unchanged? I am thinking
the last option.

Here is a modified version of Bertrand's patch, with docs, that does the
last option.

Thanks for having looked at it.

It looks good to me, but i have one question:

+��� printf(_("� -u, --oldest-transaction-id=XID� set oldest transaction
ID\n"));

and

+������������������ if (!TransactionIdIsNormal(set_oldest_xid))
+������������������ {
+����������������������� pg_log_error("oldest transaction ID (-u) must 
be greater or equal to %u", FirstNormalTransactionId);
+����������������������� exit(1);
+������������������ }

I am wondering if we should not keep my original proposal "oldest
unfrozen transaction" (as compare to "oldest transaction") in both
output to:

- make the wording similar with what we can found in StartupXLOG():

��� ereport(DEBUG1,
����������� (errmsg_internal("oldest unfrozen transaction ID: %u, in
database %u",
���������������������������� checkPoint.oldestXid,
checkPoint.oldestXidDB)));

- give the new� "-u" a sense (somehow) from a naming point of view.

What do you think?

Thanks

Bertrand

#23Bruce Momjian
bruce@momjian.us
In reply to: Drouvot, Bertrand (#22)
2 attachment(s)
Re: visibility map corruption

On Thu, Jul 8, 2021 at 07:35:58AM +0200, Drouvot, Bertrand wrote:

Thanks for having looked at it.

It looks good to me, but i have one question:

+    printf(_("  -u, --oldest-transaction-id=XID  set oldest transaction
ID\n"));

and

+                   if (!TransactionIdIsNormal(set_oldest_xid))
+                   {
+                        pg_log_error("oldest transaction ID (-u) must be
greater or equal to %u", FirstNormalTransactionId);
+                        exit(1);
+                   }

I am wondering if we should not keep my original proposal "oldest unfrozen
transaction" (as compare to "oldest transaction") in both output to:

- make the wording similar with what we can found in StartupXLOG():

    ereport(DEBUG1,
            (errmsg_internal("oldest unfrozen transaction ID: %u, in
database %u",
                             checkPoint.oldestXid,
checkPoint.oldestXidDB)));

- give the new  "-u" a sense (somehow) from a naming point of view.

What do you think?

I was wondering about that too. We don't use the term "unfrozen" in the
pg_control output, and only in a few places in our docs. I added the
word "unfrozen" for the -u doc description in this updated patch ---
not sure how much farther to go in using this term, but I am afraid if I
use it in the areas you suggested above, it will confuse people who are
trying to match it to the pg_control output.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

Attachments:

oldestxid.diff.gzapplication/gzipDownload
oldestxid.diff.gzapplication/gzipDownload
#24Justin Pryzby
pryzby@telsasoft.com
In reply to: Drouvot, Bertrand (#22)
Re: visibility map corruption

Also, the pg_upgrade status message still seems to be misplaced:

In 20210706190612.GM22043@telsasoft.com, Justin Pryzby wrote:

I re-arranged the pg_upgrade output of that patch: it was in the middle of the
two halves: "Setting next transaction ID and epoch for new cluster"

+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -473,6 +473,12 @@ copy_xact_xlog_xid(void)
                          "\"%s/pg_resetwal\" -f -x %u \"%s\"",
                          new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
                          new_cluster.pgdata);
+       check_ok();
+       prep_status("Setting oldest XID for new cluster");
+       exec_prog(UTILITY_LOG_FILE, NULL, true, true,
+                         "\"%s/pg_resetwal\" -f -u %u \"%s\"",
+                         new_cluster.bindir, old_cluster.controldata.chkpnt_oldstxid,
+                         new_cluster.pgdata);
        exec_prog(UTILITY_LOG_FILE, NULL, true, true,
                          "\"%s/pg_resetwal\" -f -e %u \"%s\"",
                          new_cluster.bindir, old_cluster.controldata.chkpnt_nxtepoch,

--
Justin

#25Bruce Momjian
bruce@momjian.us
In reply to: Justin Pryzby (#24)
1 attachment(s)
Re: visibility map corruption

On Thu, Jul 8, 2021 at 08:11:14AM -0500, Justin Pryzby wrote:

Also, the pg_upgrade status message still seems to be misplaced:

In 20210706190612.GM22043@telsasoft.com, Justin Pryzby wrote:

I re-arranged the pg_upgrade output of that patch: it was in the middle of the
two halves: "Setting next transaction ID and epoch for new cluster"

+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -473,6 +473,12 @@ copy_xact_xlog_xid(void)
"\"%s/pg_resetwal\" -f -x %u \"%s\"",
new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
new_cluster.pgdata);
+       check_ok();
+       prep_status("Setting oldest XID for new cluster");
+       exec_prog(UTILITY_LOG_FILE, NULL, true, true,
+                         "\"%s/pg_resetwal\" -f -u %u \"%s\"",
+                         new_cluster.bindir, old_cluster.controldata.chkpnt_oldstxid,
+                         new_cluster.pgdata);
exec_prog(UTILITY_LOG_FILE, NULL, true, true,
"\"%s/pg_resetwal\" -f -e %u \"%s\"",
new_cluster.bindir, old_cluster.controldata.chkpnt_nxtepoch,

Wow, you are 100% correct. Updated patch attached.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

Attachments:

oldestxid.diff.gzapplication/gzipDownload
#26Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Bruce Momjian (#23)
Re: visibility map corruption

On 7/8/21 3:08 PM, Bruce Momjian wrote:

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

On Thu, Jul 8, 2021 at 07:35:58AM +0200, Drouvot, Bertrand wrote:

Thanks for having looked at it.

It looks good to me, but i have one question:

+ printf(_("  -u, --oldest-transaction-id=XID set oldest transaction
ID\n"));

and

+                   if (!TransactionIdIsNormal(set_oldest_xid))
+                   {
+                        pg_log_error("oldest transaction ID (-u) must be
greater or equal to %u", FirstNormalTransactionId);
+                        exit(1);
+                   }

I am wondering if we should not keep my original proposal "oldest unfrozen
transaction" (as compare to "oldest transaction") in both output to:

- make the wording similar with what we can found in StartupXLOG():

ereport(DEBUG1,
(errmsg_internal("oldest unfrozen transaction ID: %u, in
database %u",
checkPoint.oldestXid,
checkPoint.oldestXidDB)));

- give the new "-u" a sense (somehow) from a naming point of view.

What do you think?

I was wondering about that too. We don't use the term "unfrozen" in the
pg_control output, and only in a few places in our docs. I added the
word "unfrozen" for the -u doc description in this updated patch

Thanks!

---
not sure how much farther to go in using this term, but I am afraid if I
use it in the areas you suggested above, it will confuse people who are
trying to match it to the pg_control output.

Makes sense, thanks for your feedback.

Bertrand

#27Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#25)
Re: visibility map corruption

On Thu, Jul 8, 2021 at 09:51:47AM -0400, Bruce Momjian wrote:

On Thu, Jul 8, 2021 at 08:11:14AM -0500, Justin Pryzby wrote:

Also, the pg_upgrade status message still seems to be misplaced:

In 20210706190612.GM22043@telsasoft.com, Justin Pryzby wrote:

I re-arranged the pg_upgrade output of that patch: it was in the middle of the
two halves: "Setting next transaction ID and epoch for new cluster"

+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -473,6 +473,12 @@ copy_xact_xlog_xid(void)
"\"%s/pg_resetwal\" -f -x %u \"%s\"",
new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
new_cluster.pgdata);
+       check_ok();
+       prep_status("Setting oldest XID for new cluster");
+       exec_prog(UTILITY_LOG_FILE, NULL, true, true,
+                         "\"%s/pg_resetwal\" -f -u %u \"%s\"",
+                         new_cluster.bindir, old_cluster.controldata.chkpnt_oldstxid,
+                         new_cluster.pgdata);
exec_prog(UTILITY_LOG_FILE, NULL, true, true,
"\"%s/pg_resetwal\" -f -e %u \"%s\"",
new_cluster.bindir, old_cluster.controldata.chkpnt_nxtepoch,

Wow, you are 100% correct. Updated patch attached.

OK, I have the patch ready to apply to all supported Postgres versions,
and it passes all my cross-version pg_upgrade tests.

However, I am now stuck on the commit message text, and I think this is
the point Peter Geoghegan was trying to make earlier --- while we know
that preserving the oldest xid in pg_control is the right thing to do,
and that setting it to the current xid - 2 billion (the old behavior)
causes vacuum freeze to run on all tables, but what else does this patch
affect?

As far as I know, seeing a very low oldest xid causes autovacuum to
check all objects and make sure their relfrozenxid is less then
autovacuum_freeze_max_age, but isn't that just a check? Would that
cause any table scans? I would think not. And would this cause
incorrect truncation of pg_xact or fsm or vm files? I would think not
too.

Even if the old and new cluster had mismatched autovacuum_freeze_max_age
values, I don't see how that would cause any corruption either.

I could perhaps see corruption happening if pg_control's oldest xid
value was closer to the current xid value than it should be, but I can't
see how having it 2-billion away could cause harm, unless perhaps
pg_upgrade itself used enough xids to cause the counter to wrap more
than 2^31 away from the oldest xid recorded in pg_control.

What I am basically asking is how to document this and what it fixes.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

In reply to: Bruce Momjian (#27)
Re: visibility map corruption

On Fri, Jul 23, 2021 at 5:08 PM Bruce Momjian <bruce@momjian.us> wrote:

However, I am now stuck on the commit message text, and I think this is
the point Peter Geoghegan was trying to make earlier --- while we know
that preserving the oldest xid in pg_control is the right thing to do,
and that setting it to the current xid - 2 billion (the old behavior)
causes vacuum freeze to run on all tables, but what else does this patch
affect?

As far as I know the only other thing that it might affect is the
traditional use of pg_resetwal: recovering likely-corrupt data.
Getting the database to limp along for long enough to pg_dump. That is
the only interpretation that makes sense, because the code in question
predates pg_upgrade.

AFAICT that was the original spirit of the code that we're changing here.

As far as I know, seeing a very low oldest xid causes autovacuum to
check all objects and make sure their relfrozenxid is less then
autovacuum_freeze_max_age, but isn't that just a check? Would that
cause any table scans? I would think not. And would this cause
incorrect truncation of pg_xact or fsm or vm files? I would think not
too.

Tom actually wrote this code. I believe that he questioned the whole
basis of it himself quite recently.

Whether or not it's okay to change the behavior in contexts outside of
pg_upgrade (contexts where the user invokes pg_resetwal -x to get the
system to start) is perhaps debatable. It probably doesn't matter very
much if you preserve that behavior for non-pg_upgrade cases -- hard to
say. At the same time it's now easy to see that pg_upgrade shouldn't
be doing this.

Even if the old and new cluster had mismatched autovacuum_freeze_max_age
values, I don't see how that would cause any corruption either.

Sometimes the pg_control value for oldest XID is used as the oldest
non-frozen XID that's expected in the table. Other times it's
relfrozenxid itself IIRC.

I could perhaps see corruption happening if pg_control's oldest xid
value was closer to the current xid value than it should be, but I can't
see how having it 2-billion away could cause harm, unless perhaps
pg_upgrade itself used enough xids to cause the counter to wrap more
than 2^31 away from the oldest xid recorded in pg_control.

What I am basically asking is how to document this and what it fixes.

ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe
take a look at those?

--
Peter Geoghegan

#29Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#28)
Re: visibility map corruption

On Fri, Jul 23, 2021 at 05:47:18PM -0700, Peter Geoghegan wrote:

I could perhaps see corruption happening if pg_control's oldest xid
value was closer to the current xid value than it should be, but I can't
see how having it 2-billion away could cause harm, unless perhaps
pg_upgrade itself used enough xids to cause the counter to wrap more
than 2^31 away from the oldest xid recorded in pg_control.

What I am basically asking is how to document this and what it fixes.

ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe
take a look at those?

Agreed. I just wanted to make sure I wasn't missing an important aspect
of this patch. Thanks.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#30Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#29)
Re: visibility map corruption

On Fri, Jul 23, 2021 at 09:01:18PM -0400, Bruce Momjian wrote:

On Fri, Jul 23, 2021 at 05:47:18PM -0700, Peter Geoghegan wrote:

I could perhaps see corruption happening if pg_control's oldest xid
value was closer to the current xid value than it should be, but I can't
see how having it 2-billion away could cause harm, unless perhaps
pg_upgrade itself used enough xids to cause the counter to wrap more
than 2^31 away from the oldest xid recorded in pg_control.

What I am basically asking is how to document this and what it fixes.

ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe
take a look at those?

Agreed. I just wanted to make sure I wasn't missing an important aspect
of this patch. Thanks.

Another question --- with the previous code, the oldest xid was always
set to a reasonable value, -2 billion less than the current xid. With
the new code, the oldest xid might be slightly higher than the current
xid if they use -x but not -u. Is that acceptable? I think we agreed it
was. pg_upgrade will always set both.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#31Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#30)
Re: visibility map corruption

On Sat, Jul 24, 2021 at 10:01:05AM -0400, Bruce Momjian wrote:

On Fri, Jul 23, 2021 at 09:01:18PM -0400, Bruce Momjian wrote:

On Fri, Jul 23, 2021 at 05:47:18PM -0700, Peter Geoghegan wrote:

I could perhaps see corruption happening if pg_control's oldest xid
value was closer to the current xid value than it should be, but I can't
see how having it 2-billion away could cause harm, unless perhaps
pg_upgrade itself used enough xids to cause the counter to wrap more
than 2^31 away from the oldest xid recorded in pg_control.

What I am basically asking is how to document this and what it fixes.

ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe
take a look at those?

Agreed. I just wanted to make sure I wasn't missing an important aspect
of this patch. Thanks.

Another question --- with the previous code, the oldest xid was always
set to a reasonable value, -2 billion less than the current xid. With
the new code, the oldest xid might be slightly higher than the current
xid if they use -x but not -u. Is that acceptable? I think we agreed it
was. pg_upgrade will always set both.

This patch has been applied back to 9.6 and will appear in the next
minor release.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.