POC: make mxidoff 64 bits

Started by Maxim Orlovalmost 2 years ago111 messages
Jump to latest
#1Maxim Orlov
orlovmg@gmail.com

Hi!

I've been trying to introduce 64-bit transaction identifications to
Postgres for quite a while [0]/messages/by-id/CACG=ezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe=pyyjVWA@mail.gmail.com. All this implies,
of course, an enormous amount of change that will have to take place in
various modules. Consider this, the
patch set become too big to be committed “at once”.

The obvious solutions was to try to split the patch set into smaller ones.
But here it comes a new challenge,
not every one of these parts, make Postgres better at the moment. Actually,
even switching to a
FullTransactionId in PGPROC will have disadvantage in increasing of WAL
size [1]/messages/by-id/CACG=ezY7msw+jip=rtfvnfz051dRqz4s-diuO46v3rAoAE0T0g@mail.gmail.com.

In fact, I believe, we're dealing with the chicken and the egg problem
here. Not able to commit full patch set
since it is too big to handle and not able to commit parts of it, since
they make sense all together and do not
help improve Postgres at the moment.

But it's not that bad. Since the commit 4ed8f0913bfdb5f, added in [3]/messages/by-id/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com, we
are capable to use 64 bits to
indexing SLRUs.

PROPOSAL
Make multixact offsets 64 bit.

RATIONALE
It is not very difficult to overflow 32-bit mxidoff. Since, it is created
for every unique combination of the
transaction for each tuple, including XIDs and respective flags. And when a
transaction is added to a
specific multixact, it is rewritten with a new offset. In other words, it
is possible to overflow the offsets of
multixacts without overflowing the multixacts themselves and/or ordinary
transactions. I believe, there
was something about in the hackers mail lists, but I could not find links
now.

PFA, patch. Here is a WIP version. Upgrade machinery should be added later.

As always, any opinions on a subject a very welcome!

[0]: /messages/by-id/CACG=ezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe=pyyjVWA@mail.gmail.com
/messages/by-id/CACG=ezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe=pyyjVWA@mail.gmail.com
[1]: /messages/by-id/CACG=ezY7msw+jip=rtfvnfz051dRqz4s-diuO46v3rAoAE0T0g@mail.gmail.com
/messages/by-id/CACG=ezY7msw+jip=rtfvnfz051dRqz4s-diuO46v3rAoAE0T0g@mail.gmail.com
[3]: /messages/by-id/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com
/messages/by-id/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com

--
Best regards,
Maxim Orlov.

Attachments:

0001-WIP-mxidoff-to-64bit.patchapplication/octet-stream; name=0001-WIP-mxidoff-to-64bit.patchDownload+58-211
#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#1)
Re: POC: make mxidoff 64 bits

On 23/04/2024 11:23, Maxim Orlov wrote:

PROPOSAL
Make multixact offsets 64 bit.

+1, this is a good next step and useful regardless of 64-bit XIDs.

@@ -156,7 +148,7 @@
((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))

/* page in which a member is to be found */
-#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
+#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
#define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)

/* Location (byte offset within page) of flag word for a given member */

This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.

I think there are some more like that, MXOffsetToFlagsBitShift for example.

--
Heikki Linnakangas
Neon (https://neon.tech)

#3Andrey Borodin
amborodin@acm.org
In reply to: Maxim Orlov (#1)
Re: POC: make mxidoff 64 bits

On 23 Apr 2024, at 11:23, Maxim Orlov <orlovmg@gmail.com> wrote:

Make multixact offsets 64 bit.

- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("multixact \"members\" limit exceeded"),
Personally, I'd be happy with this! We had some incidents where the only mitigation was vacuum settings tweaking.

BTW as a side note... I see lot's of casts to (unsigned long long), can't we just cast to MultiXactOffset?

Best regards, Andrey Borodin.

#4wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: POC: make mxidoff 64 bits

Hi Maxim Orlov
Thank you so much for your tireless work on this. Increasing the WAL
size by a few bytes should have very little impact with today's disk
performance(Logical replication of this feature wal log is also increased a
lot, logical replication is a milestone new feature, and the community has
been improving the logical replication of functions),I believe removing
troubled postgresql Transaction ID Wraparound was also a milestone new
feature adding a few bytes is worth it!

Best regards

On Tue, 23 Apr 2024 at 17:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Show quoted text

On 23/04/2024 11:23, Maxim Orlov wrote:

PROPOSAL
Make multixact offsets 64 bit.

+1, this is a good next step and useful regardless of 64-bit XIDs.

@@ -156,7 +148,7 @@
((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))

/* page in which a member is to be found */
-#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId)

MULTIXACT_MEMBERS_PER_PAGE)

+#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset)

MULTIXACT_MEMBERS_PER_PAGE)

#define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) /

SLRU_PAGES_PER_SEGMENT)

/* Location (byte offset within page) of flag word for a given member */

This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.

I think there are some more like that, MXOffsetToFlagsBitShift for example.

--
Heikki Linnakangas
Neon (https://neon.tech)

#5Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#4)
Re: POC: make mxidoff 64 bits

On Tue, 23 Apr 2024 at 12:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.

I think there are some more like that, MXOffsetToFlagsBitShift for example.

Yeah, I always thought so too. I believe, this is just a copy-paste. You
mean, it is worth creating a separate CF
entry for these fixes?

On Tue, 23 Apr 2024 at 16:03, Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:

BTW as a side note... I see lot's of casts to (unsigned long long), can't
we just cast to MultiXactOffset?

Actually, first versions of the 64xid patch set have such a cast to types
TransactionID, MultiXact and so on. But,
after some discussions, we are switched to unsigned long long cast.
Unfortunately, I could not find an exact link
for that discussion. On the other hand, such a casting is already used
throughout the code. So, just for the
sake of the consistency, I would like to stay with these casts.

On Tue, 23 Apr 2024 at 16:03, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

Hi Maxim Orlov
Thank you so much for your tireless work on this. Increasing the WAL
size by a few bytes should have very little impact with today's disk
performance(Logical replication of this feature wal log is also increased a
lot, logical replication is a milestone new feature, and the community has
been improving the logical replication of functions),I believe removing
troubled postgresql Transaction ID Wraparound was also a milestone new
feature adding a few bytes is worth it!

I'm 100% agree. Maybe, I should return to this approach and find some
benefits for having FXIDs in WAL.

#6Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#5)
Re: POC: make mxidoff 64 bits

Hi!

Sorry for delay. I was a bit busy last month. Anyway, here is my
proposal for making multioffsets 64 bit.
The patch set consists of three parts:
0001 - making user output of offsets 64-bit ready;
0002 - making offsets 64-bit;
0003 - provide 32 to 64 bit conversion in pg_upgarde.

I'm pretty sure this is just a beginning of the conversation, so any
opinions and reviews, as always, are very welcome!

--
Best regards,
Maxim Orlov.

Attachments:

v1-0002-Use-64-bit-multixact-offsets.patchapplication/x-patch; name=v1-0002-Use-64-bit-multixact-offsets.patchDownload+16-175
v1-0001-Use-64-bit-format-output-for-multixact-offsets.patchapplication/x-patch; name=v1-0001-Use-64-bit-format-output-for-multixact-offsets.patchDownload+31-26
v1-0003-Make-pg_upgrade-convert-multixact-offsets.patchapplication/x-patch; name=v1-0003-Make-pg_upgrade-convert-multixact-offsets.patchDownload+391-6
#7Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#6)
Re: POC: make mxidoff 64 bits

Here is rebase. Apparently I'll have to do it often, since the
CATALOG_VERSION_NO changed in the patch.

--
Best regards,
Maxim Orlov.

Attachments:

v2-0001-Use-64-bit-format-output-for-multixact-offsets.patchapplication/octet-stream; name=v2-0001-Use-64-bit-format-output-for-multixact-offsets.patchDownload+31-26
v2-0003-Make-pg_upgrade-convert-multixact-offsets.patchapplication/octet-stream; name=v2-0003-Make-pg_upgrade-convert-multixact-offsets.patchDownload+391-6
v2-0002-Use-64-bit-multixact-offsets.patchapplication/octet-stream; name=v2-0002-Use-64-bit-multixact-offsets.patchDownload+11-170
#8Alexander Korotkov
aekorotkov@gmail.com
In reply to: Maxim Orlov (#7)
Re: POC: make mxidoff 64 bits

On Tue, Sep 3, 2024 at 4:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:

Here is rebase. Apparently I'll have to do it often, since the CATALOG_VERSION_NO changed in the patch.

I don't think you need to maintain CATALOG_VERSION_NO change in your
patch for the exact reason you have mentioned: patch will get conflict
each time CATALOG_VERSION_NO is advanced. It's responsibility of
committer to advance CATALOG_VERSION_NO when needed.

------
Regards,
Alexander Korotkov
Supabase

#9Maxim Orlov
orlovmg@gmail.com
In reply to: Alexander Korotkov (#8)
Re: POC: make mxidoff 64 bits

On Tue, 3 Sept 2024 at 16:32, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

I don't think you need to maintain CATALOG_VERSION_NO change in your
patch for the exact reason you have mentioned: patch will get conflict
each time CATALOG_VERSION_NO is advanced. It's responsibility of
committer to advance CATALOG_VERSION_NO when needed.

OK, I got it. My intention here was to help to test the patch. If someone
wants to have a
look at the patch, he won't need to make changes in the code. In the next
iteration, I'll
remove CATALOG_VERSION_NO version change.

--
Best regards,
Maxim Orlov.

#10Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#9)
Re: POC: make mxidoff 64 bits

Here is v3. I removed CATALOG_VERSION_NO change, so this should be done by
the actual commiter.

--
Best regards,
Maxim Orlov.

Attachments:

v3-0002-Use-64-bit-multixact-offsets.patchapplication/octet-stream; name=v3-0002-Use-64-bit-multixact-offsets.patchDownload+11-170
v3-0001-Use-64-bit-format-output-for-multixact-offsets.patchapplication/octet-stream; name=v3-0001-Use-64-bit-format-output-for-multixact-offsets.patchDownload+31-26
v3-0003-Make-pg_upgrade-convert-multixact-offsets.patchapplication/octet-stream; name=v3-0003-Make-pg_upgrade-convert-multixact-offsets.patchDownload+390-5
#11Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Maxim Orlov (#10)
Re: POC: make mxidoff 64 bits

Hi, Maxim!

Previously we accessed offsets in shared MultiXactState without locks as
32-bit read is always atomic. But I'm not sure it's so when offset become
64-bit.
E.g. GetNewMultiXactId():

nextOffset = MultiXactState->nextOffset;
is outside lock.

There might be other places we do the same as well.

Regards,
Pavel Borisov
Supabase

#12Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#11)
Re: POC: make mxidoff 64 bits

On Thu, 12 Sept 2024 at 16:09, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

Hi, Maxim!

Previously we accessed offsets in shared MultiXactState without locks as
32-bit read is always atomic. But I'm not sure it's so when offset become
64-bit.
E.g. GetNewMultiXactId():

nextOffset = MultiXactState->nextOffset;
is outside lock.

There might be other places we do the same as well.

I think the replacement of plain assignments by
pg_atomic_read_u64/pg_atomic_write_u64 would be sufficient.

(The same I think is needed for the patchset [1]/messages/by-id/CAJ7c6TMvPz8q+nC=JoKniy7yxPzQYcCTnNFYmsDP-nnWsAOJ2g@mail.gmail.com)
[1]: /messages/by-id/CAJ7c6TMvPz8q+nC=JoKniy7yxPzQYcCTnNFYmsDP-nnWsAOJ2g@mail.gmail.com
/messages/by-id/CAJ7c6TMvPz8q+nC=JoKniy7yxPzQYcCTnNFYmsDP-nnWsAOJ2g@mail.gmail.com

Regards,
Pavel Borisov

#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Pavel Borisov (#11)
Re: POC: make mxidoff 64 bits

On 2024-Sep-12, Pavel Borisov wrote:

Hi, Maxim!

Previously we accessed offsets in shared MultiXactState without locks as
32-bit read is always atomic. But I'm not sure it's so when offset become
64-bit.
E.g. GetNewMultiXactId():

nextOffset = MultiXactState->nextOffset;
is outside lock.

Good though. But fortunately I think it's not a problem. The one you
say is with MultiXactGetLock held in shared mode -- and that works OK,
as the assignment (in line 1263 at the bottom of the same routine) is
done with exclusive lock held.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#10)
Re: POC: make mxidoff 64 bits

On 07/09/2024 07:36, Maxim Orlov wrote:

Here is v3.

MultiXactMemberFreezeThreshold looks quite bogus now. Now that
MaxMultiXactOffset==2^64-1, you cannot get anywhere near the
MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD
values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I
guess it would still be useful to trigger autovacuum if multixacts
members grows large though, to release the disk space, even if you can't
run out of members as such anymore. What should the logic for that look
like?

I'd love to see some tests for the pg_upgrade code. Something like a
little perl script to generate test clusters with different wraparound
scenarios etc. using the old version, and a TAP test to run pg_upgrade
on them and verify that queries on the upgraded cluster works correctly.
We don't have tests like that in the repository today, and I don't know
if we'd want to commit these permanently either, but it would be highly
useful now as a one-off thing, to show that the code works.

On upgrade, are there really no changes required to
pg_multixact/members? I imagined that the segment files would need to be
renamed around wraparound, so that if you previously had files like this:

pg_multixact/members/FFFE
pg_multixact/members/FFFF
pg_multixact/members/0000
pg_multixact/members/0001

after upgrade you would need to have:

pg_multixact/members/00000000FFFE
pg_multixact/members/00000000FFFF
pg_multixact/members/000000010000
pg_multixact/members/000000010001

Thanks for working on this!

--
Heikki Linnakangas
Neon (https://neon.tech)

#15Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#14)
Re: POC: make mxidoff 64 bits

On Tue, 22 Oct 2024 at 12:43, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

MultiXactMemberFreezeThreshold looks quite bogus now. Now that
MaxMultiXactOffset==2^64-1, you cannot get anywhere near the
MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD
values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I
guess it would still be useful to trigger autovacuum if multixacts
members grows large though, to release the disk space, even if you can't
run out of members as such anymore. What should the logic for that look
like?

Yep, you're totally correct. The MultiXactMemberFreezeThreshold call is not
necessary any more and can be safely removed.
I made this as a separate commit in v4. But, as you rightly say, it will be
useful to trigger autovacuum in some cases. The obvious
place for this machinery is in the GetNewMultiXactId. I imagine this like
"if nextOff - oldestOff > threshold kick autovac". So, the
question is: what kind of threshold we want here? Is it a hard coded define
or GUC? If it is a GUC (32–bit), what values should it be?

And the other issue I feel a little regretful about. We still must be
holding MultiXactGenLock in order to track oldestOffset to do
"nextOff - oldestOff" calculation.

I'd love to see some tests for the pg_upgrade code. Something like a
little perl script to generate test clusters with different wraparound
scenarios etc.

Agree. I'll address this as soon as I can.

--
Best regards,
Maxim Orlov.

Attachments:

v4-0004-Make-pg_upgrade-convert-multixact-offsets.patchapplication/octet-stream; name=v4-0004-Make-pg_upgrade-convert-multixact-offsets.patchDownload+390-5
v4-0003-Get-rid-of-MultiXactMemberFreezeThreshold-call.patchapplication/octet-stream; name=v4-0003-Get-rid-of-MultiXactMemberFreezeThreshold-call.patchDownload+7-225
v4-0001-Use-64-bit-format-output-for-multixact-offsets.patchapplication/octet-stream; name=v4-0001-Use-64-bit-format-output-for-multixact-offsets.patchDownload+31-26
v4-0002-Use-64-bit-multixact-offsets.patchapplication/octet-stream; name=v4-0002-Use-64-bit-multixact-offsets.patchDownload+11-170
#16Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#15)
Re: POC: make mxidoff 64 bits

After a bit of thought, I've realized that to be conservative here is the
way to go.

We can reuse a maximum of existing logic. I mean, we can remove offset
wraparound "error logic" and reuse "warning logic". But set the threshold
for "warning logic" to a much higher value. For now, I choose 2^32-1. In
other world, legit logic, in my view, here would be to trigger autovacuum
if the number of offsets (i.e. difference nextOffset - oldestOffset)
exceeds 2^32-1. PFA patch set.

--
Best regards,
Maxim Orlov.

Attachments:

v5-0002-Use-64-bit-multixact-offsets.patchapplication/octet-stream; name=v5-0002-Use-64-bit-multixact-offsets.patchDownload+11-170
v5-0003-Make-pg_upgrade-convert-multixact-offsets.patchapplication/octet-stream; name=v5-0003-Make-pg_upgrade-convert-multixact-offsets.patchDownload+390-5
v5-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patchapplication/octet-stream; name=v5-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patchDownload+14-110
v5-0001-Use-64-bit-format-output-for-multixact-offsets.patchapplication/octet-stream; name=v5-0001-Use-64-bit-format-output-for-multixact-offsets.patchDownload+31-26
#17wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#16)
Re: POC: make mxidoff 64 bits

HI Maxim Orlov

After a bit of thought, I've realized that to be conservative here is the

way to go.

We can reuse a maximum of existing logic. I mean, we can remove offset

wraparound "error logic" and reuse "warning logic". But set the threshold
for "warning >logic" to a much higher value. For now, I choose 2^32-1. In
other world, legit logic, in my view, here would be to trigger autovacuum
if the number of offsets (i.e. >difference nextOffset - oldestOffset)
exceeds 2^32-1. PFA patch set.
good point ,Couldn't agree with you more. xid64 is the solution to the
wraparound problem,The previous error log is no longer meaningful ,But we
might want to refine the output waring log a little(For example, checking
the underlying reasons why age has been increasing),Though we don't have to
worry about xid wraparound

+/*
+ * Multixact members warning threshold.
+ *
+ * If difference bettween nextOffset and oldestOffset exceed this value, we
+ * trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Can we refine this annotation a bit? for example
+/*
+ * Multixact members warning threshold.
+ *
+ * If difference bettween nextOffset and oldestOffset exceed this value, we
+ * trigger autovacuum in order to release the disk space ,reduce table
bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)

Thanks

Maxim Orlov <orlovmg@gmail.com> 于2024年10月23日周三 23:55写道:

Show quoted text

After a bit of thought, I've realized that to be conservative here is the
way to go.

We can reuse a maximum of existing logic. I mean, we can remove offset
wraparound "error logic" and reuse "warning logic". But set the threshold
for "warning logic" to a much higher value. For now, I choose 2^32-1. In
other world, legit logic, in my view, here would be to trigger autovacuum
if the number of offsets (i.e. difference nextOffset - oldestOffset)
exceeds 2^32-1. PFA patch set.

--
Best regards,
Maxim Orlov.

#18Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#17)
Re: POC: make mxidoff 64 bits

On Fri, 25 Oct 2024 at 06:39, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

+ * Multixact members warning threshold.
+ *
+ * If difference bettween nextOffset and oldestOffset exceed this value,
we
+ * trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Can we refine this annotation a bit? for example

Thank you, fixed.

Sorry for a late reply. There was a problem in upgrade with offset
wraparound. Here is a fixed version. Test also added. I decide to use my
old patch to set a non-standard multixacts for the old cluster, fill it
with data and do pg_upgrade.

Here is how to test. All the patches are for 14e87ffa5c543b5f3 master
branch.
1) Get the 14e87ffa5c543b5f3 master branch apply patches
0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patch and
0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patch
2) Get the 14e87ffa5c543b5f3 master branch in a separate directory and
apply v6 patch set.
3) Build two branches.
4) Use ENV oldinstall to run the test: PROVE_TESTS=t/005_mxidoff.pl
oldinstall=/home/orlov/proj/pgsql-new PG_TEST_NOCLEAN=1 make check -C
src/bin/pg_upgrade/

Maybe, I'll make a shell script to automate this steps if required.

--
Best regards,
Maxim Orlov.

Attachments:

v6-0001-Use-64-bit-multixact-offsets.patchapplication/octet-stream; name=v6-0001-Use-64-bit-multixact-offsets.patchDownload+11-170
v6-0002-Make-pg_upgrade-convert-multixact-offsets.patchapplication/octet-stream; name=v6-0002-Make-pg_upgrade-convert-multixact-offsets.patchDownload+572-7
v6-0004-TEST-lower-SLRU_PAGES_PER_SEGMENT-set-bump-catver.patchapplication/octet-stream; name=v6-0004-TEST-lower-SLRU_PAGES_PER_SEGMENT-set-bump-catver.patchDownload+4-5
v6-0005-TEST-initdb-option-to-initialize-cluster-with-non.patchapplication/octet-stream; name=v6-0005-TEST-initdb-option-to-initialize-cluster-with-non.patchDownload+382-15
v6-0003-Get-rid-of-MultiXactMemberFreezeThreshold-call.patchapplication/octet-stream; name=v6-0003-Get-rid-of-MultiXactMemberFreezeThreshold-call.patchDownload+15-110
v6-0006-TEST-add-basic-mxidoff64-tests-005_mxidoff.pl.patchapplication/octet-stream; name=v6-0006-TEST-add-basic-mxidoff64-tests-005_mxidoff.pl.patchDownload+389-1
0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patchapplication/octet-stream; name=0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patchDownload+1-2
0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patchapplication/octet-stream; name=0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patchDownload+460-33
#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#18)
Re: POC: make mxidoff 64 bits

On 08/11/2024 20:10, Maxim Orlov wrote:

Sorry for a late reply. There was a problem in upgrade with offset
wraparound. Here is a fixed version. Test also added. I decide to use my
old patch to set a non-standard multixacts for the old cluster, fill it
with data and do pg_upgrade.

The wraparound logic is still not correct. To test, I created a cluster
where multixids have wrapped around, so that:

$ ls -l data-old/pg_multixact/offsets/
total 720
-rw------- 1 heikki heikki 212992 Nov 12 01:11 0000
-rw-r--r-- 1 heikki heikki 262144 Nov 12 00:55 FFFE
-rw------- 1 heikki heikki 262144 Nov 12 00:56 FFFF

After running pg_upgrade:

$ ls -l data-new/pg_multixact/offsets/
total 1184
-rw------- 1 heikki heikki 155648 Nov 12 01:12 0001
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFD
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFE
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFF
-rw------- 1 heikki heikki 262144 Nov 12 01:11 20000
-rw------- 1 heikki heikki 155648 Nov 12 01:11 20001

That's not right. The segments 20000 and 20001 were created by the new
pg_upgrade conversion code from old segment '0000'. But multixids are
still 32-bit values, so after segment 1FFFF, you should still wrap
around to 0000. The new segments should be '0000' and '0001'. The
segment '0001' is created when postgres is started after upgrade, but
it's created from scratch and doesn't contain the upgraded values.

When I try to select from a table after upgrade that contains
post-wraparound multixids:

TRAP: failed Assert("offset != 0"), File:
"../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386

On a different note, I'm surprised you're rewriting member segments from
scratch, parsing all the individual member groups and writing them out
again. There's no change to the members file format, except for the
numbering of the files, so you could just copy the files under the new
names without paying attention to the contents. It's not wrong to parse
them in detail, but I'd assume that it would be simpler not to.

Here is how to test. All the patches are for 14e87ffa5c543b5f3 master
branch.
1) Get the 14e87ffa5c543b5f3 master branch apply patches 0001-Add-
initdb-option-to-initialize-cluster-with-non-sta.patch and 0002-TEST-
lower-SLRU_PAGES_PER_SEGMENT.patch
2) Get the 14e87ffa5c543b5f3 master branch in a separate directory and
apply v6 patch set.
3) Build two branches.
4) Use ENV oldinstall to run the test: PROVE_TESTS=t/005_mxidoff.pl
<http://005_mxidoff.pl&gt; oldinstall=/home/orlov/proj/pgsql-new
PG_TEST_NOCLEAN=1 make check -C src/bin/pg_upgrade/

Maybe, I'll make a shell script to automate this steps if required.

Yeah, I think we need something to automate this. I did the testing
manually. I used the attached python script to consume multixids faster,
but it's still tedious.

I used pg_resetwal to quickly create a cluster that's close to multixid
wrapround:

initdb -D data
pg_resetwal -D data -m 4294900001,4294900000
dd if=/dev/zero of=data/pg_multixact/offsets/FFFE bs=8192 count=32

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

multixids.pytext/x-python; charset=UTF-8; name=multixids.pyDownload
#20Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#19)
Re: POC: make mxidoff 64 bits

On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

The wraparound logic is still not correct.

Yep, my fault. I forget to reset segment counter if wraparound is happened.
Fixed.

When I try to select from a table after upgrade that contains

post-wraparound multixids:

TRAP: failed Assert("offset != 0"), File:
"../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386

The problem was in converting offset segments. The new_entry index should
also bypass the invalid offset (0) value. Fixed.

On a different note, I'm surprised you're rewriting member segments from
scratch, parsing all the individual member groups and writing them out
again. There's no change to the members file format, except for the
numbering of the files, so you could just copy the files under the new
names without paying attention to the contents. It's not wrong to parse
them in detail, but I'd assume that it would be simpler not to.

Yes, at the beginning I also thought that it would be possible to get by
with simple copying. But in case of wraparound, we must "bypass" invalid
zero offset value. See, old 32 bit offsets a wrapped at 2^32, thus 0 values
appears in multixact.c So, they must be handled. Bypass, in fact. When we
are switched to the 64-bit offsets, we have two options:
1). Bypass every ((uint32) offset == 0) value in multixact.c;
2). Convert members and bypass invalid value once.

The first options seem too weird for me. So, we have to repack members and
bypass invalid value.

All patches are for master@38c18710b37a2d

--
Best regards,
Maxim Orlov.

Attachments:

v7-0001-Use-64-bit-format-output-for-multixact-offsets.patchapplication/octet-stream; name=v7-0001-Use-64-bit-format-output-for-multixact-offsets.patchDownload+31-26
v7-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patchapplication/octet-stream; name=v7-0004-Get-rid-of-MultiXactMemberFreezeThreshold-call.patchDownload+15-110
v7-0005-TEST-bump-catver.patchapplication/octet-stream; name=v7-0005-TEST-bump-catver.patchDownload+2-3
v7-0003-Make-pg_upgrade-convert-multixact-offsets.patchapplication/octet-stream; name=v7-0003-Make-pg_upgrade-convert-multixact-offsets.patchDownload+583-7
v7-0002-Use-64-bit-multixact-offsets.patchapplication/octet-stream; name=v7-0002-Use-64-bit-multixact-offsets.patchDownload+11-170
#21Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#20)
#23Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#22)
#24Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#23)
#25Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#24)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#25)
#27Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#26)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#27)
#29Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#28)
#30Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#29)
#31wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#30)
#32Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#31)
#33Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#32)
#34wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#33)
#35Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#33)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#35)
#37wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Heikki Linnakangas (#36)
#38Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#37)
#39Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Maxim Orlov (#38)
#40Maxim Orlov
orlovmg@gmail.com
In reply to: Ashutosh Bapat (#39)
#41Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#40)
#42Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#41)
#43Alexander Korotkov
aekorotkov@gmail.com
In reply to: Maxim Orlov (#42)
#44Maxim Orlov
orlovmg@gmail.com
In reply to: Alexander Korotkov (#43)
#45wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#44)
#46Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#45)
#47wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#46)
#48Maxim Orlov
orlovmg@gmail.com
In reply to: wenhui qiu (#47)
#49Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#48)
#50Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#49)
#51Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#50)
#52Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#51)
#53Alexander Korotkov
aekorotkov@gmail.com
In reply to: Maxim Orlov (#52)
#54Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#52)
#55Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#54)
#56Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#55)
#57Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#56)
#58Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#35)
#59Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#58)
#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#59)
#61Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#60)
#62wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Heikki Linnakangas (#61)
#63Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: wenhui qiu (#62)
#64Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#63)
#65Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Heikki Linnakangas (#61)
#66Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Ashutosh Bapat (#65)
#67Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#66)
#68Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#67)
#69Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#68)
#70Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Heikki Linnakangas (#66)
#71Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#70)
#72Maxim Orlov
orlovmg@gmail.com
In reply to: Ashutosh Bapat (#71)
#73Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Maxim Orlov (#72)
#74Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#67)
#75Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#74)
#76Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#75)
#77Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#76)
#78Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Ashutosh Bapat (#73)
#79wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Maxim Orlov (#76)
#80Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#77)
#81Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Heikki Linnakangas (#77)
#82Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#69)
#83Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#82)
#84Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#70)
#85Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Ashutosh Bapat (#84)
#86Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#85)
#87Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Heikki Linnakangas (#85)
#88Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Heikki Linnakangas (#86)
#89Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Ashutosh Bapat (#87)
#90Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#89)
#91Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Heikki Linnakangas (#90)
#92Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Ashutosh Bapat (#91)
#93Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#92)
#94Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#92)
#95Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#94)
#96Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#95)
#97Alexander Lakhin
exclusion@gmail.com
In reply to: Heikki Linnakangas (#89)
#98Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#96)
#99Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Lakhin (#97)
#100Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#99)
#101zengman
zengman@halodbtech.com
In reply to: Heikki Linnakangas (#100)
#102Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: zengman (#101)
#103zengman
zengman@halodbtech.com
In reply to: Heikki Linnakangas (#102)
#104Chao Li
li.evan.chao@gmail.com
In reply to: Heikki Linnakangas (#102)
#105Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#100)
#106Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Orlov (#105)
#107Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#105)
#108Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#107)
#109Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#108)
#110Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Maxim Orlov (#109)
#111Maxim Orlov
orlovmg@gmail.com
In reply to: Maxim Orlov (#109)