POC: make mxidoff 64 bits
Hi!
I've been trying to introduce 64-bit transaction identifiers to
Postgres for quite a while [0]. This implies, of course, an enormous
amount of change across various modules. As a consequence, the
patch set has become too big to be committed “at once”.
The obvious solution was to try to split the patch set into smaller pieces.
But here a new challenge arises: not every one of these parts makes
Postgres better on its own. Actually, even switching to a
FullTransactionId in PGPROC has the disadvantage of increasing WAL
size [1].
In fact, I believe we're dealing with a chicken-and-egg problem here:
we cannot commit the full patch set, since it is too big to handle, and
we cannot commit its parts, since they only make sense together and do
not improve Postgres on their own.
But it's not that bad. Since commit 4ed8f0913bfdb5f, added in [3], we
are able to use 64-bit indexing for SLRUs.
PROPOSAL
Make multixact offsets 64 bit.
RATIONALE
It is not very difficult to overflow the 32-bit mxidoff, since an offset
entry is consumed for every unique combination of transactions locking a
tuple, including the XIDs and their respective flags. And when a
transaction is added to an existing multixact, its member set is rewritten
at a new offset. In other words, it is possible to overflow the multixact
offsets without overflowing the multixacts themselves and/or ordinary
transactions. I believe there was some discussion about this on the
hackers mailing list, but I could not find the links now.
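To illustrate the asymmetry, here is the accounting from
GetNewMultiXactId(), paraphrased from multixact.c (not an exact quote):
one multixact consumes a single multixact ID but as many offsets as it
has members.

	result = MultiXactState->nextMXact;		/* advances by one */
	*offset = MultiXactState->nextOffset;	/* advances by nmembers */

	(MultiXactState->nextMXact)++;
	MultiXactState->nextOffset += nmembers;

So with wide multixacts, nextOffset can wrap its 32 bits long before
nextMXact does.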
PFA the patch. This is a WIP version; the upgrade machinery will be added
later.
As always, any opinions on the subject are very welcome!
[0]: /messages/by-id/CACG=ezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe=pyyjVWA@mail.gmail.com
[1]: /messages/by-id/CACG=ezY7msw+jip=rtfvnfz051dRqz4s-diuO46v3rAoAE0T0g@mail.gmail.com
[3]: /messages/by-id/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com
--
Best regards,
Maxim Orlov.
Attachments:
On 23/04/2024 11:23, Maxim Orlov wrote:
PROPOSAL
Make multixact offsets 64 bit.
+1, this is a good next step and useful regardless of 64-bit XIDs.
@@ -156,7 +148,7 @@
  ((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))

 /* page in which a member is to be found */
-#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
+#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
 #define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)

 /* Location (byte offset within page) of flag word for a given member */
This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.
I think there are some more like that, MXOffsetToFlagsBitShift for example.
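For reference, the neighbouring macros follow the same pattern
(paraphrased from multixact.c from memory, so the exact definitions may
differ; note that the argument is really a MultiXactOffset despite the
'xid' name):

#define MXOffsetToFlagsOffset(xid) \
	((((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_MEMBERGROUP) % \
	  (TransactionId) MULTIXACT_MEMBERGROUPS_PER_PAGE) * \
	 (TransactionId) MULTIXACT_MEMBERGROUP_SIZE)
#define MXOffsetToFlagsBitShift(xid) \
	(((xid) % (TransactionId) MULTIXACT_MEMBERS_PER_MEMBERGROUP) * \
	 MXACT_MEMBER_BITS_PER_XACT)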
--
Heikki Linnakangas
Neon (https://neon.tech)
On 23 Apr 2024, at 11:23, Maxim Orlov <orlovmg@gmail.com> wrote:
Make multixact offsets 64 bit.
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("multixact \"members\" limit exceeded"),
Personally, I'd be happy with this! We had some incidents where the only mitigation was vacuum settings tweaking.
BTW, as a side note... I see lots of casts to (unsigned long long); can't we just cast to MultiXactOffset?
Best regards, Andrey Borodin.
Hi Maxim Orlov
Thank you so much for your tireless work on this. Increasing the WAL
size by a few bytes should have very little impact with today's disk
performance. (The WAL generated for logical replication of this feature
also grows quite a bit; logical replication is a milestone feature, and
the community keeps improving it.) I believe removing the troublesome
PostgreSQL transaction ID wraparound would also be a milestone feature;
adding a few bytes is worth it!
Best regards
On Tue, 23 Apr 2024 at 12:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is really a bug fix. It didn't matter when TransactionId and
MultiXactOffset were both typedefs of uint32, but it was always wrong.
The argument name 'xid' is also misleading.
I think there are some more like that, MXOffsetToFlagsBitShift for example.
Yeah, I always thought so too. I believe this is just copy-paste. Do you
mean it is worth creating a separate CF entry for these fixes?
On Tue, 23 Apr 2024 at 16:03, Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:
BTW, as a side note... I see lots of casts to (unsigned long long); can't
we just cast to MultiXactOffset?
Actually, the first versions of the 64-bit XID patch set had such casts to
the types TransactionId, MultiXactId and so on. But after some discussion,
we switched to the unsigned long long cast. Unfortunately, I could not
find an exact link to that discussion. On the other hand, such casting is
already used throughout the code. So, just for the sake of consistency, I
would like to stay with these casts.
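For example, in the style already used throughout the tree (an
illustrative line, not a quote from the patch):

elog(DEBUG1, "multixact: next offset is %llu",
	 (unsigned long long) MultiXactState->nextOffset);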
On Tue, 23 Apr 2024 at 16:03, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Hi Maxim Orlov
Thank you so much for your tireless work on this. Increasing the WAL
size by a few bytes should have very little impact with today's disk
performance. (The WAL generated for logical replication of this feature
also grows quite a bit; logical replication is a milestone feature, and
the community keeps improving it.) I believe removing the troublesome
PostgreSQL transaction ID wraparound would also be a milestone feature;
adding a few bytes is worth it!
I agree 100%. Maybe I should return to this approach and find some
benefits of having FXIDs in WAL.
Hi!
Sorry for the delay; I was a bit busy last month. Anyway, here is my
proposal for making multixact offsets 64-bit.
The patch set consists of three parts:
0001 - making user output of offsets 64-bit ready;
0002 - making offsets 64-bit;
0003 - provide 32-to-64-bit conversion in pg_upgrade.
I'm pretty sure this is just the beginning of the conversation, so any
opinions and reviews, as always, are very welcome!
--
Best regards,
Maxim Orlov.
Attachments:
Here is a rebase. Apparently I'll have to do this often, since
CATALOG_VERSION_NO is changed in the patch.
--
Best regards,
Maxim Orlov.
Attachments:
On Tue, Sep 3, 2024 at 4:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:
Here is a rebase. Apparently I'll have to do this often, since CATALOG_VERSION_NO is changed in the patch.
I don't think you need to maintain the CATALOG_VERSION_NO change in your
patch, for the exact reason you have mentioned: the patch will get a
conflict each time CATALOG_VERSION_NO is advanced. It's the responsibility
of the committer to advance CATALOG_VERSION_NO when needed.
------
Regards,
Alexander Korotkov
Supabase
On Tue, 3 Sept 2024 at 16:32, Alexander Korotkov <aekorotkov@gmail.com>
wrote:
I don't think you need to maintain the CATALOG_VERSION_NO change in your
patch, for the exact reason you have mentioned: the patch will get a
conflict each time CATALOG_VERSION_NO is advanced. It's the responsibility
of the committer to advance CATALOG_VERSION_NO when needed.
OK, I got it. My intention here was to help test the patch: if someone
wants to have a look at it, they won't need to make changes in the code.
In the next iteration, I'll remove the CATALOG_VERSION_NO change.
--
Best regards,
Maxim Orlov.
Here is v3. I removed the CATALOG_VERSION_NO change, so this should be
done by the actual committer.
--
Best regards,
Maxim Orlov.
Attachments:
Hi, Maxim!
Previously we accessed offsets in shared MultiXactState without locks, as
a 32-bit read is always atomic. But I'm not sure that still holds when the
offset becomes 64-bit.
E.g. GetNewMultiXactId():
nextOffset = MultiXactState->nextOffset;
is outside the lock.
There might be other places we do the same as well.
Regards,
Pavel Borisov
Supabase
On Thu, 12 Sept 2024 at 16:09, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Maxim!
Previously we accessed offsets in shared MultiXactState without locks, as
a 32-bit read is always atomic. But I'm not sure that still holds when the
offset becomes 64-bit.
E.g. GetNewMultiXactId():
nextOffset = MultiXactState->nextOffset;
is outside the lock. There might be other places we do the same as well.
I think the replacement of plain assignments by
pg_atomic_read_u64/pg_atomic_write_u64 would be sufficient.
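A minimal sketch of what I mean, assuming nextOffset were converted to
pg_atomic_uint64 in MultiXactStateData (field and lock names as in
multixact.c, the rest hypothetical):

/* unlocked read, e.g. in GetNewMultiXactId() */
nextOffset = pg_atomic_read_u64(&MultiXactState->nextOffset);

/* the write, still done while holding MultiXactGenLock exclusively */
pg_atomic_write_u64(&MultiXactState->nextOffset, nextOffset + nmembers);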
(The same, I think, is needed for the patchset [1].)
[1]: /messages/by-id/CAJ7c6TMvPz8q+nC=JoKniy7yxPzQYcCTnNFYmsDP-nnWsAOJ2g@mail.gmail.com
Regards,
Pavel Borisov
On 2024-Sep-12, Pavel Borisov wrote:
Hi, Maxim!
Previously we accessed offsets in shared MultiXactState without locks, as
a 32-bit read is always atomic. But I'm not sure that still holds when the
offset becomes 64-bit.
E.g. GetNewMultiXactId():
nextOffset = MultiXactState->nextOffset;
is outside the lock.
Good thought. But fortunately I think it's not a problem. The one you
mention is with MultiXactGenLock held in shared mode -- and that works OK,
as the assignment (in line 1263 at the bottom of the same routine) is
done with the exclusive lock held.
--
Álvaro Herrera        Breisgau, Deutschland - https://www.EnterpriseDB.com/
On 07/09/2024 07:36, Maxim Orlov wrote:
Here is v3.
MultiXactMemberFreezeThreshold looks quite bogus now. Now that
MaxMultiXactOffset==2^64-1, you cannot get anywhere near the
MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD
values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I
guess it would still be useful to trigger autovacuum if multixact
members grow large though, to release the disk space, even if you can't
run out of members as such anymore. What should the logic for that look
like?
I'd love to see some tests for the pg_upgrade code. Something like a
little perl script to generate test clusters with different wraparound
scenarios etc. using the old version, and a TAP test to run pg_upgrade
on them and verify that queries on the upgraded cluster works correctly.
We don't have tests like that in the repository today, and I don't know
if we'd want to commit these permanently either, but it would be highly
useful now as a one-off thing, to show that the code works.
On upgrade, are there really no changes required to
pg_multixact/members? I imagined that the segment files would need to be
renamed around wraparound, so that if you previously had files like this:
pg_multixact/members/FFFE
pg_multixact/members/FFFF
pg_multixact/members/0000
pg_multixact/members/0001
after upgrade you would need to have:
pg_multixact/members/00000000FFFE
pg_multixact/members/00000000FFFF
pg_multixact/members/000000010000
pg_multixact/members/000000010001
Thanks for working on this!
--
Heikki Linnakangas
Neon (https://neon.tech)
On Tue, 22 Oct 2024 at 12:43, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
MultiXactMemberFreezeThreshold looks quite bogus now. Now that
MaxMultiXactOffset==2^64-1, you cannot get anywhere near the
MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD
values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I
guess it would still be useful to trigger autovacuum if multixact
members grow large though, to release the disk space, even if you can't
run out of members as such anymore. What should the logic for that look
like?
Yep, you're totally correct. The MultiXactMemberFreezeThreshold call is
not necessary anymore and can be safely removed.
I made this a separate commit in v4. But, as you rightly say, it would be
useful to trigger autovacuum in some cases. The obvious place for this
machinery is GetNewMultiXactId. I imagine it like "if nextOff - oldestOff
> threshold, kick autovac". So the question is: what kind of threshold do
we want here? Is it a hard-coded define or a GUC? If it is a (32-bit) GUC,
what value should it have?
And there is one other issue I feel a little regretful about: we still
must hold MultiXactGenLock in order to track oldestOffset for the
"nextOff - oldestOff" calculation.
I'd love to see some tests for the pg_upgrade code. Something like a
little perl script to generate test clusters with different wraparound
scenarios etc.
Agreed. I'll address this as soon as I can.
--
Best regards,
Maxim Orlov.
Attachments:
After a bit of thought, I've realized that being conservative here is the
way to go.
We can reuse most of the existing logic. I mean, we can remove the offset
wraparound "error logic" and reuse the "warning logic", but set the
threshold for the "warning logic" to a much higher value. For now, I chose
2^32-1. In other words, the legitimate logic here, in my view, would be to
trigger autovacuum if the number of offsets (i.e. the difference
nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.
--
Best regards,
Maxim Orlov.
Attachments:
Hi Maxim Orlov,
After a bit of thought, I've realized that being conservative here is the
way to go.
We can reuse most of the existing logic. I mean, we can remove the offset
wraparound "error logic" and reuse the "warning logic", but set the
threshold for the "warning logic" to a much higher value. For now, I chose
2^32-1. In other words, the legitimate logic here, in my view, would be to
trigger autovacuum if the number of offsets (i.e. the difference
nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set.
Good point, couldn't agree with you more. XID64 is the solution to the
wraparound problem, so the previous error log is no longer meaningful. But
we might want to refine the output of the warning log a little (for
example, checking the underlying reasons why the age has been increasing),
even though we no longer have to worry about XID wraparound.
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Can we refine this comment a bit? For example:
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release disk space and reduce table
+ * bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
Thanks
On Fri, 25 Oct 2024 at 06:39, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
+/*
+ * Multixact members warning threshold.
+ *
+ * If the difference between nextOffset and oldestOffset exceeds this value, we
+ * trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)

Can we refine this comment a bit? For example:
Thank you, fixed.
Sorry for the late reply. There was a problem in upgrade with offset
wraparound. Here is a fixed version, with a test added. I decided to use
my old patch to set up non-standard multixacts for the old cluster, fill
it with data, and do pg_upgrade.
Here is how to test. All the patches are for 14e87ffa5c543b5f3 master
branch.
1) Check out master at 14e87ffa5c543b5f3 and apply patches
0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patch and
0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patch.
2) Check out the same master branch in a separate directory and apply the
v6 patch set.
3) Build both branches.
4) Use ENV oldinstall to run the test: PROVE_TESTS=t/005_mxidoff.pl
oldinstall=/home/orlov/proj/pgsql-new PG_TEST_NOCLEAN=1 make check -C
src/bin/pg_upgrade/
Maybe I'll make a shell script to automate these steps if required.
--
Best regards,
Maxim Orlov.
Attachments:
On 08/11/2024 20:10, Maxim Orlov wrote:
Sorry for the late reply. There was a problem in upgrade with offset
wraparound. Here is a fixed version, with a test added. I decided to use
my old patch to set up non-standard multixacts for the old cluster, fill
it with data, and do pg_upgrade.
The wraparound logic is still not correct. To test, I created a cluster
where multixids have wrapped around, so that:
$ ls -l data-old/pg_multixact/offsets/
total 720
-rw------- 1 heikki heikki 212992 Nov 12 01:11 0000
-rw-r--r-- 1 heikki heikki 262144 Nov 12 00:55 FFFE
-rw------- 1 heikki heikki 262144 Nov 12 00:56 FFFF
After running pg_upgrade:
$ ls -l data-new/pg_multixact/offsets/
total 1184
-rw------- 1 heikki heikki 155648 Nov 12 01:12 0001
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFD
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFE
-rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFF
-rw------- 1 heikki heikki 262144 Nov 12 01:11 20000
-rw------- 1 heikki heikki 155648 Nov 12 01:11 20001
That's not right. The segments 20000 and 20001 were created by the new
pg_upgrade conversion code from old segment '0000'. But multixids are
still 32-bit values, so after segment 1FFFF, you should still wrap
around to 0000. The new segments should be '0000' and '0001'. The
segment '0001' is created when postgres is started after upgrade, but
it's created from scratch and doesn't contain the upgraded values.
When I try to select from a table after upgrade that contains
post-wraparound multixids:
TRAP: failed Assert("offset != 0"), File:
"../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386
On a different note, I'm surprised you're rewriting member segments from
scratch, parsing all the individual member groups and writing them out
again. There's no change to the members file format, except for the
numbering of the files, so you could just copy the files under the new
names without paying attention to the contents. It's not wrong to parse
them in detail, but I'd assume that it would be simpler not to.
Here is how to test. All the patches are for 14e87ffa5c543b5f3 master
branch.
1) Check out master at 14e87ffa5c543b5f3 and apply patches
0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patch and
0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patch.
2) Check out the same master branch in a separate directory and apply the
v6 patch set.
3) Build both branches.
4) Use the ENV oldinstall to run the test: PROVE_TESTS=t/005_mxidoff.pl
oldinstall=/home/orlov/proj/pgsql-new PG_TEST_NOCLEAN=1 make check -C
src/bin/pg_upgrade/

Maybe I'll make a shell script to automate these steps if required.
Yeah, I think we need something to automate this. I did the testing
manually. I used the attached python script to consume multixids faster,
but it's still tedious.
I used pg_resetwal to quickly create a cluster that's close to multixid
wraparound:
initdb -D data
pg_resetwal -D data -m 4294900001,4294900000
dd if=/dev/zero of=data/pg_multixact/offsets/FFFE bs=8192 count=32
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
The wraparound logic is still not correct.
Yep, my fault. I forgot to reset the segment counter when a wraparound
happens. Fixed.
When I try to select from a table after upgrade that contains
post-wraparound multixids:
TRAP: failed Assert("offset != 0"), File:
"../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386
The problem was in converting offset segments. The new_entry index should
also bypass the invalid offset (0) value. Fixed.
On a different note, I'm surprised you're rewriting member segments from
scratch, parsing all the individual member groups and writing them out
again. There's no change to the members file format, except for the
numbering of the files, so you could just copy the files under the new
names without paying attention to the contents. It's not wrong to parse
them in detail, but I'd assume that it would be simpler not to.
Yes, at the beginning I also thought that it would be possible to get by
with simple copying. But in case of wraparound, we must "bypass" the
invalid zero offset value. See, old 32-bit offsets wrapped at 2^32, thus 0
values appear in multixact.c, so they must be handled -- bypassed, in
fact. When we switch to 64-bit offsets, we have two options:
1) Bypass every ((uint32) offset == 0) value in multixact.c;
2) Convert the members and bypass the invalid value once.
The first option seems too weird to me. So, we have to repack the members
and bypass the invalid value.
All patches are for master@38c18710b37a2d
--
Best regards,
Maxim Orlov.
Attachments:
Here are the test scripts.
The generate.sh script is used to generate a data dir with multiple
clusters in it. This script calls multixids.py in order to generate data.
If you are not using the system psql, consider using the LD_LIBRARY_PATH
env to specify the path to the lib directory.
OLDBIN=/.../pgsql-new ./generate.sh
Then the test.sh is used to run various upgrades.
OLDBIN=/.../pgsql-old NEWBIN=/.../pgsql-new ./test.sh
I hope that helps!
--
Best regards,
Maxim Orlov.
Attachments:
On 13/11/2024 17:44, Maxim Orlov wrote:
On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On a different note, I'm surprised you're rewriting member segments from
scratch, parsing all the individual member groups and writing them out
again. There's no change to the members file format, except for the
numbering of the files, so you could just copy the files under the new
names without paying attention to the contents. It's not wrong to parse
them in detail, but I'd assume that it would be simpler not to.

Yes, at the beginning I also thought that it would be possible to get by
with simple copying. But in case of wraparound, we must "bypass" the
invalid zero offset value. See, old 32-bit offsets wrapped at 2^32, thus 0
values appear in multixact.c, so they must be handled -- bypassed, in
fact. When we switch to 64-bit offsets, we have two options:
1) Bypass every ((uint32) offset == 0) value in multixact.c;
2) Convert the members and bypass the invalid value once.
The first option seems too weird to me. So, we have to repack the members
and bypass the invalid value.
Hmm, so if I understand correctly, this is related to how we determine
the length of the members array, by looking at the next multixid's
offset. This is explained in GetMultiXactIdMembers:
/*
* Find out the offset at which we need to start reading MultiXactMembers
* and the number of members in the multixact. We determine the latter as
* the difference between this multixact's starting offset and the next
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
* no next one to look at. In this case the nextOffset value we just
* saved is the correct endpoint.
*
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
* the offset entry for that multixact exists (because GetNewMultiXactId
* won't release MultiXactGenLock until it does) but contains zero
* (because we are careful to pre-zero offset pages). Because
* GetNewMultiXactId will never return zero as the starting offset for a
* multixact, when we read zero as the next multixact's offset, we know we
* have this case. We handle this by sleeping on the condition variable
* we have just for this; the process in charge will signal the CV as soon
* as it has finished writing the multixact offset.
*
* 3. Because GetNewMultiXactId increments offset zero to offset one to
* handle case #2, there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
* member slot wasn't filled, it'll contain zero, and zero isn't a valid
* transaction ID so it can't be a multixact member. Therefore, if we
* read a zero from the members array, just ignore it.
*
* This is all pretty messy, but the mess occurs only in infrequent corner
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
With 64-bit offsets, can we assume that it never wraps around? We often
treat 2^64 as "large enough that we'll never run out", e.g. LSNs are
also assumed to never wrap around. I think that would be a safe
assumption here too.
If we accept that, we don't need to worry about case 3 anymore. But if
we upgrade wrapped-around members files by just renaming them, there
could still be a members array where we had skipped offset 0, and
reading that after the upgrade might get confused. We could continue to
ignore a 0 XID in the members array like the comment says; I think that
would be enough. But yeah, maybe it's better to bite the bullet in
pg_upgrade and squeeze those out.
Does your upgrade test suite include case 3, where the next multixact's
offset is 1?
Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
comments and checks that talk about binary-upgraded values too that we
can hopefully clean up now.
If we are to parse the member segments in detail in upgrade anyway, I'd
be tempted to make some further changes / optimizations:
- You could leave out all locking XID members in upgrade, because
they're not relevant after upgrade any more (all the XIDs will be
committed or aborted and have released the locks; we require prepared
transactions to be completed before upgrading too). It'd be enough to
include actual UPDATE/DELETE XIDs.
- The way we determine the length of the members array by looking at the
next multixid's offset is a bit complicated. We could have one extra
flag per XID in the members to indicate "this is the last member of this
multixid". That could either to replace the current mechanism of looking
at the next offset, or be just an additional cross-check.
- Do we still like the "group" representation, with 4 bytes of flags
followed by 4 XIDs (the current layout is sketched after this list)? I
wonder if it'd be better to just store 5 bytes per XID unaligned.
- A more radical idea: There can be only one updating XID in one
multixid. We could store that directly in the offsets SLRU, and keep
only the locking XIDs in members. That way, the members SLRU would
become less critical; it could be safely reset on crash for example
(except for prepared transactions, which could still be holding locks,
but it'd still be less serious). Separating correctness-critical data
from more ephemeral state is generally a good idea.
I'm not insisting on any of these changes, just some things that might
be worth considering if we're rewriting the SLRUs on upgrade anyway.
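For reference, the current "group" representation mentioned in the third
item above is, paraphrased from multixact.c (exact definitions may
differ):

/* four members per group: a 32-bit flags word followed by four XIDs */
#define MULTIXACT_MEMBERS_PER_MEMBERGROUP	4
#define MULTIXACT_FLAGBYTES_PER_GROUP		4
#define MULTIXACT_MEMBERGROUP_SIZE \
	(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + \
	 MULTIXACT_FLAGBYTES_PER_GROUP)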
--
Heikki Linnakangas
Neon (https://neon.tech)
On Fri, 15 Nov 2024 at 14:06, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Hmm, so if I understand correctly, this is related to how we determine
the length of the members array, by looking at the next multixid's
offset. This is explained in GetMultiXactIdMembers:
Correct.
If we accept that, we don't need to worry about case 3 anymore. But if
we upgrade wrapped-around members files by just renaming them, there
could still be a members array where we had skipped offset 0, and
reading that after the upgrade might get confused. We could continue to
ignore a 0 XID in the members array like the comment says; I think that
would be enough. But yeah, maybe it's better to bite the bullet in
pg_upgrade and squeeze those out.
Correct; I couldn't explain it better. I'm more for squeezing those out.
Otherwise, we end up adding another hack in multixact.c, while one of the
benefits of switching to 64 bits is that it should make the XID logic more
straightforward. After all, the mxact juggling in pg_upgrade is a one-time
inconvenience.
Does your upgrade test suite include case 3, where the next multixact's
offset is 1?
Not exactly.
simple
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5927049
offset-wrap
Latest checkpoint's NextMultiXactId: 119441
Latest checkpoint's NextMultiOffset: 5591183
multi-wrap
Latest checkpoint's NextMultiXactId: 82006
Latest checkpoint's NextMultiOffset: 7408811
offset-multi-wrap
Latest checkpoint's NextMultiXactId: 52146
Latest checkpoint's NextMultiOffset: 5591183
Do you want a test case where NextMultiOffset is 1?
Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
comments and checks that talk about binary-upgraded values too that we
can hopefully clean up now.
Yes, technically we can. But this is somewhat unrelated to the offsets and
would make the patch set significantly more complicated, thus harder to
review and less likely to be committed. Again, I'm not opposing the idea;
I'm just not sure it is worth doing right now.
If we are to parse the member segments in detail in upgrade anyway, I'd
be tempted to make some further changes / optimizations:

- You could leave out all locking XID members in upgrade, because
they're not relevant after upgrade any more (all the XIDs will be
committed or aborted and have released the locks; we require prepared
transactions to be completed before upgrading too). It'd be enough to
include actual UPDATE/DELETE XIDs.

- The way we determine the length of the members array by looking at the
next multixid's offset is a bit complicated. We could have one extra
flag per XID in the members to indicate "this is the last member of this
multixid". That could either replace the current mechanism of looking
at the next offset, or be just an additional cross-check.

- Do we still like the "group" representation, with 4 bytes of flags
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per
XID unaligned.
Not really. But I would leave that for the next iteration: switching
multixacts to 64 bit. I already have some drafts for this. In any case,
we'll have to adjust pg_upgrade again. My goal is to move towards 64-bit
XIDs, but in small steps, and I plan the changes in the "group"
representation in combination with switching multixacts to 64 bit. This
seems a bit more appropriate in my view.
As for your optimization suggestions, I like them. I'm not against them,
but I'm afraid of disrupting the clarity of the logic, especially since
the algorithm is not the simplest.
--
Best regards,
Maxim Orlov.
Shame on me! I've sent an erroneous patch set: version 7 is defective.
Here is the proper version, v8, with minor refactoring in segresize.c.
Also, I renamed the bump-cat-version patch to .txt in order not to break
the cfbot.
--
Best regards,
Maxim Orlov.
Attachments:
Oops! Sorry for the noise. I must have been overworking yesterday and
messed up the working branches: v7 was the correct set and v8 was not.
Here is the correction, with an extended Perl test.
The test itself is in src/bin/pg_upgrade/t/005_offset.pl. It is rather
heavy and took about 45 minutes on my i5, with 2.7 GB of data generated.
Basically, each test here creates a cluster and fills it with multixacts.
Thus, dozens of segments are created, using two methods. One uses prepared
transactions, and it creates roughly the same number of segments for
members and for offsets. The other is based on Heikki's multixids.py and
creates more members than offsets. I've used both of these methods to
generate data that is as diverse as possible.
Here is how I test this patch set:
1. You need two pg clusters: the "old" one, i.e. without the patch set,
and the "new" one with patch set v9 applied.
2. Apply v9-0005-TEST-initdb-option-to-initialize-cluster-with-non.patch.txt
to the "old" and "new" clusters. Note, this is the only patch required
for the "old" cluster. It allows you to create a cluster with a
non-standard initial multixact and multixact offset. Unfortunately, this
patch did not arouse public interest, since it is assumed that the
pg_resetwal utility provides similar functionality. But similar does not
mean equal. See, pg_resetwal must be used after cluster init; thus, we
step into some problems with vacuum, some SLRU segments must be filled
with zeroes, and template0's datminmxid must be manually updated. So, in
my view, using this patch is justified and very handy here.
3. Also, apply all the "TEST" (0006 and 0007) patches to the "new"
cluster.
4. Build the "old" and "new" pg clusters.
5. Run the test with: PROVE_TESTS=t/005_offset.pl PG_TEST_NOCLEAN=1
oldinstall=/home/orlov/proj/OFFSET3/pgsql-old make check -s -C
src/bin/pg_upgrade/
6. In my case, it took around 45 minutes and generated roughly 2.7 GB of
data.
The "TEST" patches, of course, are for test purposes and not to be
committed.
In src/bin/pg_upgrade/t/005_offset.pl I try to cover the following cases:
- Basic sanity checks.
Here I test various initial multi and offset values (including
wraparound) and check that the appropriate segments are generated.
- pg_upgrade tests.
This is what the oldinstall ENV is for: run pg_upgrade for an old cluster
with multi and offset values just like in the previous step, i.e. with
various combinations.
- Self pg_upgrade.
--
Best regards,
Maxim Orlov.
Attachments:
Thanks for working on this!
Attached is some more cleanup on top of patch set v9, removing more dead
stuff related to wraparound. I also removed the oldestOffsetKnown
variable and related code. It was needed to deal with clusters upgraded
from buggy 9.3 and 9.4 era versions, but now that pg_upgrade will
rewrite the SLRUs, it's no longer needed.
Does the pg_upgrade code work though, if you have that buggy situation
where oldestOffsetKnown == false ?
if (!TransactionIdIsValid(*xactptr))
{
/* Corner case 3: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
After upgrade, this corner case 3 would *not* happen on offset == 0. So
looks like we're still missing test coverage for this upgrade corner case.
I still felt pretty uneasy about the pg_upgrade code. It's
complicated, and the way it rewrites offsets and members separately and a
page at a time is quite different from the normal codepaths in
multixact.c, so it's a bit hard to see if it's handling all those corner
cases the same way. I rewrote it so that it's easier to understand,
IMHO anyway. The code for reading the old format and writing the new
format is now more decoupled. The code for reading the old format is in
a separate source file, multixact_old.c. The interface to that is the
GetOldMultiXactIdSingleMember() that returns the updating member of a
given multixid, much like the GetMultiXactIdMembers() backend function.
The conversion routine calls that for each multixid, and write it back
out in the new format, one multixid at a time.
The new code now "squeezes out" locking-only XIDs, keeping only the
updating ones. Not that important, but with this new code structure, it
was trivial and even easier to do than retaining all the XIDs.
Now that the offsets are rewritten one by one, we don't need the
"special case 3" in GetMultiXactIdMembers. The upgrade process removes
that special case.
I tried to keep my changes separate from your patches in the attached
patch series. This needs some more cleanup and squashing before
committing, but I think we're getting close.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Attached is some more cleanup on top of patch set v9, removing more dead
stuff related to wraparound. I also removed the oldestOffsetKnown
variable and related code. It was needed to deal with clusters upgraded
from buggy 9.3 and 9.4 era versions, but now that pg_upgrade will
rewrite the SLRUs, it's no longer needed.
Yep, multixact.c looks correct to me. As for "XXX could use
SimpleLruTruncate()", yes, for sure.
Actually, xl_multixact_truncate.startTruncMemb is also no longer needed, is
it?
Does the pg_upgrade code work though, if you have that buggy situation
where oldestOffsetKnown == false ?

if (!TransactionIdIsValid(*xactptr))
{
	/* Corner case 3: we must be looking at unused slot zero */
	Assert(offset == 0);
	continue;
}

After upgrade, this corner case 3 would *not* happen on offset == 0. So
looks like we're still missing test coverage for this upgrade corner case.
Am I understanding correctly that you want to have a test corresponding to
the buggy 9.3 and 9.4 era versions?
Do you think we could imitate this scenario on the current master branch
like this:
1) generate a couple of offsets segments for the first table;
2) generate more segments for a second table;
3) drop first table;
4) stop pg cluster;
5) remove pg_multixact/offsets/0000
6) upgrade?
PFA, v10-0016-TEST-try-to-replicate-buggy-oldest-offset.patch
This test will fail now, for an obvious reason, but is this case a relevant
one?
--
Best regards,
Maxim Orlov.
Attachments:
On 27/12/2024 19:09, Maxim Orlov wrote:
On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Does the pg_upgrade code work though, if you have that buggy situation
where oldestOffsetKnown == false ?
...

if (!TransactionIdIsValid(*xactptr))
{
	/* Corner case 3: we must be looking at unused slot zero */
	Assert(offset == 0);
	continue;
}

After upgrade, this corner case 3 would *not* happen on offset == 0. So
looks like we're still missing test coverage for this upgrade corner
case.

Am I understanding correctly that you want to have a test corresponding
to the buggy 9.3 and 9.4 era versions?
No, those were two different things. I think there might be two things
wrong here:
1. I suspect pg_upgrade might not correctly handle the situation where
oldestOffsetKnown==false, and
2. The above assertion in "corner case 3" would not hold. It seems that
we don't have a test case for it, or it would've hit the assertion.
Now that I think about it, yes, a test case for 1. would be good too.
But I was talking about 2.
Do you think we could imitate this scenario on a current master branch
like that:
1) generate a couple of offsets segments for the first table;
2) generate more segments for a second table;
3) drop first table;
4) stop pg cluster;
5) remove pg_multixact/offsets/0000
6) upgrade?
I don't remember off the top of my head.
It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, 2 Jan 2025 at 01:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
It might be best to just refuse the upgrade if oldestOffsetKnown==false.
It's a very ancient corner case. It seems reasonable to require you to
upgrade to a newer minor version and run VACUUM before upgrading. IIRC
that sets oldestOffsetKnown.
I agree. After all, we already have a ready-made solution in the form of
vacuum, don't we?
If I understand all this multixact_old.c machinery correctly, in the case
of oldestOffsetKnown==false we should fail with "could not open file", or
the offset will be 0 in GetOldMultiXactIdSingleMember.
So, I suppose we can put an analogue of a SimpleLruDoesPhysicalPageExist
call at the beginning of GetOldMultiXactIdSingleMember. And if either
that check returns false or the corresponding offset is 0, we have to bail
out with "oldest offset does not exist, consider running vacuum before
pg_upgrade" or something. Please correct me if I'm wrong.
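A sketch of what I have in mind (all names below are assumptions, not the
actual patch API):

/* at the beginning of GetOldMultiXactIdSingleMember() */
if (!OldSlruPhysicalPageExists(&old_offsets_ctl, pageno))
	pg_fatal("oldest multixact offset segment does not exist; "
			 "run VACUUM in the old cluster before pg_upgrade\n");

offset = read_old_offset(multi);	/* hypothetical helper */
if (offset == 0)
	pg_fatal("invalid zero multixact offset; "
			 "run VACUUM in the old cluster before pg_upgrade\n");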
--
Best regards,
Maxim Orlov.
Looks like there is a bit of a pause in the discussion. Here is a small
update; consider v12.
No major changes: a rebase onto the actual master and a squash of multiple
commits to make the patch set easier to review.
AFAICS, we have reached a consensus on the core patch for switching to
64-bit offsets. The only remaining concern is more comprehensive test
coverage for pg_upgrade, isn't it?
--
Best regards,
Maxim Orlov.
Attachments:
Hi Maxim,
Looks like there is a bit of a pause in the discussion. Here is a small
update; consider v12.
No major changes: a rebase onto the actual master and a squash of multiple
commits to make the patch set easier to review.
AFAICS, we have reached a consensus on the core patch for switching to
64-bit offsets. The only remaining concern is more comprehensive test
coverage for pg_upgrade, isn't it?
Agreed. When the upgrade hits the extreme case (oldestOffsetKnown==false),
just follow the solution mentioned by Heikki Linnakangas.
Thanks
Here is a v13 version with small changes to make the cfbot happy.
--
Best regards,
Maxim Orlov.
Attachments:
Here is a rebase, v14.
--
Best regards,
Maxim Orlov.
Attachments:
Hi Maxim Orlov, Heikki Linnakangas,
Thank you for working on this. The code freeze is just a few days away. It
seems like things have been quiet for a while, but I believe implementing
XID64 is absolutely necessary. For instance, there's often concern about
performance jitter caused by frequent freezes; if XID64 is implemented,
the freeze process can be smoother.
Thanks
On 07/03/2025 13:30, Maxim Orlov wrote:
Here is a rebase, v14.
Thanks! I did some manual testing of this. I created a little helper
function to consume multixids, to test the autovacuum behavior, and
found one issue:
If you consume a lot of multixid members space, by creating lots of
multixids with a huge number of members in each, you can end up with a
very bloated members SLRU, and autovacuum is in no hurry to clean it up.
Here's what I did:
1. Installed attached test module
2. Ran "select consume_multixids(10000, 100000);" many times
3. ran:
$ du -h data/pg_multixact/members/
26G data/pg_multixact/members/
When I run "vacuum freeze; select * from pg_database;", I can see that
'datminmxid' for the current database is advanced. However, autovacuum
is in no hurry to vacuum 'template0' and 'template1', so
pg_multixact/members/ does not get truncated. Eventually, when
autovacuum_multixact_freeze_max_age is reached, it presumably will, but
you will run out of disk space before that.
There is this check for members size at the end of SetOffsetVacuumLimit():
/*
* Do we need autovacuum? If we're not sure, assume yes.
*/
return !oldestOffsetKnown ||
(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
And the caller (SetMultiXactIdLimit()) will in fact signal the
autovacuum launcher after "vacuum freeze" because of that. But
autovacuum launcher will look at the datminmxid / relminmxid values, see
that they are well within autovacuum_multixact_freeze_max_age, and do
nothing.
This is a very extreme case, but clearly the code to signal autovacuum
launcher, and the freeze age cutoff that autovacuum then uses, are not
in sync.
This patch removed MultiXactMemberFreezeThreshold(), per my suggestion,
but we threw the baby out with the bathwater. We discussed that in this
thread, but didn't come up with any solution. But ISTM we still need
something like MultiXactMemberFreezeThreshold() to trigger autovacuum
freezing if the members have grown too large.
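For reference, the removed function looked roughly like this (paraphrased
from the pre-patch multixact.c, details from memory):

int
MultiXactMemberFreezeThreshold(void)
{
	MultiXactOffset members;
	uint32		multixacts;
	uint32		victim_multixacts;
	double		fraction;

	/* If we can't determine member space utilization, assume the worst. */
	if (!ReadMultiXactCounts(&multixacts, &members))
		return 0;

	/* If member space utilization is low, no special action is required. */
	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
		return autovacuum_multixact_freeze_max_age;

	/*
	 * Shrink the effective freeze age in proportion to how far members
	 * usage has moved from the safe towards the danger threshold.
	 */
	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
	victim_multixacts = multixacts * fraction;

	/* fraction could be > 1, so watch out for overflow */
	if (victim_multixacts > multixacts)
		return 0;
	return Min(multixacts - victim_multixacts,
			   autovacuum_multixact_freeze_max_age);
}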
--
Heikki Linnakangas
Neon (https://neon.tech)
On 01/04/2025 21:25, Heikki Linnakangas wrote:
On 07/03/2025 13:30, Maxim Orlov wrote:
Here is a rebase, v14.
Thanks! I did some manual testing of this. I created a little helper
function to consume multixids, to test the autovacuum behavior, and
found one issue:
Forgot to attach the test function I used, here it is.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
Hi Heikki,
Please kindly help to create a task in https://commitfest.postgresql.org/53/;
I cannot find this patch in Commitfest 2025-07.
Thanks
I moved the topic to the next commitfest.
--
Best regards,
Maxim Orlov.
Hi Maxim,
On Wed, Apr 16, 2025 at 1:42 PM Maxim Orlov <orlovmg@gmail.com> wrote:
I moved the topic to the next commitfest.
I am reviewing these patches.
I notice that transam/README does not mention multixact except one place in
section "Transaction Emulation during Recovery". I expected it to document
how pg_multixact/members and pg_multixact/offset are used and what is their
layout. It's not the responsibility of this patchset to document it, but it
will be good if we add a section about multixacts in transam/README. It
will make reviews easier.
--
Best Wishes,
Ashutosh Bapat
On Tue, 29 Apr 2025 at 15:01, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:
I notice that transam/README does not mention multixact except one place
in section "Transaction Emulation during Recovery". I expected it to
document how pg_multixact/members and pg_multixact/offset are used and what
is their layout. It's not the responsibility of this patchset to document
it, but it will be good if we add a section about multixacts in
transam/README. It will make reviews easier.
Yeah, I agree, this is a big oversight, I think. Anyone who tries to
understand how pg_multixact works has to deal with the code.
Certainly, we need to address this issue.
But for now, here is a new rebase @ 70a13c528b6e382a381f.
The only change is that, following commits 15a79c7 and a0ed19e, we must
also switch to the PRIu64 format.
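An illustrative example of that change (not an actual hunk from the
patch):

-	elog(DEBUG1, "multixact: next offset is %llu",
-		 (unsigned long long) nextOffset);
+	elog(DEBUG1, "multixact: next offset is %" PRIu64, nextOffset);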
--
Best regards,
Maxim Orlov.
Attachments:
Yet another rebase @ f5a987c0e5
--
Best regards,
Maxim Orlov.
Attachments:
Once again, @ 8191e0c16a
--
Best regards,
Maxim Orlov.
Attachments:
Hello Maxim!
On Thu, Sep 11, 2025 at 11:58 AM Maxim Orlov <orlovmg@gmail.com> wrote:
Once again, @ 8191e0c16a
Thank you for your work on this subject. Multixact members can really
grow much faster than multixact offsets, and avoiding wraparound just
here might make sense. At the same time, making multixact offsets
64-bit is local and doesn't require changing the tuple xmin/xmax
interpretation.
I went through the patchset. The shape does not look bad, but I have
a concern about the size of the multixact offsets. As I understand,
this patchset doubles the multixact offsets: each multixact offset
grows from 32 bits to 64 bits. This seems quite a price for avoiding
the multixact members' wraparound.
We can try to squeeze the multixact offsets, given that they form an
ascending sequence, each time increased by a multixact's size. But how
many members can a multixact contain at maximum? Looking at
MultiXactIdExpand(), I gather that we keep locks from in-progress
transactions, plus committed non-lock transactions (I guess the latter
could be only one). The number of transactions run by backends should fit
MAX_BACKENDS (2^18 - 1), and the number of prepared transactions should
also fit MAX_BACKENDS. So, I guess we can cap the total number of one
multixact's members at 2^24.
Therefore, we could change from eight 32-bit multixact offsets
(32 bytes) to one 64-bit offset plus seven 24-bit offset increments
(29 bytes). The actual multixact offsets can be calculated on
the fly; the overhead shouldn't be significant. What do you think?
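To sketch the idea (hypothetical names; struct packing and alignment are
ignored for clarity):

#include <stdint.h>

/* eight offsets in 29 bytes: one absolute base + seven 24-bit increments */
typedef struct MxOffsetGroup
{
	uint64_t	base;			/* absolute offset of the first multixact */
	uint8_t		delta[7][3];	/* size of each preceding multixact */
} MxOffsetGroup;

/* reconstruct the absolute offset of the i-th multixact in the group */
static uint64_t
mx_offset_get(const MxOffsetGroup *grp, int i)
{
	uint64_t	off = grp->base;

	for (int j = 0; j < i; j++)
		off += (uint64_t) grp->delta[j][0] |
			((uint64_t) grp->delta[j][1] << 8) |
			((uint64_t) grp->delta[j][2] << 16);
	return off;
}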
------
Regards,
Alexander Korotkov
Supabase
On Sat, 13 Sept 2025 at 16:34, Alexander Korotkov <aekorotkov@gmail.com>
wrote:
Therefore, we could change from eight 32-bit multixact offsets
(32 bytes) to one 64-bit offset plus seven 24-bit offset increments
(29 bytes). The actual multixact offsets can be calculated on
the fly; the overhead shouldn't be significant. What do you think?
Thank you for your review; I'm pleased to hear from you again.
Yes, because the maximum number of mxoff is limited by the number of
running transactions, we may do it that way.
However, it is a bit weird to have offsets with the 7-byte "base".
I believe we may take advantage of the 64-bit XID patch's notion of
putting an 8-byte base followed by 4-byte offsets on a particular page.
A 32 kB page may then contain 2^13-2 offsets, each maxed at 2^18+1.
Therefore, the offset from the base will never overflow 2^31 and will
always fit in a uint32.
It appears logical to me.
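A minimal sketch of that layout (hypothetical names, illustrating the
idea rather than the patch):

#include <stdint.h>

#define MXOFF_PAGE_SIZE		32768
#define MXOFF_PER_PAGE		(MXOFF_PAGE_SIZE / sizeof(uint32_t) - 2)

typedef struct MxOffsetsPage
{
	uint64_t	base;					/* absolute offset of the first entry */
	uint32_t	off[MXOFF_PER_PAGE];	/* distance from the page base */
} MxOffsetsPage;

/*
 * Each stored distance is bounded by (2^13 - 2) entries of at most
 * 2^18 + 1 members each, which stays below 2^31, so uint32 never
 * overflows.
 */
static uint64_t
mx_offset_get(const MxOffsetsPage *p, int i)
{
	return p->base + p->off[i];
}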
--
Best regards,
Maxim Orlov.
Hi Maxim
Thanks for your continued efforts to get XID64 implemented.
A 32 kB page may then contain 2^13-2 offsets, each maxed at 2^18+1.
Therefore, the offset from the base will never overflow 2^31 and will
always fit in a uint32.
It appears logical to me.
Agreed, +1. But I have a question: I remember the XID64 patch got split
into a few threads. How are these threads related? The original one was
seen as too big a change, so it was broken up after people raised
concerns.
On Tue, 16 Sept 2025 at 15:12, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Agreed, +1. But I have a question: I remember the XID64 patch got split
into a few threads. How are these threads related? The original one was
seen as too big a change, so it was broken up after people raised
concerns.
Yeah, you're absolutely correct. This thread is a part of the overall
work on the transition to XID64 in Postgres. As far as I remember, no
one explicitly raised concerns, but it's clear to me that it won't be
committed as one big patch set.
Here is a WIP patch @ 8191e0c16a for the issue discussed above.
I also had to merge several patches from the previous set, since the
consensus is to use the PRI* format for outputting 64-bit values, and a
separate patch that only changed the format with casts and %llu lost
its meaning.
If this option suits everyone, the next step is to add the part related
to pg_upgrade.
--
Best regards,
Maxim Orlov.
Attachments:
Hi Maxim
Thank you for your feedback. I remember that for this patch the primary
challenges are in the upgrade section, where we're still debating how to
address a few edge cases. Right now, this thread is missing input from
core code contributors.
Thanks
Here is a new patch set @ 10b5bb3bffaee8
As previously stated, the patch set implements the concept of saving the
"difference" between page offsets in order to save disc space.
The second significant change was that I decided to modify the
pg_upgrade portion as suggested by Heikki above.
At the very least, the logic becomes much simpler and completely
replicates what is done in the multxact.c module and provide some
optimization.
Perhaps this will be easier to comprehend and analyse than attempting
to correctly convert bytes from one format to another.
In the near future, I intend to focus on testing and on implementing a
test suite.
--
Best regards,
Maxim Orlov.
Attachments:
On 27/10/2025 17:54, Maxim Orlov wrote:
Here is a new patch set @ 10b5bb3bffaee8
As previously stated, the patch set implements the concept of saving the
"difference" between page offsets in order to save disc space.
Hmm, is that safe? We do the assignment of multixact and offset, in the
GetNewMultiXactId() function, separately from updating the SLRU pages in
the RecordNewMultiXact() function. I believe this can happen:
To keep the arithmetic simple, let's assume that multixid 100 is the
first multixid on an offsets SLRU page. So the 'base' on the page is
initialized when multixid 100 is written.
1. Backend A calls GetNewMultiXactId(), is assigned multixid 100, offset
1000
2. Backend B calls GetNewMultiXactId(), is assigned multixid 101, offset
1010
3. Backend B calls RecordNewMultiXact() and sets 'page->offsets[1] = 10'
4. Backend A calls RecordNewMultiXact() and sets 'page->base = 1000' and
'page->offsets[0] = 0'
If backend C looks up multixid 101 in between steps 3 and 4, it would
read the offset incorrectly, because 'base' isn't set yet.
- Heikki
Unfortunately, I have to admit that I messed up a bit with v18.
I forgot to sync the pg_upgrade portion with the rest of the patch,
among other things. Here's a proper version with additional testing.
pg_upgrade/t/007_mxoff.pl is not meant to be committed; it's just
for test purposes. In order to run it, the env var oldinstall must be set.
On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 27/10/2025 17:54, Maxim Orlov wrote:
If backend C looks up multixid 101 in between steps 3 and 4, it would
read the offset incorrectly, because 'base' isn't set yet.
Hmm, maybe I missed something? We set the page base on the first write of
any offset on the page, not only the first one. In other words, there
should never be a case where we read an offset without a previously
defined page base. Correct me if I'm wrong:
1. Backend A assigned mxact=100, offset=1000.
2. Backend B assigned mxact=101, offset=1010.
3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and sets
page base=1010 and its offset delta (0) with the 0x80000000 bit set,
while holding a lock on the page.
4. Backend C looks up mxact=101 by calling MXOffsetRead()
and gets exactly what it's looking for:
base (1010) + offset (0), with the 0x80000000 bit cleared.
5. Backend A calls RecordNewMultiXact() and sets its offset using
the existing base from step 3.
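To make the scheme in steps 1-5 concrete, here is a minimal standalone
sketch of what MXOffsetWrite/MXOffsetRead could look like under these
assumptions (a 0x80000000 "written" marker, a signed 31-bit delta from a
lazily set page base). The struct and locking details are illustrative,
not the patch's actual code:

#include <stdint.h>

typedef uint64_t MultiXactOffset;

#define SLOT_WRITTEN  0x80000000u

typedef struct OffsetsPage
{
    MultiXactOffset base;       /* offset of the first entry written */
    int         base_set;       /* has 'base' been initialized yet? */
    uint32_t    slots[8190];    /* marker bit | 31-bit signed delta */
} OffsetsPage;

static void
MXOffsetWrite(OffsetsPage *page, int slot, MultiXactOffset offset)
{
    /* the first writer on the page, whichever it is, sets the base */
    if (!page->base_set)
    {
        page->base = offset;
        page->base_set = 1;
    }
    /* store the delta from base in the low 31 bits, plus the marker */
    page->slots[slot] = SLOT_WRITTEN |
        ((uint32_t) (offset - page->base) & 0x7FFFFFFFu);
}

static MultiXactOffset
MXOffsetRead(const OffsetsPage *page, int slot, int *found)
{
    uint32_t    raw = page->slots[slot];
    uint32_t    d = raw & 0x7FFFFFFFu;
    int32_t     delta;

    if (!(raw & SLOT_WRITTEN))
    {
        *found = 0;             /* slot not recorded yet */
        return 0;
    }
    /* sign-extend the 31-bit delta: entries written before the base
     * setter may lie below the base, hence negative deltas */
    delta = (d & 0x40000000u) ? (int32_t) (d | 0x80000000u) : (int32_t) d;
    *found = 1;
    return page->base + (MultiXactOffset) delta;
}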
--
Best regards,
Maxim Orlov.
Attachments:
On 30/10/2025 08:13, Maxim Orlov wrote:
On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Show quoted text
Oh I see, the 'base' is not necessarily the base offset of the first
multixact on the page, it's the base offset of the first multixid that
is written to the page. And the (short) offsets can be negative. That's
a frighteningly clever encoding scheme. One upshot of that is that WAL
redo might construct the page with a different 'base'. I guess that
works, but it scares me. Could we come up with a more deterministic scheme?
- Heikki
On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Oh I see, the 'base' is not necessarily the base offset of the first
multixact on the page, it's the base offset of the first multixid that
is written to the page. And the (short) offsets can be negative. That's
a frighteningly clever encoding scheme. One upshot of that is that WAL
redo might construct the page with a different 'base'. I guess that
works, but it scares me. Could we come up with a more deterministic scheme?
Definitely! The most stable approach is the one we had before, which
used actual 64-bit offsets in the SLRU. To be honest, I'm completely
happy with it. After all, what's most important for me is to have 64-bit
xids in Postgres, and this patch is a step towards that goal.
PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
segments.
Fortunately, now that I've separated reading and writing offsets into
different functions, switching from one implementation to another is
easy to do.
Here's a quick overview of the current state of the patch:
1) Access to the offsets is placed in separate calls:
MXOffsetWrite/MXOffsetRead.
2) I abandoned byte juggling in pg_upgrade and moved to using logic that
replicates the work with offsets in multixact.c.
3) As a result, the upgrade issue came down to the correct implementation
of the functions MXOffsetWrite/MXOffsetRead.
4) The only question that remains is the question of disk representation
of 64-bit offsets in SLRU segments.
My thoughts on point (4).
Using 32-bit offsets + some kind of packing:
Pros:
+ Reduces the total disk space used by the segments; ideally it is
almost the same as before.
Cons:
- Reduces reliability (losing a part will most likely result in losing
the entire page).
- Complicates the code, especially considering that offsets may be written
to a page in random order.
Using 64-bit offsets in SLRU:
Pros:
+ Easy to implement/transparent logic.
Cons:
- Increases the amount of disk space used.
In terms of speed, I'm not sure which will be faster. On the one hand,
64-bit eliminates the necessity for calculations and branching. On the
other hand, the amount of data used will increase.
I am not opposed to any of these options, as our primary goal is getting
64-bit offsets. However, I like the approach using full 64-bit offsets
in SLRU, because it is clearer and, shall we say, more robust. Yes, it
will increase the number of segments; however, this is not heap data
for a table. Under typical circumstances, there should not be too many
such segments.
--
Best regards,
Maxim Orlov.
Attachments:
On Thu, Oct 30, 2025 at 6:17 PM Maxim Orlov <orlovmg@gmail.com> wrote:
On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Show quoted text
Yes, but why can't we have an encoding scheme that is both
deterministic and provides compression? The attached is what I meant
in [1]. It's based on v19 and provides a deterministic conversion of
every 8 64-bit offsets into a chunk containing a 64-bit base and 7
24-bit increments. I didn't touch the pg_upgrade code though.
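As a rough illustration of that deterministic chunk encoding (8 offsets
become a 64-bit base plus 7 24-bit increments, 29 bytes in total), here
is a standalone sketch. It assumes a little-endian layout and that the
increments fit in 24 bits, and it is not the attached patch's actual code:

#include <stdint.h>
#include <string.h>

typedef uint64_t MultiXactOffset;

#define CHUNK_OFFSETS   8
#define CHUNK_BYTES     (8 + (CHUNK_OFFSETS - 1) * 3)   /* 29 bytes */

/* Encode 8 offsets: the first is stored verbatim as the base, the other
 * seven as 24-bit increments relative to it. */
static void
chunk_encode(const MultiXactOffset offs[CHUNK_OFFSETS],
             uint8_t out[CHUNK_BYTES])
{
    memcpy(out, &offs[0], 8);
    for (int i = 1; i < CHUNK_OFFSETS; i++)
    {
        uint32_t    delta = (uint32_t) (offs[i] - offs[0]);

        memcpy(out + 8 + (i - 1) * 3, &delta, 3);   /* low 3 bytes */
    }
}

/* Decoding is fully deterministic: the base is always the chunk's first
 * offset, so WAL redo reconstructs an identical page. */
static void
chunk_decode(const uint8_t in[CHUNK_BYTES],
             MultiXactOffset offs[CHUNK_OFFSETS])
{
    memcpy(&offs[0], in, 8);
    for (int i = 1; i < CHUNK_OFFSETS; i++)
    {
        uint32_t    delta = 0;

        memcpy(&delta, in + 8 + (i - 1) * 3, 3);
        offs[i] = offs[0] + delta;
    }
}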
Links.
1. /messages/by-id/CAPpHfdtPybyMYBj-x3-Z5=4bj_vhYk2R0nezfy=Vjcz4QBMDgw@mail.gmail.com
------
Regards,
Alexander Korotkov
Supabase
Attachments:
On 30/10/2025 18:17, Maxim Orlov wrote:
Show quoted text
Here's another round of review and cleanup of this. I made a bunch of
small changes, but haven't found any major problems. Looking pretty good.
Notable changes since v20:
- Changed MULTIXACT_MEMBER_AUTOVAC_THRESHOLD to 4000000000 instead of
0xFFFFFFFF. The 2^32 mark doesn't have any particular
significance, and using a round number makes that more clear. We should
possibly expose that as a separate GUC, but that can be done in a
followup patch.
- Removed the MXOffsetRead/Write functions again. They certainly make
sense if we make them more complicated with some kind of a compression
scheme, but I preferred to keep the code closer to 'master' for now.
- Removed more remnants of offset wraparound handling. There were still
a few places that checked for wraparound and tried to deal with it,
while other places just assumed that it doesn't happen. I added a check
in GetNewMultiXactId() to error out if it would wrap around. It really
should not happen in the real world, one reason being that we would run
out of WAL before running out of 64-bit mxoffsets, but a sanity check
won't hurt just in case someone e.g. abuses pg_resetwal to get into that
situation.
- Removed MaybeExtendOffsetSlru(). It was only used to deal with binary
upgrade from version 9.2 and below. Now that pg_upgrade rewrites the
files, it's not needed anymore.
- Modified PerformMembersTruncation() to use SimpleLruTruncate()
Changes in pg_upgrade:
- Removed the nextMXact/nextOffset fields from MultiXactWriter. They
were redundant with the next_multi and next_offset local variables in
the caller.
Remaining issues:
- There's one more refactoring I'd like to do before merging this: Move
the definitions that are now duplicated between
src/bin/pg_upgrade/multixact_new.c and
src/backend/access/transam/multixact.c into a new header file,
multixact_internal.h. One complication with that is that it needs
SLRU_PAGES_PER_SEGMENT from access/slru.h, but slru.h cannot currently
be included in FRONTEND code. Perhaps we should move
SLRU_PAGES_PER_SEGMENT to pg_config_manual.h, or if that feels too
global, to a separate slru_config.h file.
- I saw Alexander's proposal for a new compression scheme but didn't
incorporate that here. It might be a good idea, but I think we can do
that as a followup patch before the release, if it seems worth it. I
don't feel too bad about just making pg_multixact/offsets 2x larger either.
- Have you done any performance testing of the pg_upgrade code? How long
does the conversion take if you have e.g. 1 billion multixids?
- Is the !oldestOffsetKnown case in the code still reachable? I left one
FIXME comment about that. Needs a comment update at least.
- The new pg_upgrade test fails on my system with this error in the log:
# Running: pg_dump --no-sync --restrict-key test -d port=22462 host=/tmp/5KdMvth1jk dbname='postgres' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql
pg_dump: error: aborting because of server version mismatch
pg_dump: detail: server version: 19devel; pg_dump version: 17.6
could not read "/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql": No such file or directory at /home/heikki/git-sandbox/postgresql/src/bin/pg_upgrade/t/007_mxoff.pl line 242.
This turns out to be an issue with IPC::Run. Setting the
IPCRUNDEBUG=basic env variable reveals that it has a built-in command cache:
IPC::Run 0004 [#19(109223)]: ****** harnessing *****
IPC::Run 0004 [#19(109223)]: parsing [ pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f '/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql' ]
IPC::Run 0004 [#19(109223)]: ** starting
IPC::Run 0004 [#19(109223)]: 'pg_dump' found in cache: '/home/heikki/pgsql.17stable/bin/pg_dump'
IPC::Run 0004 [#19(111432) pg_dump]: execing /home/heikki/pgsql.17stable/bin/pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql
IPC::Run 0004 [#19(109223)]: ** finishing
pg_dump: error: aborting because of server version mismatch
pg_dump: detail: server version: 19devel; pg_dump version: 17.6
The test calls pg_dump twice: first with the old version, then with the
new version. But thanks to IPC::Run's command cache, the invocation of
the new pg_dump version actually also calls the old version. I'm not
sure how to fix that, but I was able to work around it by reversing the
pg_dump calls so that the new version is called first. That way we use
the new pg_dump against both server versions which works.
- The new pg_upgrade test is very slow. I would love to include that
test permanently in the test suite, but it's too slow for that currently.
- Heikki
Attachments:
On Wed, 5 Nov 2025 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Remaining issues:
- There's one more refactoring I'd like to do before merging this: Move
the definitions that are now duplicated between
src/bin/pg_upgrade/multixact_new.c and
src/backend/access/transam/multixact.c into a new header file,
multixact_internal.h. One complication with that is that it needs
SLRU_PAGES_PER_SEGMENT ...
Done. I also put SLRU_PAGES_PER_SEGMENT in pg_config_manual.h.
In my opinion, this constant aligns perfectly with the description in the
file header. In any case, feel free to move it anywhere you like.
- Have you done any performance testing of the pg_upgrade code? How long
does the conversion take if you have e.g. 1 billion multixids?
Unfortunately, not yet. I'd like to do this soon. Currently, the
bulk of the testing time is spent generating multi-transactions.
- Is the !oldestOffsetKnown case in the code still reachable? I left one
FIXME comment about that. Needs a comment update at least.
Yep, no longer needed. A separate commit has been added.
- The new pg_upgrade test fails on my system with this error in the log:
Unfortunately, I don't face this issue. I think this can be fixed by
providing an explicit path to the utility.
- The new pg_upgrade test is very slow. I would love to include that
test permanently in the test suite, but it's too slow for that currently.
Yes, unfortunately. The majority of the time is spent on tests that
produce multiple segments. These are cases number 4 and higher.
If we remove these, the testing should be relatively fast.
I also added the commit "Handle wraparound of next new multi in
pg_upgrade", per BUG #18863 and BUG #18865.
The issue is that pg_upgrade neglects to handle the wraparound of
mxact/mxoff.
We'll obviously resolve the issue with mxoff wraparound by moving to
64 bits. And the mxact bug can be easily solved with two lines of code,
or five if you count indents and comments. A test is also provided.
This commit is totally optional. If you think it deserves to be treated
as a separate issue, feel free to discard it.
--
Best regards,
Maxim Orlov.
Attachments:
I noticed one minor issue after I had already sent the
previous message.
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1034,7 +1034,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset
*offset)
if (nextOffset + nmembers < nextOffset)
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- "MultiXact members would wrap around"));
+ errmsg("MultiXact members would wrap around")));
*offset = nextOffset;
$ $PGBINOLD/pg_controldata -D pgdata
pg_control version number: 1800
Catalog version number: 202510221
...
Latest checkpoint's NextMultiXactId: 10000000
Latest checkpoint's NextMultiOffset: 999995050
Latest checkpoint's oldestXID: 748
...
I tried finding out how long it would take to convert a big number of
segments. Unfortunately, I only have access to a very old machine right
now. It took me 7 hours to generate this much data on my old
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.
Here are my rough measurements:
HDD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real 4m59.459s
user 0m19.974s
sys 0m13.640s
SSD
$ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time pg_upgrade
...
real 4m52.958s
user 0m19.826s
sys 0m13.624s
I aim to get access to more modern stuff and check it all out there.
--
Best regards,
Maxim Orlov.
On 07/11/2025 18:03, Maxim Orlov wrote:
Show quoted text
Thanks, I also did some perf testing on my laptop. I wrote a little
helper function to consume multixids, and used it to create a v17
cluster with 100 million multixids. See attached
consume-mxids.patch.txt. I then ran pg_upgrade on that, and measured how
long the pg_multixact conversion part of pg_upgrade took. It took about
1.2 s on my laptop. Extrapolating from that, converting 1 billion
multixids would take 12 s. These were very simple multixacts with just
one member each, though; realistic multixacts with more members would
presumably take a little longer.
In any case, I think we're in an acceptable ballpark here.
There's some very low-hanging fruit though: Profiling with 'linux-perf'
suggested that a lot of CPU time was spent simply on the function call
overhead of GetOldMultiXactIdSingleMember, SlruReadSwitchPage,
RecordNewMultiXact, SlruWriteSwitchPage for each multixact. I added an
inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to
eliminate the function call overhead of those in the common case that no
page switch is needed. With that, the 100 million mxid test case I used
went from 1.2 s to 0.9 s. We could optimize this further but I think
this is good enough.
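The fast path described above could look something like the following
standalone sketch; the state struct and function names are assumptions
made for illustration, not the actual patch's API:

#include <stdint.h>

typedef struct SlruReadState
{
    int64_t     cur_pageno;     /* page currently loaded in the buffer */
    /* ... buffer, file descriptor, etc., elided in this sketch ... */
} SlruReadState;

/* Slow path: actually switch to a different page (stubbed out here). */
static void
SlruReadSwitchPageSlow(SlruReadState *state, int64_t pageno)
{
    /* read the new page from disk ... */
    state->cur_pageno = pageno;
}

/* Inline fast path: in the common case the requested page is already
 * loaded, so no out-of-line function call is made at all. */
static inline void
SlruReadSwitchPage(SlruReadState *state, int64_t pageno)
{
    if (state->cur_pageno != pageno)
        SlruReadSwitchPageSlow(state, pageno);
}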
Some other changes since patch set v23:
- Rebased. I committed the wraparound bug fixes.
- I added an SlruFileName() helper function to slru_io.c, and support
for reading SLRUs with long_segment_names==true. It's not needed
currently, but it seemed like a weird omission. AllocSlruRead() actually
left 'long_segment_names' uninitialized which is error-prone. We
could've just documented it, but it seems just as easy to support it.
- I split the multixact_internal.h header into a separate commit, to make
it more clear which changes are related to 64-bit offsets.
I kept all the new test cases for now. We need to decide which ones are
worth keeping, and polish and speed up the ones we decide to keep.
I'm getting one failure from the pg_upgrade/008_mxoff test:
[14:43:38.422](0.530s) not ok 26 - dump outputs from original and restored regression databases match
[14:43:38.422](0.000s) # Failed test 'dump outputs from original and restored regression databases match'
# at /home/heikki/git-sandbox/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 801.
[14:43:38.422](0.000s) # got: '1'
# expected: '0'
=== diff of /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted and /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted ===
=== stdout ===
--- /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted 2025-11-12 14:43:38.030399957 +0200
+++ /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted 2025-11-12 14:43:38.314399819 +0200
@@ -2,8 +2,8 @@
 -- PostgreSQL database dump
 --
 \restrict test
--- Dumped from database version 17.6
--- Dumped by pg_dump version 17.6
+-- Dumped from database version 19devel
+-- Dumped by pg_dump version 19devel
 SET statement_timeout = 0;
 SET lock_timeout = 0;
 SET idle_in_transaction_session_timeout = 0;
=== stderr ===
=== EOF ===
[14:43:38.425](0.004s) # >>> case #6
I ran the test with:
(rm -rf build/testrun/ build/tmp_install/;
olddump=/tmp/olddump-regress.sql oldinstall=/home/heikki/pgsql.17stable/
meson test -C build --suite setup --suite pg_upgrade)
- Heikki
Attachments:
I realized that this issue was still outstanding:
On 01/04/2025 21:25, Heikki Linnakangas wrote:
Thanks! I did some manual testing of this. I created a little helper
function to consume multixids, to test the autovacuum behavior, and
found one issue:
If you consume a lot of multixid members space, by creating lots of
multixids with a huge number of members in each, you can end up with a
very bloated members SLRU, and autovacuum is in no hurry to clean it up.
Here's what I did:
1. Installed attached test module
2. Ran "select consume_multixids(10000, 100000);" many times
3. Ran:
$ du -h data/pg_multixact/members/
26G data/pg_multixact/members/
When I run "vacuum freeze; select * from pg_database;", I can see that
'datminmxid' for the current database is advanced. However, autovacuum
is in no hurry to vacuum 'template0' and 'template1', so pg_multixact/
members/ does not get truncated. Eventually, when
autovacuum_multixact_freeze_max_age is reached, it presumably will, but
you will run out of disk space before that.
There is this check for members size at the end of SetOffsetVacuumLimit():
/*
 * Do we need autovacuum? If we're not sure, assume yes.
 */
return !oldestOffsetKnown ||
    (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
And the caller (SetMultiXactIdLimit()) will in fact signal the
autovacuum launcher after "vacuum freeze" because of that. But the
autovacuum launcher will look at the datminmxid / relminmxid values, see
that they are well within autovacuum_multixact_freeze_max_age, and do
nothing.
This is a very extreme case, but clearly the code to signal the
autovacuum launcher, and the freeze age cutoff that autovacuum then
uses, are not in sync.
This patch removed MultiXactMemberFreezeThreshold(), per my suggestion,
but we threw the baby out with the bathwater. We discussed that in this
thread, but didn't come up with any solution. But ISTM we still need
something like MultiXactMemberFreezeThreshold() to trigger autovacuum
freezing if the members have grown too large.
Here's a new patch version that addresses the above issue. I resurrected
MultiXactMemberFreezeThreshold(), using the same logic as before, just
using pretty arbitrary thresholds of 1 and 2 billion offsets instead of
the safe/danger thresholds derived from MaxMultiOffset. That gives
roughly the same behavior wrt. calculating effective freeze age as before.
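Under those assumptions (a 1 billion threshold where the freeze age
starts shrinking, and a 2 billion point where it reaches zero), the shape
of that logic could be sketched as follows; the names and the linear
scaling are illustrative, not necessarily the committed code:

#include <stdint.h>

#define MEMBER_FREEZE_THRESHOLD   UINT64_C(1000000000)  /* 1 billion */
#define MEMBER_DANGER_THRESHOLD   UINT64_C(2000000000)  /* 2 billion */

/* Return the effective multixact freeze max age: unchanged below the
 * first threshold, scaled down linearly between the two, zero above. */
static int
MultiXactMemberFreezeThreshold(uint64_t members, int freeze_max_age)
{
    double      fraction;

    if (members < MEMBER_FREEZE_THRESHOLD)
        return freeze_max_age;
    if (members >= MEMBER_DANGER_THRESHOLD)
        return 0;

    fraction = (double) (MEMBER_DANGER_THRESHOLD - members) /
        (double) (MEMBER_DANGER_THRESHOLD - MEMBER_FREEZE_THRESHOLD);
    return (int) (freeze_max_age * fraction);
}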
Another change is that I removed the offset-based emergency vacuum
triggering. With 64-bit offsets, we never need to shut down the system
to prevent offset wraparound, so even if the offsets SLRU grows large,
it's not an "emergency" the same way that wraparound is. Consuming lots
of disk space could be a problem, of course, but we can let autovacuum
deal with that at the normal pace, like it deals with bloated tables.
The heuristics could surely be made better and/or more configurable, but
I think this is good enough for now.
I included these changes as a separate patch for review purposes, but it
ought to be squashed with the main patch before committing.
- Heikki
Attachments:
On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I added an
inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to
eliminate the function call overhead of those in the common case that no
page switch is needed. With that, the 100 million mxid test case I used
went from 1.2 s to 0.9 s. We could optimize this further but I think
this is good enough.
I agree with you.
- I added an SlruFileName() helper function to slru_io.c, and support
for reading SLRUs with long_segment_names==true. It's not needed
currently, but it seemed like a weird omission. AllocSlruRead() actually
left 'long_segment_names' uninitialized which is error-prone. We
could've just documented it, but it seems just as easy to support it.
Yeah, I didn't particularly like that place either. But then I decided
it was overkill to do it for the sake of symmetry and would raise
questions. It turned out much better this way.
I kept all the new test cases for now. We need to decide which ones are
worth keeping, and polish and speed up the ones we decide to keep.
I think of two cases here.
A) Upgrade from "new cluster":
* created cluster with pre 32-bit overflow mxoff
* consume around 2k of mxacts (1k before 32-bit overflow
and 1k after)
* run pg_upgrade
* check upgraded cluster is working
* check data invariant
B) Same as A), but for an "old cluster" with oldinstall env.
On Thu, 13 Nov 2025 at 19:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Here's a new patch version that addresses the above issue. I resurrected
MultiXactMemberFreezeThreshold(), using the same logic as before, just
using pretty arbitrary thresholds of 1 and 2 billion offsets instead of
the safe/danger thresholds derived from MaxMultiOffset. That gives
roughly the same behavior wrt. calculating effective freeze age as before.
Yes, I think it's okay for now. This reflects the existing logic well.
I wonder what the alternative solution might be. Could we make a
"vacuum freeze" also truncate pg_multixact segments?
In any case, this can be discussed later.
--
Best regards,
Maxim Orlov.
On 14/11/2025 17:40, Maxim Orlov wrote:
On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I kept all the new test cases for now. We need to decide which
ones are worth keeping, and polish and speed up the ones we decide
to keep.
Attached is a new patch version, with more work on the tests. The
pg_upgrade patch
(v26-0004-Add-pg_upgrade-for-64-bit-multixact-offsets.patch) now
includes a test case. I'm proposing to commit that test along with these
patches. It's a heavily-modified version of the test cases you wrote.
I tested that test using old installations, all the way down to version
9.4. That required a bunch of changes to the test perl modules, to make
them work with such old versions. Without any extra changes, the test
works down to v11.
Later patches in the patch set add more tests, labelled with the TEST:
prefix. Those are the tests you posted earlier, with little to no
modifications. I'm just carrying those around, so that I can easily run
them now during development. But I don't think they're adding much value
and I plan to leave them out of the final commit.
I think of two cases here.
A) Upgrade from "new cluster":
* created cluster with pre 32-bit overflow mxoff
* consume around of 2k of mxacts (1k before 32-bit overflow
and 1k after)
* run pg_upgrade
* check upgraded cluster is working
* check data invariant
B) Same as A), but for an "old cluster" with oldinstall env.
Makes sense.
The 007_multixact_conversion.pl test in the attached patches includes
two test scenarios: "basic" and "wraparound" test. In the basic
scenario there's no overflow or wraparound involved, but it can be run
without an old installation, i.e. in a "new -> new upgrade". The
"wraparound" scenario is the same, but the old cluster is reset with
pg_resetwal so that the mxoff wraps around. The "wraparound" scenario
requires a pre-19 old installation, because the pg_resetwal logic
requires the pre-v19 layout.
If we enhance the reset_mxoff() perl function in the test so that it
also works with v19, we could run the "wraparound" scenario in new->new
upgrades too. That would essentially be case A) that you listed above.
I think it's already pretty good as it is though. I don't expect the
point where we cross offset 2^32 in the new version to be very
interesting now that we use 64-bit offsets everywhere.
- Heikki
Attachments:
Here's yet another patch version. I spent the day reviewing this in
detail and doing little cleanups here and there. I squashed the commits
and wrote a proper commit message.
One noteworthy refactoring is in pg_upgrade.c, to make it more clear (to
me at least) how upgrade from version 9.2 and below now works. It was
actually broken when I tested it. Not sure if I had broken it earlier or
if it never worked, but in any case it works now.
I also tested upgrading a cluster from an old minor version, < 9.3.5,
where the control file has a bogus oldestMultiXid==1 value (see commit
b6a3444fa6). As expected, you get a "could not open file" error:
Performing Upgrade
------------------
Setting locale and encoding for new cluster ok
...
Deleting files from new pg_multixact/members ok
Deleting files from new pg_multixact/offsets ok
Converting pg_multixact files
could not open file "/home/heikki/pgsql.93stable/data/pg_multixact/offsets/0000": No such file or directory
Failure, exiting
I don't think we need to support that case. I hope there are no clusters
in that state still in the wild, and you can work around it by upgrading
to 9.3.5 or above and letting autovacuum run. But I wonder if a
pre-upgrade check with a better error message would still be worthwhile.
Ashutosh, you were interested in reviewing this earlier. Would you have
a chance to review this now, before I commit it? Alexander, Alvaro,
would you have a chance to take a final look too, please?
- Heikki
Attachments:
Hi Heikki
I don't think we need to support that case. I hope there are no clusters
in that state still in the wild, and you can work around it by upgrading
to 9.3.5 or above and letting autovacuum run. But I wonder if a
pre-upgrade check with a better error message would still be worthwhile.
I believe it is now highly unlikely to find instances of version
9.3 in the wild; all users are advised to upgrade to the latest version first.
Thanks
On Tue, Nov 18, 2025 at 12:35 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Show quoted text
One more small issue: The docs for pg_resetwal contain recipes for how
to determine safe values to use:
-m mxid,mxid
--multixact-ids=mxid,mxid
Manually set the next and oldest multitransaction ID.
A safe value for the next multitransaction ID (first part) can be
determined by looking for the numerically largest file name in the
directory pg_multixact/offsets under the data directory, adding one,
and then multiplying by 65536 (0x10000). Conversely, a safe value
for the oldest multitransaction ID (second part of -m) can be
determined by looking for the numerically smallest file name in the
same directory and multiplying by 65536. The file names are in
hexadecimal, so the easiest way to do this is to specify the option
value in hexadecimal and append four zeroes.

-O mxoff
--multixact-offset=mxoff
Manually set the next multitransaction offset.
A safe value can be determined by looking for the numerically
largest file name in the directory pg_multixact/members under the
data directory, adding one, and then multiplying by 52352 (0xCC80).
The file names are in hexadecimal. There is no simple recipe such as
the ones for other options of appending zeroes.
I think those recipes need to be adjusted for 64-bit offsets.
- Heikki
On Wed, 19 Nov 2025 at 19:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I think those recipes need to be adjusted for 64-bit offsets.
Yes, we need to do it.
Sorry if this is too obvious, but with 32-bit offsets, we get:
SLRU_PAGES_PER_SEGMENT * BLCKSZ / sizeof(MultiXactOffset) =
32 * 8192 / 4 = 65,536 mxoff per segment.
Now, with 64-bit offsets, each entry is twice as wide, so a segment
holds half as many: 32 * 8192 / 8 = 32,768 mxoff per segment.
--
Best regards,
Maxim Orlov.
On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Ashutosh, you were interested in reviewing this earlier. Would you have
a chance to review this now, before I commit it?
0002 seems to be fine. It's just moving code from one file to another.
However, the name multixact_internals.h seems to be misusing the term
"internal". I would expect an "internal.h" to expose things for use
within the module and not "externally" in other modules. But this
isn't the first time: buf_internal.h, toast_internal.h,
bgworker_internal.h and so on have their own meanings of what
"internal" means to them. Still, it might be better to use something more
meaningful than "internal" in this case. The file seems to contain
declarations related to how the pg_multixact SLRU is structured. To that
effect, multixact_slru.h or multixact_layout.h may be more appropriate.
There are two kinds of offsets that file deals with: the offset within
pg_multixact/offsets (MultiXactOffset), and the member offset (flag
offset and xid offset) within pg_multixact/members. It's quite easy to
get confused between those when reading that code. For example, it's not
clear which offset the MultiXactIdToOffset* functions are about. These
functions calculate the page, the entry (within the page) and the
segment (of the page) in pg_multixact/offsets where the MultiXactOffset
of the first member of a given mxid is to be found; thus they return an
offset within pg_multixact/offsets. I feel they should have been named
differently when the code was written. But now that we are moving this
code into a separate file, we also have an opportunity to rename them
better. I think the MXOffsetToMember* functions have better names. Using
a similar convention we could use MultiXactIdToOffsetOffset*, but that
might increase confusion. How about MultiXactIdToOffsetPos*? A separate
.h file also looks like a good place to document how the offsets SLRU
and its contents are laid out, and how members is laid out. The latter
is somewhat documented in terms of the macros and the static functions;
the former is not documented well, I feel. This refactoring seems to be
a good opportunity to do that. If we do so, I think there will be some
value in committing the .h file as a separate commit.
The reason why this eliminates the need for wraparound is mentioned
somewhere in GetNewMultiXactId(), but probably it should be mentioned
at a more prominent place and also in the commit message. I expected
it to be in the prologue of GetNewMultiXactId(), and then a reference
to the prologue from where the comment is right now.
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("MultiXact members would wrap around")));
If a server ever reaches this point, is there no way but to create
another cluster, if the application requires multixact ids? We
should also provide this as an errhint().
if (nextOffset + nmembers < nextOffset)
:). I spent a few seconds trying to understand this. But I don't know
how to make it easy to understand.
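For readers puzzling over that line: unsigned arithmetic wraps modulo
2^N, so the sum is smaller than either operand exactly when the addition
overflowed. An equivalent spelling, as a small standalone sketch (the
helper name is mine, not the patch's):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t MultiXactOffset;

/* True iff nextOffset + nmembers would overflow a 64-bit offset;
 * equivalent to the patch's (nextOffset + nmembers < nextOffset). */
static bool
offset_add_would_overflow(MultiXactOffset nextOffset, uint32_t nmembers)
{
    return (MultiXactOffset) nmembers > UINT64_MAX - nextOffset;
}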
In ExtendMultiXactMember() the last comment mentions a wraparound
/*
* Advance to next page, taking care to properly handle the wraparound
* case. OK if nmembers goes negative.
*/
I thought this wraparound is about offset wraparound, which cannot
happen now. Since you have left the comment intact, it must be something
else. Is it the wraparound of the offset within the page? Maybe it
requires a bit more clarification?
PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
newOldestOffset)
{
... snip ...
- segment += 1;
- }
+ SimpleLruTruncate(MultiXactMemberCtl,
+ MXOffsetToMemberPage(newOldestOffset));
}
Most of the callers of SimpleLruTruncate() call it directly. Why do we
want to keep this static wrapper? PerformOffsetsTruncation() has a
comment which seems to need the wrapper. But
PerformMembersTruncation() doesn't even have that.
MultiXactState->oldestMultiXactId = newOldestMulti;
MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+ MultiXactState->oldestOffset = newOldestOffset;
LWLockRelease(MultiXactGenLock);
Is this something we are missing in the current code? I cannot
understand the connection between this change and the fact that offset
wraparound is not possible with wider multixact offsets. Maybe I
missed some previous discussion.
I have reviewed patch 0002 and the multixact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.
--
Best Wishes,
Ashutosh Bapat
On 21/11/2025 14:15, Ashutosh Bapat wrote:
On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Ashutosh, you were interested in reviewing this earlier. Would you have
a chance to review this now, before I commit it?
0002 seems to be fine. It's just moving code from one file to another.
However, the name multixact_internals.h seems to be misusing term
"internal". I would expect an "internal.h" to expose things for use
within the module and not "externally" in other modules. But this
isn't the first time, buf_internal.h, toast_internal.h
bgworker_internal.h and so on have their own meaning of what
"internal" means to them. But it might be better to use something more
meaningful than "internal" in this case. The file seems to contain
declarations related to how pg_multixact SLRU is structured. To that
effect multixact_slru.h or multixact_layout.h may be more appropriate.
Yeah, I went with multixact_internal.h because of the precedent. It's
not great, but IMHO it's better to be consistent than to invent a new
naming scheme.
There are two kinds of offsets that file deals with: the offset within
pg_multixact/offsets (MultiXactOffset), and the member offset (flag
offset and xid offset) within pg_multixact/members. It's quite easy to
get confused between those when reading that code.
Agreed, those are confusing. I'll think about that a little more.
The reason why this eliminates the need for wraparound is mentioned
somewhere in GetNewMultiXactId(), but probably it should be mentioned
at a more prominent place and also in the commit message. I expected
it to be in the prologue of GetNewMultiXactId(), and then a reference
to the prologue from where the comment is right now.
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("MultiXact members would wrap around")));
If a server ever reaches this point, is there no way but to create
another cluster, if the application requires multixact ids?
Pretty much. You can also vacuum freeze everything, so that no multixids
are in use, and then use pg_resetwal to reset nextOffset to a lower value.
That obviously sounds bad, but this is a "can't happen" situation. For
comparison, we don't provide any better way to recover from running out
of LSNs either.
We should also provide this as an errhint().
I think no. You cannot run into this "organically" by just running the
system. If you do manage to hit it, it's a sign of some other trouble,
and I don't want to guess what that might be, or what might be the
appropriate way to recover.
In ExtendMultiXactMember() the last comment mentions a wraparound
/*
* Advance to next page, taking care to properly handle the wraparound
* case. OK if nmembers goes negative.
*/
I thought this wraparound is about offset wraparound, which can not
happen now. Since you have left the comment intact, it's something
else. Is the wraparound of offset within the page? Maybe requires a
bit more clarification?
It was indeed about offset wraparound. I'll remove it.
PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
newOldestOffset)
{
... snip ...
- segment += 1;
- }
+ SimpleLruTruncate(MultiXactMemberCtl,
+ MXOffsetToMemberPage(newOldestOffset));
}
Most of the callers of SimpleLruTruncate() call it directly. Why do we
want to keep this static wrapper? PerformOffsetsTruncation() has a
comment which seems to need the wrapper. But
PerformMembersTruncation() doesn't even have that.
Hmm, yeah those wrappers are a bit vestigial now. I'm inclined to keep
them, because as you said, PerformOffsetsTruncation() provides a place
for the comment. And given that, it seems best to keep
PerformMembersTruncation(), for symmetry.
MultiXactState->oldestMultiXactId = newOldestMulti;
MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+ MultiXactState->oldestOffset = newOldestOffset;
LWLockRelease(MultiXactGenLock);
Is this something we are missing in the current code? I cannot
understand the connection between this change and the fact that offset
wraparound is not possible with wider multixact offsets. Maybe I
missed some previous discussion.
Good question. At first I intended to extract that to a separate commit,
before the main patch, because it seems like a nice improvement: We have
just calculated 'oldestOffset', so why not update the value in shared
memory while we have it? But looking closer, I'm not sure if it'd be
sane without the 64-bit offsets. Currently, oldestOffset is only updated
by SetOffsetVacuumLimit(), which also updates offsetStopLimit. We could
get into a state where oldestOffset is set, but offsetStopLimit is not.
With 64-bit offsets that's no longer a concern because it removes
offsetStopLimit altogether.
I have reviewed patch 0002 and multxact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.
Thanks for the review so far!
- Heikki
Looking at the upgrade code, in light of the "IPC/MultixactCreation on
the Standby server" thread [1], I think we need to make it more
tolerant. It's possible that there are 0 offsets in
pg_multixact/offsets. That might or might not be a problem: it's OK as
long as those multixids don't appear in any heap table, or you might
actually have lost those multixids, which is bad but the damage has
already been done and upgrade should not get stuck on it.
GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.
[1]: /messages/by-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
- Heikki
On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.
Correct me if I'm wrong, but we added the assertion that offsets are
never 0, based on the idea that case #2 will never take place during an
upgrade. If this isn't the case, this assertion could be removed.
The rest of the function appears to work correctly.
I even think that, as an experiment, we could randomly reset some of the
offsets to zero and nothing would happen, except that some data would
be lost.
The most sensible thing we can do is give the user a warning, right?
Something like, "During the update, we encountered some weird offset
that shouldn't have been there, but there's nothing we can do about it,
just take note."
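A minimal standalone sketch of that tolerant behavior, with an
illustrative function and message of my own (the real patch's wording
and return convention may differ):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactId;
typedef uint64_t MultiXactOffset;
typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/* Instead of Assert(offset != 0), warn and report no member, so the
 * upgrade keeps going past multixids whose members were lost. */
static TransactionId
read_single_member(MultiXactId multi, MultiXactOffset offset)
{
    if (offset == 0)
    {
        fprintf(stderr,
                "warning: multixact %u has a zero offset; "
                "its members appear to be lost\n", (unsigned) multi);
        return InvalidTransactionId;
    }
    /* ... the normal lookup of the member at 'offset' goes here ... */
    return InvalidTransactionId;    /* placeholder for the real lookup */
}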
--
Best regards,
Maxim Orlov.
On 26/11/2025 17:23, Maxim Orlov wrote:
On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.
Correct me if I'm wrong, but we added the assertion that offsets are
never 0, based on the idea that case #2 will never take place during an
upgrade. If this isn't the case, this assertion could be removed.
The rest of the function appears to work correctly.
I even think that, as an experiment, we could randomly reset some of the
offsets to zero and nothing would happen, except that some data would
be lost.
+1
The most sensible thing we can do is give the user a warning, right?
Something like, "During the update, we encountered some weird offset
that shouldn't have been there, but there's nothing we can do about it,
just take note."
Yep, makes sense.
- Heikki
On Fri, Nov 21, 2025 at 7:26 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I have reviewed patch 0002 and multxact.c changes in 0003. So far I
have only these comments. I will review the pg_upgrade.c changes next.
007_multixact_conversion.pl fires thousands of queries through
BackgroundPsql which prints debug output for each of the queries. When
running this file with oldinstall set,
2.2M regress_log_007_multixact_conversion (size of file)
77874 regress_log_007_multixact_conversion (wc -l output)
Since this output is also copied in testlog.txt, the effect is two-fold.
Most, if not all, of this output is useless. It also makes it hard to
find the output we are looking for. PFA patch which reduces this
output. The patch adds a flag verbose to query_safe() and query() to
toggle this output. With the patch the sizes are
27K regress_log_007_multixact_conversion
588 regress_log_007_multixact_conversion
And it makes the test faster by about a second or two on my laptop.
Something on those lines or other is required to reduce the output
from query_safe().
Some more comments
+++ b/src/bin/pg_upgrade/multixact_old.c
We may need to introduce a new _new someday, and then _old will become
_older. Should we rename the files to have pre19 and post19 or similar
suffixes which make it clear what is meant by old and new?
+
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
The prologue mentions that the definitions are copy-pasted from
multixact.c from version 18, but they share names with functions
in the current version. I think that's going to be a good source of
confusion especially in a file which is a few hundred lines long. Can
we rename them to have "Old" prefix or something similar?
+
+# Dump contents of the 'mxofftest' table, created by mxact_workload
+sub get_dump_for_comparison
This local function shares its name with a local function in
002_pg_upgrade.pl. Better to use a separate name. Also it's not
"dumping" data using "pg_dump", so "dump" in the name can be
misleading.
+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;
There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?
+
+ compare_files($old_dump, $new_dump,
+ 'dump outputs from original and restored regression databases match');
A shared test name too :); but there is no regression database here.
+
+ note ">>> case #${tag}\n"
+ . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+ . " newnode mxoff ${new_next_mxoff}\n";
Should we check that some condition holds between finish_mxoff and
new_next_mxoff?
I will continue reviewing it further.
--
Best Wishes,
Ashutosh Bapat
Attachments:
On Fri, Nov 28, 2025 at 6:35 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Show quoted text
One more thing:
An UPDATE waits for a FOR SHARE query to finish, and vice versa. In my
experiments I didn't see an UPDATE creating a multixact. Why do we
have UPDATEs in the load created by the test? Am I missing something?
--
Best Wishes,
Ashutosh Bapat
On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:
One more thing,
An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
experiments I didn't see an UPDATE creating a multi-xact. Why do we
have UPDATEs in the load created by the test? Am I missing something?
As far as I remember, this was done on purpose to create different
multixact member statuses randomly.
--
Best regards,
Maxim Orlov.
On Mon, Dec 1, 2025 at 2:23 PM Maxim Orlov <orlovmg@gmail.com> wrote:
On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
One more thing,
An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
experiments I didn't see an UPDATE creating a multi-xact. Why do we
have UPDATEs in the load created by the test? Am I missing something?
As far as I remember, this was done on purpose to create different
multixact member statuses randomly.
In that case, better to include that in the comments.
--
Best Wishes,
Ashutosh Bapat
Hi, Heikki!
On Tue, Nov 25, 2025 at 12:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Show quoted text
Thank you for your work on this subject. It's very much appreciated.
I'd like to raise the question about compression again. You have
fairly criticized the non-deterministic compression, but what do you
think about the deterministic one that I've proposed [1]? I understand
that multixact offsets are still subject to growth and their limit is
not removed. However, it's still several extra gigabytes for multixact
offsets, which we could save.
Links.
1. /messages/by-id/CAPpHfduDFLXATvBkUiOjyvZUBZXhK_pj5zjVpxvrJzkRVq+8Lw@mail.gmail.com
------
Regards,
Alexander Korotkov
Supabase
On 02/12/2025 16:11, Alexander Korotkov wrote:
Show quoted text
It felt overly complicated for my taste. And decoding/encoding the whole
chunk on every access seems expensive. Maybe it's cheap enough that it
doesn't matter in practice, but some performance testing would at least
be in order. But I'd love to find a simpler scheme to begin with.
Storing one "base" offset per page, as Maxim did in [1]/messages/by-id/CACG=ezbPUASDL1eJ+c-ZkJMwRPukvp3EL0q1vSUa1h+fnX8y3g@mail.gmail.com, feels about
right to me. Except for the non-deterministic nature of how it gets set
in that patch, and what I referred to as a "frighteningly clever
encoding scheme".
Perhaps we could set the base offset in ExtendMultiXactOffset() already?
[1]: /messages/by-id/CACG=ezbPUASDL1eJ+c-ZkJMwRPukvp3EL0q1vSUa1h+fnX8y3g@mail.gmail.com
- Heikki
The biggest problem with compression, in my opinion, is that losing
even one byte causes, in the worst case, the loss of the entire
compressed block. After all, we still don't have checksums for the
SLRUs, which is a shame in itself.
Again, I'm not against the idea of compression, but the risks need to
be considered.
As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.
--
Best regards,
Maxim Orlov.
On 03/12/2025 11:54, Maxim Orlov wrote:
The biggest problem with compression, in my opinion, is that losing
even one byte causes, in the worst case, the loss of the entire
compressed block. After all, we still don't have checksums for the
SLRUs, which is a shame in itself.
Again, I'm not against the idea of compression, but the risks need to
be considered.
There are plenty of such critical bytes in the system where a single bit
flip renders the whole block unreadable. Actually, if we had checksums
on SLRU pages, a single bit flip anywhere in the page would make the
checksum fail and render the block unreadable.
If things go really bad and you need to open a hex editor and try to fix
the data manually, it shouldn't be too hard to deduce the correct base
offset from surrounding data.
As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.
+1. If we decide to do some kind of compression here, I want it to be
very simple. Otherwise it's just not worth the code complexity and risk.
Let's do the math of how much disk space we'd save. Let's assume the
worst case that every multixid consists of only one transaction ID.
Currently, every such multixid takes up 4 bytes in the offsets SLRU, and
5 bytes in the members SLRU (one flag byte and 4 bytes for the XID). So
that's 9 bytes. With 64-bit offsets, it becomes 13 bytes. With the
compression, we're back to 9 bytes again (ignoring the one base offset
per page). So in an extreme case that you have 1 billion multixids, with
only one XID per multixid, the difference is between 9 GB and 13 GB.
That seems acceptable.
And having just one XID per multixid is a rare corner case. Much more
commonly, you have at least two XIDs. With two XIDs per multixid, the
difference is between 14 bytes and 18 bytes.
And having a billion multixids is pretty extreme. Your database is
likely very large too if you reach that point, and a few gigabytes won't
matter.
One could argue that the memory needed for the SLRU cache matters more
than the disk space. That's perhaps true, but I think this is totally
acceptable from that point of view, too.
- Heikki
On 01/12/2025 14:35, Ashutosh Bapat wrote:
On Mon, Dec 1, 2025 at 2:23 PM Maxim Orlov <orlovmg@gmail.com> wrote:
On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
experiments I didn't see an UPDATE creating a multi-xact. Why do we
have UPDATEs in the load created by the test? Am I missing something?
As far as I remember, this was done on purpose to create different
multixact member statuses randomly.
In that case, better to include that in the comments.
I think that was indeed the purpose, but the test should use FOR KEY
SHARE rather than FOR SHARE. Otherwise the UPDATEs don't generate multixids.
- Heikki
Hi
As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.
Agreed, +1. If this had been done twenty years ago, the cost might have
been unacceptable. But with today's hardware, where disk random and
sequential I/O performance and memory capacity have improved by orders
of magnitude (single 256-GB DIMMs are a reality now), this kind of
overhead is negligible.
Thanks
On Wed, 3 Dec 2025 at 15:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
There are plenty of such critical bytes in the system where a single bit
flip renders the whole block unreadable. Actually, if we had checksums
on SLRU pages, a single bit flip anywhere in the page would make the
checksum fail and render the block unreadable.
Correct. However, my concern about the lack of checksums for SLRUs wasn't
about data loss and the impossibility of recovering it, but about the
impossibility of detecting the error. But that's how it is for now.
--
Best regards,
Maxim Orlov.
On Wed, Dec 3, 2025 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/12/2025 11:54, Maxim Orlov wrote:
The biggest problem with compression, in my opinion, is that losing
even one byte causes the loss of the entire compressed block in the
worst case scenario. After all, we still don't have checksums for the
SLRUs, which is a shame in itself.
Again, I'm not against the idea of compression, but the risks need to
be considered.
There are plenty of such critical bytes in the system where a single bit
flip renders the whole block unreadable. Actually, if we had checksums
on SLRU pages, a single bit flip anywhere in the page would make the
checksum fail and render the block unreadable.
If things go really bad and you need to open a hex editor and try to fix
the data manually, it shouldn't be too hard to deduce the correct base
offset from surrounding data.
As a software developer, I definitely want to implement compression and
save a few gigabytes. However, given my previous experience using
Postgres in real-world applications, reliability at the cost of several
gigabytes would not have caused me any trouble. Just saying.
+1. If we decide to do some kind of compression here, I want it to be
very simple. Otherwise it's just not worth the code complexity and risk.
Let's do the math of how much disk space we'd save. Let's assume the
worst case that every multixid consists of only one transaction ID.
Currently, every such multixid takes up 4 bytes in the offsets SLRU, and
5 bytes in the members SLRU (one flag byte and 4 bytes for the XID). So
that's 9 bytes. With 64-bit offsets, it becomes 13 bytes. With the
compression, we're back to 9 bytes again (ignoring the one base offset
per page). So in an extreme case that you have 1 billion multixids, with
only one XID per multixid, the difference is between 9 GB and 13 GB.
That seems acceptable.
And having just one XID per multixid is a rare corner case. Much more
commonly, you have at least two XIDs. With two XIDs per multixid, the
difference is between 14 bytes and 18 bytes.
And having a billion multixids is pretty extreme. Your database is
likely very large too if you reach that point, and a few gigabytes won't
matter.
I am in favour of keeping things simple rather than using complex compression.
One could argue that the memory needed for the SLRU cache matters more
than the disk space. That's perhaps true, but I think this is totally
acceptable from that point of view, too.
This brings up an interesting point. Since the offsets are twice as
large, the SLRU will contain half as many entries as before. Have we
measured the performance impact of this? Do we need to provide some
guidance about increasing the SLRU size, or increase the default SLRU
size?
--
Best Wishes,
Ashutosh Bapat
On 26/11/2025 17:50, Heikki Linnakangas wrote:
On 26/11/2025 17:23, Maxim Orlov wrote:
On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.
Correct me if I'm wrong, but we added the assertion that offsets are
never 0, based on the idea that case #2 will never take place during an
update. If this isn't the case, this assertion could be removed.
The rest of the function appears to work correctly.
I even think that, as an experiment, we could randomly reset some of the
offsets to zero and nothing would happen, except that some data would
be lost.
+1
The most sensible thing we can do is give the user a warning, right?
Something like, "During the upgrade, we encountered some weird offset
that shouldn't have been there, but there's nothing we can do about it,
just take note."
Yep, makes sense.
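Roughly something like this in the pg_upgrade reader, as a sketch (exact
message wording and recovery behavior to be decided):

if (offset == 0)
{
	/* instead of Assert(offset != 0) */
	pg_log(PG_WARNING,
		   "multixact %u has invalid offset 0, carrying it over as an empty multixact",
		   multi);
	return InvalidTransactionId;	/* assumed: caller treats this as "no members" */
}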
I read through the SLRU reading codepath, looking for all the things
that could go wrong (not sure I got them all):
1. An SLRU file does not exist
2. An SLRU file is too short, i.e. a page does not exist
3. The offset in 'offsets' page is 0
4. The offset in 'offsets' page looks invalid, i.e. it's greater than
nextOffset or smaller than oldestOffset.
5. The offset is out of order compared to its neighbors
6. The multixid has no members
7. The multixid has an invalid (0) member
8. A multixid has more than one updating member
Some of those situations are theoretically possible if there was a
crash. We don't follow the WAL-before-data rule for these SLRUs.
Instead, we piggyback on the WAL-before-data of the heap page that would
reference the multixid. In other words, we rely on the fact that if a
multixid write is missed or torn because of a crash, that multixid will
not be referenced from anywhere and will never be read.
However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
all the multixids. So we need to make the multixact reading code
tolerant of the situations that could be present after a crash. I think
the right philosophy here is that we try to read all the old multixids,
and do our best to interpret them the same way that the old server
would. For those situations that can legitimately be present if the old
server crashed at some point, be silent. For cases that should not
happen, even if there was a crash, print a warning. For example, I think
an SLRU file should never be missing (1) or truncated (2). But the zero
offset (3) and a multixid with no members (6) can happen.
Perhaps we should check that all the files exist and have the correct
sizes in the pre-check stage, and abort the upgrade early if anything is
missing. That would be pretty cheap to check.
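For illustration, a rough sketch of such a pre-check on the pg_upgrade
side (the segment-name format and the variables here are assumptions,
not actual patch code):

/*
 * Hypothetical pre-check: every offsets segment between the oldest and
 * next multixid must exist and contain a whole number of pages.
 */
for (int64 segno = first_segno; segno <= last_segno; segno++)
{
	char		path[MAXPGPATH];
	struct stat st;

	snprintf(path, sizeof(path), "%s/pg_multixact/offsets/%04X",
			 old_cluster.pgdata, (unsigned int) segno);
	if (stat(path, &st) != 0)
		pg_fatal("multixact segment \"%s\" is missing", path);
	if (st.st_size % BLCKSZ != 0)
		pg_fatal("multixact segment \"%s\" has invalid size", path);
}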
- Heikki
On Thu, 4 Dec 2025 at 13:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
all the multixids. So we need to make the multixact reading code
tolerant of the situations that could be present after a crash. I think
the right philosophy here is that we try to read all the old multixids,
and do our best to interpret them the same way that the old server
would.
Something like attached?
Now the previous upgrade scheme with the byte juggling starts
looking not so bad. Just a funny thought that came to my mind.
Perhaps we should check that all the files exist and have the correct
sizes in the pre-check stage
Not sure about that. Since SLRU does not support "holes", simply
checking that the first and last multixacts exist should be enough. But
we'll do that anyway in a real conversion.
PFA to start a conversation.
--
Best regards,
Maxim Orlov.
Attachments:
On Fri, Nov 28, 2025 at 6:35 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
I will continue reviewing it further.
There is duplication of code/functionality between the server and
pg_upgrade. With it we carry all the risks that come with
code/functionality duplication, like the copies going out of sync.
There may be a valid reason to do that, but it's not documented in the
comments. At least, both multixact_new.c and slru_io.c are not as well
commented as their server counterparts. I understand that the SLRU
code in the server deals with shared memory, which is not needed in
pg_upgrade; pg_upgrade will not need more than one buffer in memory,
and the pg_upgrade code doesn't need to deal with locks (and cannot).
That means the code required by pg_upgrade is much simpler
than that on the server. But there's also non-trivial code which is
required in both cases. Will it be possible to extract the parts of
slru.c which deal with I/O into slru_io.c, make it part of core, and
then use it in pg_upgrade as well as slru.c? Or is it possible
to make SLRU use local memory? And throwing some FRONTEND magic into the
mix, we may be able to avoid duplication. Have we tried this or
something else to avoid duplication? Sorry if this has been discussed
earlier. Please point me to the relevant discussion if so.
--
Best Wishes,
Ashutosh Bapat
After a little detour to the "IPC/MultixactCreation on the Standby
server" issue [1], I'm back to working on this. New patch version
attached, addressing your comments and Maxim's.
On 05/12/2025 15:42, Ashutosh Bapat wrote:
There is duplication of code/functionality between the server and
pg_upgrade. With it we carry all the risks that come with
code/functionality duplication, like the copies going out of sync.
There may be a valid reason to do that, but it's not documented in the
comments. At least, both multixact_new.c and slru_io.c are not as well
commented as their server counterparts. I understand that the SLRU
code in the server deals with shared memory, which is not needed in
pg_upgrade; pg_upgrade will not need more than one buffer in memory,
and the pg_upgrade code doesn't need to deal with locks (and cannot).
That means the code required by pg_upgrade is much simpler
than that on the server. But there's also non-trivial code which is
required in both cases. Will it be possible to extract the parts of
slru.c which deal with I/O into slru_io.c, make it part of core, and
then use it in pg_upgrade as well as slru.c? Or is it possible
to make SLRU use local memory? And throwing some FRONTEND magic into the
mix, we may be able to avoid duplication. Have we tried this or
something else to avoid duplication? Sorry if this has been discussed
earlier. Please point me to the relevant discussion if so.
That's a fair point, but I think it's better to have some code
duplication in this case than to try to write code that works for both
the server and pg_upgrade. The needs are so different.
007_multixact_conversion.pl fires thousands of queries through
BackgroundPsql which prints debug output for each of the queries. When
running this file with oldinstall set,
2.2M regress_log_007_multixact_conversion (size of file)
77874 regress_log_007_multixact_conversion (wc -l output)
Since this output is also copied in testlog.txt, the effect is two-fold.
Most, if not all, of this output is useless. It also makes it hard to
find the output we are looking for. PFA patch which reduces this
output. The patch adds a flag verbose to query_safe() and query() to
toggle this output. With the patch the sizes are
27K regress_log_007_multixact_conversion
588 regress_log_007_multixact_conversion
And it makes the test faster by about a second or two on my laptop.
Something on those lines or other is required to reduce the output
from query_safe().
Nice! That log bloat was the reason I bundled together the "COMMIT;
BEGIN; SELECT ...;" steps into one statement in the loop. Your solution
addresses it more directly.
I turned 'verbose' into a keyword parameter, for future extensibility of
those functions, so you now call it like "$node->query_safe("SELECT 1",
verbose => 0);". I also set "log_statement=none" in those connections,
to reduce the noise in the server log too.
Some more comments
+++ b/src/bin/pg_upgrade/multixact_old.c
We may need to introduce a new _new and then _old will become _older.
Should we rename the files to have pre19 and post19 or some similar
suffixes which make it clear what is meant by old and new?
+1. I renamed multixact_old.c to multixact_pre_v19.c. And
multixact_new.c to multixact_rewrite.c. I also moved the
"convert_multixact" function that drives the conversion to
multixact_rewrite.c. The idea is that if in the future we change the
format again, we will have:
multixact_pre_v19.c # for reading pre-v19 files
multixact_pre_v24.c # for reading v19-v23 files
multixact_rewrite.c # for writing new files
Hard to predict what a possible future format might look like and how
we'd want to organize the code then, though. This can be changed then if
needed, but it makes sense now.
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
The prologue mentions that the definitions are copy-pasted from
multixact.c from version 18, but they share the names with functions
in the current version. I think that's going to be a good source of
confusion especially in a file which is a few hundred lines long. Can
we rename them to have an "Old" prefix or something similar?
Fair. On the other hand, having the same names makes it easier to see
what the real differences with the server functions are. Not sure what's
best here...
As long as we use the same names, it's important that
multixact_pre_v19.c doesn't #include the new definitions. I added some
comments on that, and also this safeguard:
#define MultiXactOffset should_not_be_used
That actually caught one (harmless) instance in the file where we had
not renamed MultiXactOffset to OldMultiXactOffset.
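To illustrate, with that define in place any stray use of the
server-side type in this file now fails to compile:

#define MultiXactOffset should_not_be_used

/* ...later in the file, a leftover use of the 64-bit type: */
static MultiXactOffset nextOffset;
/* error: unknown type name 'should_not_be_used' */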
I'm not entirely happy with the "Old" prefix here, because as you
pointed out, we might end up needing "older" or "oldold" in the future.
I couldn't come up with anything better though. "PreV19MultiXactOffset"
is quite a mouthful.
+# Dump contents of the 'mxofftest' table, created by mxact_workload
+sub get_dump_for_comparison
This local function shares its name with a local function in
002_pg_upgrade.pl. Better to use a separate name. Also it's not
"dumping" data using "pg_dump", so "dump" in the name can be
misleading.
Renamed to "get_test_table_contents".
+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;
There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?
Agreed, the verification here is quite weak. I didn't realize that
pg_get_multixact_members() exists! That might indeed be handy here, but
I'm not sure how exactly to construct the test. A direct C function like
test_create_multixact() in test_multixact.c would be handy here, but
we'd need to compile and run that in the old cluster, which seems
difficult.
+ compare_files($old_dump, $new_dump,
+ 'dump outputs from original and restored regression databases match');
A shared test name too :), but there is no regression database here.
Fixed :-)
+
+ note ">>> case #${tag}\n"
+ . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+ . " newnode mxoff ${new_next_mxoff}\n";
Should we check that some condition holds between finish_mxoff and
new_next_mxoff?
Got something in mind that we could check?
On 04/12/2025 17:33, Maxim Orlov wrote:
On Thu, 4 Dec 2025 at 13:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
all the multixids. So we need to make the multixact reading code
tolerant of the situations that could be present after a crash. I think
the right philosophy here is that we try to read all the old multixids,
and do our best to interpret them the same way that the old server
would.
Something like attached?
+1
Now the previous upgrade scheme with the byte juggling starts
looking not so bad. Just a funny thought that came to my mind.
:-)
Perhaps we should check that all the files exist and have the correct
sizes in the pre-check stage
Not sure about that. Since SLRU does not support "holes", simply
checking that the first and last multixacts exist should be enough. But
we'll do that anyway in a real conversion.
Yeah, the point would be to complain if there are holes when there
shouldn't be. As a sanity check.
There's a reason to not do that though: if you use pg_resetwal to skip
over some multixids, you end up with holes. We shouldn't encourage
people to use pg_resetwal, but it seems reasonable to tolerate it if you
have done it.
I worked some more on this. One notable change is that in light of the
"IPC/MultixactCreation on the Standby server" changes [1]/messages/by-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru, we need to
always write the next multixid's offset, even if the next multixid
itself is invalid. Because otherwise the previous multixid is unreadable
too.
The SlruReadSwitchPageSlow() function didn't handle short reads
properly. As a result, if an SLRU file was shorter than expected, the
buffer kept its old contents when switching to read the missing page.
Fixed that so that the missing part is read as all-zeros instead.
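In other words, roughly (a sketch, not the exact code in the patch):

ssize_t		nread = read(fd, buffer, BLCKSZ);

if (nread < 0)
	pg_fatal("could not read file \"%s\": %m", path);
if (nread < BLCKSZ)
{
	/* short read: treat the missing tail of the page as zeros */
	memset(buffer + nread, 0, BLCKSZ - nread);
}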
I removed the warnings from some of the invalid multixid cases, like if
the offset or some of the members are zeros. Those cases can
legitimately happen after crash and restart, so we shouldn't complain
about them. If a multixid has more than one updating member, I kept that
as a fatal error. That really should not happen.
To summarize, the behavior now is that if an old SLRU file does not
exist, you get an error. If an SLRU file is too short, you get warnings
and the missing pages are read as all-zeros, i.e. all the multixids on
the missing pages are considered invalid. If an individual multixid is
invalid, because the offset is zero, it's silently written as invalid in
the new file too.
I'm still not 100% sure what the desired behavior for missing files is.
For now, I didn't include the pre-checks for the first and last files in
this version. You can end up with missing files if you skip over many
multixids with pg_resetwal. Or it could be a sign of lost data. If it's
lost data, would you prefer for the upgrade to fail, or continue
upgrading the data that you have? The conversion shouldn't make things
worse, if the data was already lost, but then again, if something really
bad has happened, all bets are off and perhaps it would be best to abort
and complain loudly.
[1]: /messages/by-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
- Heikki
Attachments:
On 06/12/2025 01:36, Heikki Linnakangas wrote:
On 05/12/2025 15:42, Ashutosh Bapat wrote:
+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;
There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?
Agreed, the verification here is quite weak. I didn't realize that
pg_get_multixact_members() exists! That might indeed be handy here, but
I'm not sure how exactly to construct the test. A direct C function like
test_create_multixact() in test_multixact.c would be handy here, but
we'd need to compile and run that in the old cluster, which seems
difficult.
I added verification of all the multixids between oldest and next
multixid, using pg_get_multixact_members(). The test now calls
pg_get_multixact_members() for all updating multixids in the range,
before and after the upgrade, and compares the results.
The verification ignores locking-only multixids. Verifying their
correctness would need a little more code because they're not fully
preserved by the upgrade.
I also expanded the test to cover multixid wraparound. It only covered
mxoffset wraparound previously.
New patch set attached. Only test changes compared to patch set v28.
- Heikki
Attachments:
On Sat, Dec 6, 2025 at 5:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 05/12/2025 15:42, Ashutosh Bapat wrote:
007_multixact_conversion.pl fires thousands of queries through
BackgroundPsql which prints debug output for each of the queries. When
running this file with oldinstall set,
2.2M regress_log_007_multixact_conversion (size of file)
77874 regress_log_007_multixact_conversion (wc -l output)
Since this output is also copied in testlog.txt, the effect is two-fold.
Most, if not all, of this output is useless. It also makes it hard to
find the output we are looking for. PFA patch which reduces this
output. The patch adds a flag verbose to query_safe() and query() to
toggle this output. With the patch the sizes are
27K regress_log_007_multixact_conversion
588 regress_log_007_multixact_conversion
And it makes the test faster by about a second or two on my laptop.
Something on those lines or other is required to reduce the output
from query_safe().
Nice! That log bloat was the reason I bundled together the "COMMIT;
BEGIN; SELECT ...;" steps into one statement in the loop. Your solution
addresses it more directly.
Now we can call query_safe() separately on each of those. That will be
more readable and marginally less code.
I turned 'verbose' into a keyword parameter, for future extensibility of
those functions, so you now call it like "$node->query_safe("SELECT 1",
verbose => 0);". I also set "log_statement=none" in those connections,
to reduce the noise in the server log too.
A keyword parameter is better. Also +1 for log_statement.
Some more comments
+++ b/src/bin/pg_upgrade/multixact_old.c
We may need to introduce a new _new and then _old will become _older.
Should we rename the files to have pre19 and post19 or some similar
suffixes which make it clear what is meant by old and new?
+1. I renamed multixact_old.c to multixact_pre_v19.c. And
multixact_new.c to multixact_rewrite.c. I also moved the
"convert_multixact" function that drives the conversion to
multixact_rewrite.c. The idea is that if in the future we change the
format again, we will have:multixact_pre_v19.c # for reading -v19 files
multixact_pre_v24.c # for reading v19-v23 files
multixact_rewrite.c # for writing new filesHard to predict what a possible future format might look like and how
we'd want to organize the code then, though. This can be changed then if
needed, but it makes sense now.
+1.
+static inline int64
+MultiXactIdToOffsetPage(MultiXactId multi)
The prologue mentions that the definitions are copy-pasted from
multixact.c from version 18, but they share the names with functions
in the current version. I think that's going to be a good source of
confusion especially in a file which is a few hundred lines long. Can
we rename them to have an "Old" prefix or something similar?
Fair. On the other hand, having the same names makes it easier to see
what the real differences with the server functions are. Not sure what's
best here...
As long as we use the same names, it's important that
multixact_pre_v19.c doesn't #include the new definitions. I added some
comments on that, and also this safeguard:
#define MultiXactOffset should_not_be_used
That actually caught one (harmless) instance in the file where we had
not renamed MultiXactOffset to OldMultiXactOffset.
That looks like a good safeguard, and it has already proved useful.
I'm not entirely happy with the "Old" prefix here, because as you
pointed out, we might end up needing "older" or "oldold" in the future.
I couldn't come up with anything better though. "PreV19MultiXactOffset"
is quite a mouthful.
How about MultiXactOffset32?
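That is, in the pre-v19 reader, something as simple as:

/* the old 32-bit on-disk offset representation, for pre-v19 clusters */
typedef uint32 MultiXactOffset32;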
Thanks for addressing the rest of the comments.
+
+ note ">>> case #${tag}\n"
+ . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
+ . " newnode mxoff ${new_next_mxoff}\n";
Should we check that some condition holds between finish_mxoff and
new_next_mxoff?
Got something in mind that we could check?
I have always seen that finish_mxoff is very high compared to the
newnode mxoff. Given that we write only one member per mxid, is the
newnode mxoff always going to be something like 4K or so? Then we could
check that value. But I will experiment more to see if I can come up
with something, if possible.
--
Best Wishes,
Ashutosh Bapat
On Mon, Dec 8, 2025 at 6:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 06/12/2025 01:36, Heikki Linnakangas wrote:
On 05/12/2025 15:42, Ashutosh Bapat wrote:
+ $newnode->start;
+ my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
+ $newnode->stop;
There is no code which actually looks at the multixact offsets here to
make sure that the conversion happened correctly. I guess the test
relies on visibility checks for that. Anyway, we need a comment
explaining why just comparing the contents of the table is enough to
ensure correct conversion. Better if we can add an explicit test that
the offsets were converted correctly. I don't have any idea of how to
do that right now, though. Maybe use pg_get_multixact_members()
somehow in the query to extract data out of the table?
Agreed, the verification here is quite weak. I didn't realize that
pg_get_multixact_members() exists! That might indeed be handy here, but
I'm not sure how exactly to construct the test. A direct C function like
test_create_multixact() in test_multixact.c would be handy here, but
we'd need to compile and run that in the old cluster, which seems
difficult.
I added verification of all the multixids between oldest and next
multixid, using pg_get_multixact_members(). The test now calls
pg_get_multixact_members() for all updating multixids in the range,
before and after the upgrade, and compares the results.
I thought about adding pg_get_multixact_members() in
get_test_table_contents() itself, like SELECT ctid, xmin, xmax,
pg_get_multixact_members(xmin), pg_get_multixact_members(xmax) FROM
mxofftest; but then I realized that the UPDATE would replace mxids with
actual transaction ids in the visible rows. So that can't be used.
What you have done doesn't have that drawback, but it's also not
checking whether the multixids in (invisible) rows are reachable in
offsets and members. But probably that's too hard to do and is covered
by visibility checks.
--
Best Wishes,
Ashutosh Bapat
On 08/12/2025 17:43, Ashutosh Bapat wrote:
On Sat, Dec 6, 2025 at 5:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 05/12/2025 15:42, Ashutosh Bapat wrote:
And it makes the test faster by about a second or two on my laptop.
Something on those lines or other is required to reduce the output
from query_safe().
Nice! That log bloat was the reason I bundled together the "COMMIT;
BEGIN; SELECT ...;" steps into one statement in the loop. Your solution
addresses it more directly.
Now we can call query_safe() separately on each of those. That will be
more readable and marginally less code.
Done.
I'm not entirely happy with the "Old" prefix here, because as you
pointed out, we might end up needing "older" or "oldold" in the future.
I couldn't come up with anything better though. "PreV19MultiXactOffset"
is quite a mouthful.
How about MultiXactOffset32?
Ooh, I like that. It doesn't sound as nice for the other "old" prefixed
things though. So I changed OldMultiXactOffset to MultiXactOffset32, but
kept OldMultiXactReader, GetOldMultiXactIdSingleMember() et al. We can
live with that for now, and rename in the future if we ever need "oldold".
Committed with that and some other minor cleanups. Thanks everyone! This
patch has been brewing for a while :-).
There are some noncritical followups that I'd like to address, now that
we know that in v19 the pg_multixact files will be rewritten. That gives
us an opportunity to clean up some backwards-compatibility stuff. The
committed patch already cleaned up a bunch, but there's some more we
could do:
1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
0, which is invalid. All the readers must be prepared for that, and skip
over the 0. That's error-prone, we've already missed that a few times.
Let's change things so that the code that *writes*
MultiXactState->nextMXact skips over the zero already.
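For example, the code advancing the counter could do something like
this (sketch):

/*
 * Advance nextMXact, skipping the invalid value 0 at wraparound, so
 * that readers never see it and don't need to special-case it.
 */
MultiXactState->nextMXact++;
if (MultiXactState->nextMXact < FirstMultiXactId)
	MultiXactState->nextMXact = FirstMultiXactId;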
2. We currently don't persist 'oldestOffset' in the control file the
same way as 'oldestMultiXactId'. Instead, we look up the offset of the
oldestMultiXactId at startup, and keep that value in memory. Originally
that was because we missed the need for that and had to add the offset
wraparound protections in a minor release without changing the control
file format. But we could easily do it now.
With 64-bit offsets, it's actually less critical to persist the
oldestOffset. Previously, if we failed to look up the oldest offset
because the oldest multixid was invalid, it could lead to serious
trouble if the offsets then wrapped around and old offsets were
overwritten, but that won't happen anymore. Nevertheless, it leads to
unnecessarily aggressive vacuuming and some messages in the log.
At first I thought that the failure to look up the oldest offset should
no longer happen, because we don't need to support reading old 9.3 era
SLRUs anymore that were created before we added the offset wraparound
protection. But it's not so: it's still possible to have multixids with
invalid offsets in the 'offsets' SLRU on a crash. Such multixids won't
be referenced from anywhere in the heap, but I think they could later
become the oldest multixid, and we would fail to look up its offset.
Persisting the oldest offset doesn't fully fix that problem, because
advancing the oldest offset is done by looking up the oldest multixid's
offset anyway.
3. I think we should turn some of the assertions in
GetMultiXactIdMembers() into ereport(ERROR) calls. I included those
changes in my patch version 29 [1], as a separate patch, but I didn't
commit that yet.
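For example, something along these lines (a sketch; the exact checks
and wording are per that separate patch):

/* in GetMultiXactIdMembers(), instead of an Assert on the offset */
if (offset == 0)
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg("found invalid zero offset for multixact %u", multi)));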
4. Compressing the offsets, per discussion. It doesn't really seem worth
it to me and I don't intend to work on it, but if someone wants to do it,
now would be the time, so that we don't need to have upgrade code to
deal with yet another format.
- Heikki
On 09/12/2025 14:00, Heikki Linnakangas wrote:
1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
0, which is invalid. All the readers must be prepared for that, and skip
over the 0. That's error-prone, we've already missed that a few times.
Let's change things so that the code that *writes*
MultiXactState->nextMXact skips over the zero already.
Here's a patch for that. Does anyone see a problem with this?
- Heikki
Attachments:
On Thu, Dec 11, 2025 at 12:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 09/12/2025 14:00, Heikki Linnakangas wrote:
1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
0, which is invalid. All the readers must be prepared for that, and skip
over the 0. That's error-prone, we've already missed that a few times.
Let's change things so that the code that *writes*
MultiXactState->nextMXact skips over the zero already.
Here's a patch for that. Does anyone see a problem with this?
The patch looks fine to me. It simplifies readers without affecting
writers much. I was expecting more explanation of why it wasn't done
that way to start with, and why it is safe to do so (now, if
applicable). There must be a reason why we chose to make readers
handle the invalid mxid instead of having writers skip it. If it's for
performance reasons, does the new arrangement cause any
regression? If it's for safety reasons, are we fixing one set of
problems but introducing a new set? I was expecting the commit message
to answer those questions.
--
Best Wishes,
Ashutosh Bapat