pg_trgm comparison bug on cross-architecture replication due to different char implementation
Hi all,
I would like to report an issue with the pg_trgm extension on cross-architecture replication scenarios. When an x86_64 standby server is replicating from an aarch64 primary server or vice versa, the gist_trgm_ops opclass returns different results on the primary and standby. Masahiko previously reported a similar issue affecting the pg_bigm extension [1]https://osdn.net/projects/pgbigm/lists/archive/hackers/2024-February/000370.html.
To reproduce, execute the following on the x86_64 primary server:
CREATE EXTENSION pg_trgm;
CREATE TABLE tbl (c text);
CREATE INDEX ON tbl USING gist (c gist_trgm_ops);
INSERT INTO tbl VALUES ('Bóbr');
On the x86_64 primary server:
postgres=> select * from tbl where c like '%Bób%';
c
------
Bóbr
(1 row)
On the aarch64 replica server:
postgres=> select * from tbl where c like '%Bób%';
c
---
(0 rows)
The root cause looks the same as the pg_bigm issue that Masahiko reported. To compare trigrams, pg_trgm uses a numerical comparison of chars [2]https://github.com/postgres/postgres/blob/480bc6e3ed3a5719cdec076d4943b119890e8171/contrib/pg_trgm/trgm.h#L45. On x86_64 a char is signed by default, whereas on aarch64 it is unsigned by default. gist_trgm_ops expects the trigram list to be sorted, but due to the different signedness of chars, the sort order is broken when replicating the values across architectures.
The different sort behaviour can be demonstrated using show_trgm.
On the x86_64 primary server:
postgres=> SELECT show_trgm('Bóbr');
show_trgm
------------------------------------------
{0x89194c," b","br ",0x707c72,0x7f7849}
(1 row)
On the aarch64 replica server:
postgres=> SELECT show_trgm('Bóbr');
show_trgm
------------------------------------------
{" b","br ",0x707c72,0x7f7849,0x89194c}
(1 row)
One simple solution for this specific case is to declare the char signedness in the CMPPCHAR macro.
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -42,7 +42,7 @@
typedef char trgm[3];
#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
+#define CMPPCHAR(a,b,i) CMPCHAR( *(((unsigned char*)(a))+i), *(((unsigned char*)(b))+i) )
#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
#define CPTRGM(a,b) do { \
Alternatively, Postgres can be compiled with -funsigned-char or -fsigned-char. I came across a code comment suggesting that this may not be a good idea in general [3]https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/cash.c#L114-L123.
Given that this has problem has come up before and seems likely to come up again, I'm curious what other broad solutions there might be to resolve it? Looking forward to any feedback, thanks!
Best,
Adam Guo
Amazon Web Services: https://aws.amazon.com
[1]: https://osdn.net/projects/pgbigm/lists/archive/hackers/2024-February/000370.html
[2]: https://github.com/postgres/postgres/blob/480bc6e3ed3a5719cdec076d4943b119890e8171/contrib/pg_trgm/trgm.h#L45
[3]: https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/cash.c#L114-L123
"Guo, Adam" <adamguo@amazon.com> writes:
I would like to report an issue with the pg_trgm extension on
cross-architecture replication scenarios. When an x86_64 standby
server is replicating from an aarch64 primary server or vice versa,
the gist_trgm_ops opclass returns different results on the primary
and standby.
I do not think that is a supported scenario. Hash functions and
suchlike are not guaranteed to produce the same results on different
CPU architectures. As a quick example, I get
regression=# select hashfloat8(34);
hashfloat8
------------
21570837
(1 row)
on x86_64 but
postgres=# select hashfloat8(34);
hashfloat8
------------
-602898821
(1 row)
on ppc32 thanks to the endianness difference.
Given that this has problem has come up before and seems likely to
come up again, I'm curious what other broad solutions there might be
to resolve it?
Reject as not a bug. Discourage people from thinking that physical
replication will work across architectures.
regards, tom lane
On Tue, Apr 23, 2024 at 11:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Guo, Adam" <adamguo@amazon.com> writes:
I would like to report an issue with the pg_trgm extension on
cross-architecture replication scenarios. When an x86_64 standby
server is replicating from an aarch64 primary server or vice versa,
the gist_trgm_ops opclass returns different results on the primary
and standby.I do not think that is a supported scenario. Hash functions and
suchlike are not guaranteed to produce the same results on different
CPU architectures. As a quick example, I getregression=# select hashfloat8(34);
hashfloat8
------------
21570837
(1 row)on x86_64 but
postgres=# select hashfloat8(34);
hashfloat8
------------
-602898821
(1 row)on ppc32 thanks to the endianness difference.
Given that this has problem has come up before and seems likely to
come up again, I'm curious what other broad solutions there might be
to resolve it?Reject as not a bug. Discourage people from thinking that physical
replication will work across architectures.
While cross-arch physical replication is not supported, I think having
architecture dependent differences is not good and It's legitimate to
fix it. FYI the 'char' data type comparison is done as though char is
unsigned. I've attached a small patch to fix it. What do you think?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
fix_signedness_issue_in_pg_trgm.patchapplication/octet-stream; name=fix_signedness_issue_in_pg_trgm.patchDownload
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index afb0adb222..4f618e0c21 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -42,7 +42,7 @@
typedef char trgm[3];
#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
+#define CMPPCHAR(a,b,i) CMPCHAR( *(((const unsigned char*)(a))+i), *(((const unsigned char*)(b))+i) )
#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
#define CPTRGM(a,b) do { \
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Tue, Apr 23, 2024 at 11:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Reject as not a bug. Discourage people from thinking that physical
replication will work across architectures.
While cross-arch physical replication is not supported, I think having
architecture dependent differences is not good and It's legitimate to
fix it. FYI the 'char' data type comparison is done as though char is
unsigned. I've attached a small patch to fix it. What do you think?
I think this will break existing indexes that are working fine.
Yeah, it would have been better to avoid the difference, but
it's too late now.
regards, tom lane
On Tue, Apr 30, 2024 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Tue, Apr 23, 2024 at 11:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Reject as not a bug. Discourage people from thinking that physical
replication will work across architectures.While cross-arch physical replication is not supported, I think having
architecture dependent differences is not good and It's legitimate to
fix it. FYI the 'char' data type comparison is done as though char is
unsigned. I've attached a small patch to fix it. What do you think?I think this will break existing indexes that are working fine.
Yeah, it would have been better to avoid the difference, but
it's too late now.
True. So it will be a PG18 item.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Tue, Apr 30, 2024 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think this will break existing indexes that are working fine.
Yeah, it would have been better to avoid the difference, but
it's too late now.
True. So it will be a PG18 item.
How will it be any better in v18? It's still an on-disk
compatibility break for affected platforms.
Now, people could recover by reindexing affected indexes,
but I think we need to have a better justification than this
for making them do so.
regards, tom lane
On Tue, Apr 23, 2024 at 5:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Guo, Adam" <adamguo@amazon.com> writes:
I would like to report an issue with the pg_trgm extension on
cross-architecture replication scenarios. When an x86_64 standby
server is replicating from an aarch64 primary server or vice versa,
the gist_trgm_ops opclass returns different results on the primary
and standby.I do not think that is a supported scenario. Hash functions and
suchlike are not guaranteed to produce the same results on different
CPU architectures. As a quick example, I getregression=# select hashfloat8(34);
hashfloat8
------------
21570837
(1 row)on x86_64 but
postgres=# select hashfloat8(34);
hashfloat8
------------
-602898821
(1 row)on ppc32 thanks to the endianness difference.
Given this, should we try to do better with binary compatibility
checks using ControlFileData? AFAICS they are supposed to check if
the database cluster is binary compatible with the running
architecture. But it obviously allows incompatibilities.
------
Regards,
Alexander Korotkov
Alexander Korotkov <aekorotkov@gmail.com> writes:
Given this, should we try to do better with binary compatibility
checks using ControlFileData? AFAICS they are supposed to check if
the database cluster is binary compatible with the running
architecture. But it obviously allows incompatibilities.
Perhaps. pg_control already covers endianness, which I think
is the root of the hashing differences I showed. Adding a field
for char signedness feels a little weird, since it's not directly
a property of the bits-on-disk, but maybe we should.
regards, tom lane
On Tue, Apr 30, 2024 at 7:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
Given this, should we try to do better with binary compatibility
checks using ControlFileData? AFAICS they are supposed to check if
the database cluster is binary compatible with the running
architecture. But it obviously allows incompatibilities.Perhaps. pg_control already covers endianness, which I think
is the root of the hashing differences I showed. Adding a field
for char signedness feels a little weird, since it's not directly
a property of the bits-on-disk, but maybe we should.
I agree that storing char signedness might seem weird. But it appears
that we already store indexes that depend on char signedness. So,
it's effectively property of bits-on-disk even though it affects
indirectly. Then I see two options to make the picture consistent.
1) Assume that char signedness is somehow a property of bits-on-disk
even though it's weird. Then pg_trgm indexes are correct, but we need
to store char signedness in pg_control.
2) Assume that char signedness is not a property of bits-on-disk.
Then pg_trgm indexes are buggy and need to be fixed.
What do you think?
------
Regards,
Alexander Korotkov
Alexander Korotkov <aekorotkov@gmail.com> writes:
I agree that storing char signedness might seem weird. But it appears
that we already store indexes that depend on char signedness. So,
it's effectively property of bits-on-disk even though it affects
indirectly. Then I see two options to make the picture consistent.
1) Assume that char signedness is somehow a property of bits-on-disk
even though it's weird. Then pg_trgm indexes are correct, but we need
to store char signedness in pg_control.
2) Assume that char signedness is not a property of bits-on-disk.
Then pg_trgm indexes are buggy and need to be fixed.
What do you think?
The problem with option (2) is the assumption that pg_trgm's behavior
is the only bug of this kind, either now or in the future. I think
that's just about an impossible standard to meet, because there's no
realistic way to test whether char signedness is affecting things.
(Sure, you can compare results across platforms, but maybe you
just didn't test the right case.)
Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.
Option (2) ... not so much.
regards, tom lane
On Wed, May 1, 2024 at 2:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
I agree that storing char signedness might seem weird. But it appears
that we already store indexes that depend on char signedness. So,
it's effectively property of bits-on-disk even though it affects
indirectly. Then I see two options to make the picture consistent.
1) Assume that char signedness is somehow a property of bits-on-disk
even though it's weird. Then pg_trgm indexes are correct, but we need
to store char signedness in pg_control.
2) Assume that char signedness is not a property of bits-on-disk.
Then pg_trgm indexes are buggy and need to be fixed.
What do you think?The problem with option (2) is the assumption that pg_trgm's behavior
is the only bug of this kind, either now or in the future. I think
that's just about an impossible standard to meet, because there's no
realistic way to test whether char signedness is affecting things.
(Sure, you can compare results across platforms, but maybe you
just didn't test the right case.)Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue.
I think that the char signedness issue is an issue also for developers
(and extension authors) since it could lead to confusion and potential
bugs in the future due to that. x86 developers would think of char as
always being signed and write code that will misbehave on arm
machines. For example, since logical replication should behave
correctly even in cross-arch replication all developers need to be
aware of that. I thought of using the -funsigned-char (or
-fsigned-char) compiler flag to avoid that but it would have a broader
impact not only on indexes.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 30.04.24 19:29, Tom Lane wrote:
1) Assume that char signedness is somehow a property of bits-on-disk
even though it's weird. Then pg_trgm indexes are correct, but we need
to store char signedness in pg_control.
2) Assume that char signedness is not a property of bits-on-disk.
Then pg_trgm indexes are buggy and need to be fixed.
What do you think?Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.
But note that option 1 would prevent some replication that is currently
working.
Peter Eisentraut <peter@eisentraut.org> writes:
On 30.04.24 19:29, Tom Lane wrote:
Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.
But note that option 1 would prevent some replication that is currently
working.
The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.
Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.
If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.
regards, tom lane
On 03.05.24 16:13, Tom Lane wrote:
Peter Eisentraut <peter@eisentraut.org> writes:
On 30.04.24 19:29, Tom Lane wrote:
Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.But note that option 1 would prevent some replication that is currently
working.The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.
Yes, that is understood. But anecdotally, replicating between x86-64
arm64 is occasionally used for upgrades or migrations. In practice,
this appears to have mostly worked. If we now discover that it won't
work with certain index extension modules, it's usable for most users.
Even if we say, you have to reindex everything afterwards, it's probably
still useful for these scenarios.
The way I understand the original report, the issue has to do
specifically with how signed and unsigned chars compare differently. I
don't imagine this is used anywhere in the table/heap code. So it's
plausible that this issue is really contained to indexes.
On the other hand, if we put in a check against this, then at least we
can answer any user questions about this with more certainty: No, won't
work, here is why.
On 5/3/24 11:44, Peter Eisentraut wrote:
On 03.05.24 16:13, Tom Lane wrote:
Peter Eisentraut <peter@eisentraut.org> writes:
On 30.04.24 19:29, Tom Lane wrote:
Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.But note that option 1 would prevent some replication that is currently
working.The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.Yes, that is understood. But anecdotally, replicating between x86-64 arm64 is
occasionally used for upgrades or migrations. In practice, this appears to have
mostly worked. If we now discover that it won't work with certain index
extension modules, it's usable for most users. Even if we say, you have to
reindex everything afterwards, it's probably still useful for these scenarios.
+1
I have heard similar anecdotes, and the reported experience goes even further --
many such upgrade/migration uses, with exceedingly rare reported failures.
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Sat, May 4, 2024 at 7:36 AM Joe Conway <mail@joeconway.com> wrote:
On 5/3/24 11:44, Peter Eisentraut wrote:
On 03.05.24 16:13, Tom Lane wrote:
Peter Eisentraut <peter@eisentraut.org> writes:
On 30.04.24 19:29, Tom Lane wrote:
Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.But note that option 1 would prevent some replication that is currently
working.The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.Yes, that is understood. But anecdotally, replicating between x86-64 arm64 is
occasionally used for upgrades or migrations. In practice, this appears to have
mostly worked. If we now discover that it won't work with certain index
extension modules, it's usable for most users. Even if we say, you have to
reindex everything afterwards, it's probably still useful for these scenarios.+1
+1
How about extending amcheck to support GIN and GIst indexes so that it
can detect potential data incompatibility due to changing 'char' to
'unsigned char'? I think these new tests would be useful also for
users to check if they really need to reindex indexes due to such
changes. Also we fix pg_trgm so that it uses 'unsigned char' in PG18.
Users who upgraded to PG18 can run the new amcheck tests on the
primary as well as the standby.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, May 15, 2024 at 02:56:54PM +0900, Masahiko Sawada wrote:
On Sat, May 4, 2024 at 7:36 AM Joe Conway <mail@joeconway.com> wrote:
On 5/3/24 11:44, Peter Eisentraut wrote:
On 03.05.24 16:13, Tom Lane wrote:
Peter Eisentraut <peter@eisentraut.org> writes:
On 30.04.24 19:29, Tom Lane wrote:
Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.But note that option 1 would prevent some replication that is currently
working.The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.Yes, that is understood. But anecdotally, replicating between x86-64 arm64 is
occasionally used for upgrades or migrations. In practice, this appears to have
mostly worked. If we now discover that it won't work with certain index
extension modules, it's usable for most users. Even if we say, you have to
reindex everything afterwards, it's probably still useful for these scenarios.
Like you, I would not introduce a ControlFileData block for sign-of-char,
given the signs of breakage in extension indexing only. That would lose much
useful migration capability. I'm sympathetic to Tom's point, which I'll
attempt to summarize as: a ControlFileData block is a promise we know how to
keep, whereas we should expect further bug discoveries without it. At the
same time, I put more weight on the apparently-wide span of functionality that
works fine. (I wonder whether any static analyzer does or practically could
detect char sign dependence with a decent false positive rate.)
+1
How about extending amcheck to support GIN and GIst indexes so that it
can detect potential data incompatibility due to changing 'char' to
'unsigned char'? I think these new tests would be useful also for
users to check if they really need to reindex indexes due to such
changes. Also we fix pg_trgm so that it uses 'unsigned char' in PG18.
Users who upgraded to PG18 can run the new amcheck tests on the
primary as well as the standby.
If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:
1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.
3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.
Independently, having amcheck for GIN and/or GiST would be great.
On Sun, May 19, 2024 at 6:46 AM Noah Misch <noah@leadboat.com> wrote:
On Wed, May 15, 2024 at 02:56:54PM +0900, Masahiko Sawada wrote:
How about extending amcheck to support GIN and GIst indexes so that it
can detect potential data incompatibility due to changing 'char' to
'unsigned char'? I think these new tests would be useful also for
users to check if they really need to reindex indexes due to such
changes. Also we fix pg_trgm so that it uses 'unsigned char' in PG18.
Users who upgraded to PG18 can run the new amcheck tests on the
primary as well as the standby.If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.
Even though it's true that signed char systems are the majority, it
would not be acceptable to force the need to scan pg_trgm indexes on
unsigned char systems.
3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.
Very interesting idea. How can new v18 users use the correct operator
class? I don't want to require users to specify the correct signed or
unsigned operator classes when creating a GIN index. Maybe we need to
dynamically use the correct compare function for the same operator
class depending on the char signedness. But is it possible to do it on
the extension (e.g. pg_trgm) side?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Aug 29, 2024 at 03:48:53PM -0500, Masahiko Sawada wrote:
On Sun, May 19, 2024 at 6:46 AM Noah Misch <noah@leadboat.com> wrote:
If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.Even though it's true that signed char systems are the majority, it
would not be acceptable to force the need to scan pg_trgm indexes on
unsigned char systems.3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.Very interesting idea. How can new v18 users use the correct operator
class? I don't want to require users to specify the correct signed or
unsigned operator classes when creating a GIN index. Maybe we need to
In brief, it wouldn't matter which operator class new v18 indexes use. The
documentation would focus on gin_trgm_ops and also say something like:
There's an additional operator class, gin_trgm_ops_unsigned. It behaves
exactly like gin_trgm_ops, but it uses a deprecated on-disk representation.
Use gin_trgm_ops in new indexes, but there's no disadvantage from continuing
to use gin_trgm_ops_unsigned. Before PostgreSQL 18, gin_trgm_ops used a
platform-dependent representation. pg_upgrade automatically uses
gin_trgm_ops_unsigned when upgrading from source data that used the
deprecated representation.
What concerns might users have, then? (Neither operator class would use plain
"char" in a context that affects on-disk state. They'll use "signed char" and
"unsigned char".)
dynamically use the correct compare function for the same operator
class depending on the char signedness. But is it possible to do it on
the extension (e.g. pg_trgm) side?
No, I don't think the extension can do that cleanly. It would need to store
the signedness in the index somehow, and GIN doesn't call the opclass at a
time facilitating that. That would need help from the core server.
On Fri, Aug 30, 2024 at 8:10 PM Noah Misch <noah@leadboat.com> wrote:
On Thu, Aug 29, 2024 at 03:48:53PM -0500, Masahiko Sawada wrote:
On Sun, May 19, 2024 at 6:46 AM Noah Misch <noah@leadboat.com> wrote:
If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.Even though it's true that signed char systems are the majority, it
would not be acceptable to force the need to scan pg_trgm indexes on
unsigned char systems.3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.Very interesting idea. How can new v18 users use the correct operator
class? I don't want to require users to specify the correct signed or
unsigned operator classes when creating a GIN index. Maybe we need toIn brief, it wouldn't matter which operator class new v18 indexes use. The
documentation would focus on gin_trgm_ops and also say something like:There's an additional operator class, gin_trgm_ops_unsigned. It behaves
exactly like gin_trgm_ops, but it uses a deprecated on-disk representation.
Use gin_trgm_ops in new indexes, but there's no disadvantage from continuing
to use gin_trgm_ops_unsigned. Before PostgreSQL 18, gin_trgm_ops used a
platform-dependent representation. pg_upgrade automatically uses
gin_trgm_ops_unsigned when upgrading from source data that used the
deprecated representation.What concerns might users have, then? (Neither operator class would use plain
"char" in a context that affects on-disk state. They'll use "signed char" and
"unsigned char".)
I think I understand your idea now. Since gin_trgm_ops will use
"signed char", there is no impact for v18 users -- they can continue
using gin_trgm_ops.
But how does pg_upgrade use gin_trgm_ops_unsigned? This opclass will
be created by executing the pg_trgm script for v18, but it isn't
executed during pg_upgrade. Another way would be to do these opclass
replacement when executing the pg_trgm's update script (i.e., 1.6 to
1.7).
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 06, 2024 at 02:07:10PM -0700, Masahiko Sawada wrote:
On Fri, Aug 30, 2024 at 8:10 PM Noah Misch <noah@leadboat.com> wrote:
On Thu, Aug 29, 2024 at 03:48:53PM -0500, Masahiko Sawada wrote:
On Sun, May 19, 2024 at 6:46 AM Noah Misch <noah@leadboat.com> wrote:
If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.Even though it's true that signed char systems are the majority, it
would not be acceptable to force the need to scan pg_trgm indexes on
unsigned char systems.3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.Very interesting idea. How can new v18 users use the correct operator
class? I don't want to require users to specify the correct signed or
unsigned operator classes when creating a GIN index. Maybe we need toIn brief, it wouldn't matter which operator class new v18 indexes use. The
documentation would focus on gin_trgm_ops and also say something like:There's an additional operator class, gin_trgm_ops_unsigned. It behaves
exactly like gin_trgm_ops, but it uses a deprecated on-disk representation.
Use gin_trgm_ops in new indexes, but there's no disadvantage from continuing
to use gin_trgm_ops_unsigned. Before PostgreSQL 18, gin_trgm_ops used a
platform-dependent representation. pg_upgrade automatically uses
gin_trgm_ops_unsigned when upgrading from source data that used the
deprecated representation.What concerns might users have, then? (Neither operator class would use plain
"char" in a context that affects on-disk state. They'll use "signed char" and
"unsigned char".)I think I understand your idea now. Since gin_trgm_ops will use
"signed char", there is no impact for v18 users -- they can continue
using gin_trgm_ops.
Right.
But how does pg_upgrade use gin_trgm_ops_unsigned? This opclass will
be created by executing the pg_trgm script for v18, but it isn't
executed during pg_upgrade. Another way would be to do these opclass
replacement when executing the pg_trgm's update script (i.e., 1.6 to
1.7).
Yes, that's one way to make it work. If we do it that way, it would be
critical to make the ALTER EXTENSION UPDATE run before anything uses the
index. Otherwise, we'll run new v18 "signed char" code on a v17 "unsigned
char" data file. Running ALTER EXTENSION UPDATE early enough should be
feasible, so that's fine. Some other options:
- If v18 "pg_dump -b" decides to emit CREATE INDEX ... gin_trgm_ops_unsigned,
then make it also emit the statements to create the opclass.
- Ship pg_trgm--1.6--1.7.sql in back branches. If the upgrade wants to use
gin_trgm_ops_unsigned, require the user to ALTER EXTENSION UPDATE first.
(In back branches, the C code behind gin_trgm_ops_unsigned could just raise
an error if called.)
What other options exist? I bet there are more.
Noah Misch <noah@leadboat.com> writes:
Yes, that's one way to make it work. If we do it that way, it would be
critical to make the ALTER EXTENSION UPDATE run before anything uses the
index. Otherwise, we'll run new v18 "signed char" code on a v17 "unsigned
char" data file. Running ALTER EXTENSION UPDATE early enough should be
feasible, so that's fine. Some other options:
- If v18 "pg_dump -b" decides to emit CREATE INDEX ... gin_trgm_ops_unsigned,
then make it also emit the statements to create the opclass.
- Ship pg_trgm--1.6--1.7.sql in back branches. If the upgrade wants to use
gin_trgm_ops_unsigned, require the user to ALTER EXTENSION UPDATE first.
(In back branches, the C code behind gin_trgm_ops_unsigned could just raise
an error if called.)
I feel like all of these are leaning heavily on users to get it right,
and therefore have a significant chance of breaking use-cases that
were perfectly fine before.
Remind me of why we can't do something like this:
* Add APIs that allow opclasses to read/write some space in the GIN
metapage. (We could get ambitious and add such APIs for other AMs
too, but doing it just for GIN is probably a prudent start.) Ensure
that this space is initially zeroed.
* In gin_trgm_ops, read a byte of this space and interpret it as
0 = unset
1 = signed chars
2 = unsigned chars
If the value is zero, set the byte on the basis of the native
char-signedness of the current platform. (Obviously, a
secondary server couldn't write the byte, and would have to
wait for the primary to update the value. In the meantime,
it's no more broken than today.)
* Subsequently, use either signed or unsigned comparisons
based on that value.
This would automatically do the right thing in all cases that
work today, and it'd provide the ability for cross-platform
replication to work in future. You can envision cases where
transferring a pre-existing index to a platform of the other
stripe would misbehave, but they're the same cases that fail
today, and the fix remains the same: reindex.
regards, tom lane
On Fri, Sep 06, 2024 at 06:36:53PM -0400, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
Yes, that's one way to make it work. If we do it that way, it would be
critical to make the ALTER EXTENSION UPDATE run before anything uses the
index. Otherwise, we'll run new v18 "signed char" code on a v17 "unsigned
char" data file. Running ALTER EXTENSION UPDATE early enough should be
feasible, so that's fine. Some other options:- If v18 "pg_dump -b" decides to emit CREATE INDEX ... gin_trgm_ops_unsigned,
then make it also emit the statements to create the opclass.- Ship pg_trgm--1.6--1.7.sql in back branches. If the upgrade wants to use
gin_trgm_ops_unsigned, require the user to ALTER EXTENSION UPDATE first.
(In back branches, the C code behind gin_trgm_ops_unsigned could just raise
an error if called.)I feel like all of these are leaning heavily on users to get it right,
What do you have in mind? I see things for the pg_upgrade implementation to
get right, but I don't see things for pg_upgrade users to get right.
The last option is disruptive for users of unsigned char platforms, since it
requires a two-step upgrade if upgrading from v11 or earlier. The other two
don't have that problem.
and therefore have a significant chance of breaking use-cases that
were perfectly fine before.Remind me of why we can't do something like this:
* Add APIs that allow opclasses to read/write some space in the GIN
metapage. (We could get ambitious and add such APIs for other AMs
too, but doing it just for GIN is probably a prudent start.) Ensure
that this space is initially zeroed.* In gin_trgm_ops, read a byte of this space and interpret it as
0 = unset
1 = signed chars
2 = unsigned chars
If the value is zero, set the byte on the basis of the native
char-signedness of the current platform. (Obviously, a
secondary server couldn't write the byte, and would have to
wait for the primary to update the value. In the meantime,
it's no more broken than today.)* Subsequently, use either signed or unsigned comparisons
based on that value.This would automatically do the right thing in all cases that
work today, and it'd provide the ability for cross-platform
replication to work in future. You can envision cases where
transferring a pre-existing index to a platform of the other
stripe would misbehave, but they're the same cases that fail
today, and the fix remains the same: reindex.
No reason you couldn't do it that way. Works for me.
Noah Misch <noah@leadboat.com> writes:
On Fri, Sep 06, 2024 at 06:36:53PM -0400, Tom Lane wrote:
I feel like all of these are leaning heavily on users to get it right,
What do you have in mind? I see things for the pg_upgrade implementation to
get right, but I don't see things for pg_upgrade users to get right.
Well, yeah, if you are willing to put pg_upgrade in charge of
executing ALTER EXTENSION UPDATE, then that would be a reasonably
omission-proof path. But we have always intended the pg_upgrade
process to *not* auto-update extensions, so I'm concerned about
the potential side-effects of drilling a hole in that policy.
As an example: even if we ensure that pg_trgm 1.6 to 1.7 is totally
transparent except for this fix, what happens if the user's old
database is still on some pre-1.6 version? Is it okay to force an
update that includes feature upgrades?
regards, tom lane
On Fri, Sep 06, 2024 at 07:37:09PM -0400, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
On Fri, Sep 06, 2024 at 06:36:53PM -0400, Tom Lane wrote:
I feel like all of these are leaning heavily on users to get it right,
What do you have in mind? I see things for the pg_upgrade implementation to
get right, but I don't see things for pg_upgrade users to get right.Well, yeah, if you are willing to put pg_upgrade in charge of
executing ALTER EXTENSION UPDATE, then that would be a reasonably
omission-proof path. But we have always intended the pg_upgrade
process to *not* auto-update extensions, so I'm concerned about
the potential side-effects of drilling a hole in that policy.
As an example: even if we ensure that pg_trgm 1.6 to 1.7 is totally
transparent except for this fix, what happens if the user's old
database is still on some pre-1.6 version? Is it okay to force an
update that includes feature upgrades?
Those are disadvantages of some of the options. I think it could be okay to
inject the mandatory upgrade here or just tell the user to update to 1.7
before attempting the upgrade (if not at 1.7, pg_upgrade would fail with an
error message to that effect). Your counterproposal avoids the issue, and I'm
good with that design.
On Fri, Sep 6, 2024 at 3:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
Yes, that's one way to make it work. If we do it that way, it would be
critical to make the ALTER EXTENSION UPDATE run before anything uses the
index. Otherwise, we'll run new v18 "signed char" code on a v17 "unsigned
char" data file. Running ALTER EXTENSION UPDATE early enough should be
feasible, so that's fine. Some other options:- If v18 "pg_dump -b" decides to emit CREATE INDEX ... gin_trgm_ops_unsigned,
then make it also emit the statements to create the opclass.- Ship pg_trgm--1.6--1.7.sql in back branches. If the upgrade wants to use
gin_trgm_ops_unsigned, require the user to ALTER EXTENSION UPDATE first.
(In back branches, the C code behind gin_trgm_ops_unsigned could just raise
an error if called.)I feel like all of these are leaning heavily on users to get it right,
and therefore have a significant chance of breaking use-cases that
were perfectly fine before.Remind me of why we can't do something like this:
* Add APIs that allow opclasses to read/write some space in the GIN
metapage. (We could get ambitious and add such APIs for other AMs
too, but doing it just for GIN is probably a prudent start.) Ensure
that this space is initially zeroed.* In gin_trgm_ops, read a byte of this space and interpret it as
0 = unset
1 = signed chars
2 = unsigned chars
If the value is zero, set the byte on the basis of the native
char-signedness of the current platform. (Obviously, a
secondary server couldn't write the byte, and would have to
wait for the primary to update the value. In the meantime,
it's no more broken than today.)
When do we set the byte on the primary server? If it's the first time
to use the GIN index, secondary servers would have to wait for the
primary to use the GIN index, which could be an unpredictable time or
it may never come depending on index usages. Probably we can make
pg_upgrade set the byte in the meta page of GIN indexes that use the
gin_trgm_ops.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes:
When do we set the byte on the primary server? If it's the first time
to use the GIN index, secondary servers would have to wait for the
primary to use the GIN index, which could be an unpredictable time or
it may never come depending on index usages. Probably we can make
pg_upgrade set the byte in the meta page of GIN indexes that use the
gin_trgm_ops.
Hmm, perhaps. That plus set-it-during-index-create would remove the
need for dynamic update like I suggested. So very roughly the amount
of complexity would balance out. Do you have an idea for how we'd get
this to happen during pg_upgrade, exactly?
regards, tom lane
On Mon, Sep 9, 2024 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
When do we set the byte on the primary server? If it's the first time
to use the GIN index, secondary servers would have to wait for the
primary to use the GIN index, which could be an unpredictable time or
it may never come depending on index usages. Probably we can make
pg_upgrade set the byte in the meta page of GIN indexes that use the
gin_trgm_ops.Hmm, perhaps. That plus set-it-during-index-create would remove the
need for dynamic update like I suggested. So very roughly the amount
of complexity would balance out.
Yes, I think your set-it-during-index-create would be necessary.
Do you have an idea for how we'd get
this to happen during pg_upgrade, exactly?
What I was thinking is that we have "pg_dump --binary-upgrade" emit a
function call, say "SELECT binary_upgrade_update_gin_meta_page()" for
each GIN index, and the meta pages are updated when restoring the
schemas.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Mon, Sep 9, 2024 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do you have an idea for how we'd get
this to happen during pg_upgrade, exactly?
What I was thinking is that we have "pg_dump --binary-upgrade" emit a
function call, say "SELECT binary_upgrade_update_gin_meta_page()" for
each GIN index, and the meta pages are updated when restoring the
schemas.
Hmm, but ...
1. IIRC we don't move the relation files into the new cluster until
after we've run the schema dump/restore step. I think this'd have to
be driven in some other way than from the pg_dump output. I guess we
could have pg_upgrade start up the new postmaster and call a function
in each DB, which would have to scan for GIN indexes by itself.
2. How will this interact with --link mode? I don't see how it
doesn't involve scribbling on files that are shared with the old
cluster, leading to possible problems if the pg_upgrade fails later
and the user tries to go back to using the old cluster. It's not so
much the metapage update that is worrisome --- we're assuming that
that will modify storage that's unused in old versions. But the
change would be unrecorded in the old cluster's WAL, which sounds
risky.
Maybe we could get away with forcing --copy mode for affected
indexes, but that feels a bit messy. We'd not want to do it
for unaffected indexes, so the copying code would need to know
a great deal about this problem.
regards, tom lane
On Mon, Sep 9, 2024 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Mon, Sep 9, 2024 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do you have an idea for how we'd get
this to happen during pg_upgrade, exactly?What I was thinking is that we have "pg_dump --binary-upgrade" emit a
function call, say "SELECT binary_upgrade_update_gin_meta_page()" for
each GIN index, and the meta pages are updated when restoring the
schemas.Hmm, but ...
1. IIRC we don't move the relation files into the new cluster until
after we've run the schema dump/restore step. I think this'd have to
be driven in some other way than from the pg_dump output. I guess we
could have pg_upgrade start up the new postmaster and call a function
in each DB, which would have to scan for GIN indexes by itself.
You're right.
2. How will this interact with --link mode? I don't see how it
doesn't involve scribbling on files that are shared with the old
cluster, leading to possible problems if the pg_upgrade fails later
and the user tries to go back to using the old cluster. It's not so
much the metapage update that is worrisome --- we're assuming that
that will modify storage that's unused in old versions. But the
change would be unrecorded in the old cluster's WAL, which sounds
risky.Maybe we could get away with forcing --copy mode for affected
indexes, but that feels a bit messy. We'd not want to do it
for unaffected indexes, so the copying code would need to know
a great deal about this problem.
Good point. I agree that it would not be a very good idea to force --copy mode.
An alternative way would be that we store the char signedness in the
control file, and gin_trgm_ops opclass reads it if the bytes in the
meta page shows 'unset'. The char signedness in the control file
doesn't mean to be used for the compatibility check for physical
replication but used as a hint. But it also could be a bit messy,
though.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes:
An alternative way would be that we store the char signedness in the
control file, and gin_trgm_ops opclass reads it if the bytes in the
meta page shows 'unset'. The char signedness in the control file
doesn't mean to be used for the compatibility check for physical
replication but used as a hint. But it also could be a bit messy,
though.
Yeah, that seems like it could work. But are we sure that replicas
get a copy of the primary's control file rather than creating their
own?
regards, tom lane
On Tue, Sep 10, 2024 at 11:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
An alternative way would be that we store the char signedness in the
control file, and gin_trgm_ops opclass reads it if the bytes in the
meta page shows 'unset'. The char signedness in the control file
doesn't mean to be used for the compatibility check for physical
replication but used as a hint. But it also could be a bit messy,
though.Yeah, that seems like it could work. But are we sure that replicas
get a copy of the primary's control file rather than creating their
own?
Yes, I think so. Since at least the system identifiers of primary and
replicas must be identical for physical replication, if replicas use
their own control files then they cannot start the replication.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Tue, Sep 10, 2024 at 11:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, that seems like it could work. But are we sure that replicas
get a copy of the primary's control file rather than creating their
own?
Yes, I think so. Since at least the system identifiers of primary and
replicas must be identical for physical replication, if replicas use
their own control files then they cannot start the replication.
Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that? We might need some caching
mechanism to make that cheap enough, but accessing the current index's
metapage is far from free either.
regards, tom lane
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Tue, Sep 10, 2024 at 11:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, that seems like it could work. But are we sure that replicas
get a copy of the primary's control file rather than creating their
own?Yes, I think so. Since at least the system identifiers of primary and
replicas must be identical for physical replication, if replicas use
their own control files then they cannot start the replication.Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that?
Yes.
On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Tue, Sep 10, 2024 at 11:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, that seems like it could work. But are we sure that replicas
get a copy of the primary's control file rather than creating their
own?Yes, I think so. Since at least the system identifiers of primary and
replicas must be identical for physical replication, if replicas use
their own control files then they cannot start the replication.Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that?Yes.
Indeed.
I've attached a PoC patch for this idea. We write the default char
signedness to the control file at initdb time. Then when comparing two
trgms, pg_trgm opclasses use a comparison function based on the char
signedness of the cluster. I've confirmed that the patch fixes the
reported case at least.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
poc_char_signedness.patchapplication/octet-stream; name=poc_char_signedness.patchDownload
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index afb0adb222..5534abe680 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -42,8 +42,6 @@
typedef char trgm[3];
#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
-#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
#define CPTRGM(a,b) do { \
*(((char*)(a))+0) = *(((char*)(b))+0); \
@@ -51,6 +49,8 @@ typedef char trgm[3];
*(((char*)(a))+2) = *(((char*)(b))+2); \
} while(0)
+extern int (*CMPTRGM) (const void *a, const void *b);
+
#ifdef KEEPONLYALNUM
#define ISWORDCHR(c) (t_isalnum(c))
#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index c509d15ee4..75e670edf6 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -40,6 +40,9 @@ PG_FUNCTION_INFO_V1(strict_word_similarity_commutator_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_commutator_op);
+static inline int CMPTRGM_CHOOSE(const void *a, const void *b);
+int (*CMPTRGM) (const void *a, const void *b) = CMPTRGM_CHOOSE;
+
/* Trigram with position */
typedef struct
{
@@ -105,6 +108,45 @@ _PG_init(void)
MarkGUCPrefixReserved("pg_trgm");
}
+/*
+ * Functions for comparing two trgms while treating each char as "signed char" or
+ * "unsigned char".
+ */
+static inline int
+CMPTRGM_SIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_S(a,b,i) CMPCHAR( *(((const signed char*)(a))+i), *(((const signed char*)(b))+i) )
+
+ return CMPPCHAR_S(a, b, 0) ? CMPPCHAR_S(a, b, 0)
+ : (CMPPCHAR_S(a, b, 1) ? CMPPCHAR_S(a, b, 1)
+ : CMPPCHAR_S(a, b, 2));
+}
+
+static inline int
+CMPTRGM_UNSIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_UNS(a,b,i) CMPCHAR( *(((const unsigned char*)(a))+i), *(((const unsigned char*)(b))+i) )
+
+ return CMPPCHAR_UNS(a, b, 0) ? CMPPCHAR_UNS(a, b, 0)
+ : (CMPPCHAR_UNS(a, b, 1) ? CMPPCHAR_UNS(a, b, 1)
+ : CMPPCHAR_UNS(a, b, 2));
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer so
+ * that subsequent calls are routed directly to the chosen implementation.
+ */
+static inline int
+CMPTRGM_CHOOSE(const void *a, const void *b)
+{
+ if (GetDefaultCharSignedness())
+ CMPTRGM = CMPTRGM_SIGNED;
+ else
+ CMPTRGM = CMPTRGM_UNSIGNED;
+
+ return CMPTRGM(a, b);
+}
+
/*
* Deprecated function.
* Use "pg_trgm.similarity_threshold" GUC variable instead of this function.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e96aa4d1cb..89a86009c3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
#include <math.h>
#include <time.h>
#include <fcntl.h>
+#include <limits.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
@@ -4256,6 +4257,18 @@ WriteControlFile(void)
ControlFile->float8ByVal = FLOAT8PASSBYVAL;
+ /*
+ * The signedness of the char type is implementation-defined. For example
+ * on x86 architecture CPUs, the char data type is typically treated as
+ * signed by default whereas on aarch architecture CPUs it is typically
+ * treated as unsigned by default.
+ */
+#if CHAR_MIN != 0
+ ControlFile->signed_char = true;
+#else
+ ControlFile->signed_char = false;
+#endif
+
/* Contents are protected with a CRC */
INIT_CRC32C(ControlFile->crc);
COMP_CRC32C(ControlFile->crc,
@@ -4584,6 +4597,16 @@ DataChecksumsEnabled(void)
return (ControlFile->data_checksum_version > 0);
}
+/*
+ * Return true if the cluster was initialized on a platform where
+ * the default signedness of char is "signed".
+ */
+bool
+GetDefaultCharSignedness(void)
+{
+ return (ControlFile->signed_char);
+}
+
/*
* Returns a fake LSN for unlogged relations.
*
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..769a9bfd98 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -323,6 +323,8 @@ main(int argc, char *argv[])
_("64-bit integers"));
printf(_("Float8 argument passing: %s\n"),
(ControlFile->float8ByVal ? _("by value") : _("by reference")));
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile->signed_char ? _("signed") : _("unsigned")));
printf(_("Data page checksum version: %u\n"),
ControlFile->data_checksum_version);
printf(_("Mock authentication nonce: %s\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 083810f5b4..ef64f67424 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -230,6 +230,7 @@ extern XLogRecPtr GetXLogWriteRecPtr(void);
extern uint64 GetSystemIdentifier(void);
extern char *GetMockAuthenticationNonce(void);
extern bool DataChecksumsEnabled(void);
+extern bool GetDefaultCharSignedness(void);
extern XLogRecPtr GetFakeLSNForUnloggedRel(void);
extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e80ff8e414..7e79d147ce 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -221,6 +221,12 @@ typedef struct ControlFileData
/* Are data pages protected by checksums? Zero if no checksum version */
uint32 data_checksum_version;
+ /*
+ * True if the default signedness of char is "signed" on a platform where
+ * the control file is created.
+ */
+ bool signed_char;
+
/*
* Random nonce, used in authentication requests that need to proceed
* based on values that are cluster-unique, like a SASL exchange that
On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that?
I've attached a PoC patch for this idea. We write the default char
signedness to the control file at initdb time. Then when comparing two
trgms, pg_trgm opclasses use a comparison function based on the char
signedness of the cluster. I've confirmed that the patch fixes the
reported case at least.
I agree that proves the concept.
On Mon, Sep 16, 2024 at 9:24 AM Noah Misch <noah@leadboat.com> wrote:
On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that?I've attached a PoC patch for this idea. We write the default char
signedness to the control file at initdb time. Then when comparing two
trgms, pg_trgm opclasses use a comparison function based on the char
signedness of the cluster. I've confirmed that the patch fixes the
reported case at least.I agree that proves the concept.
Thanks. I like the simplicity of this approach. If we agree with this
approach, I'd like to proceed with it.
Regardless of what approach we take, I wanted to provide some
regression tests for these changes, but I could not come up with a
reasonable idea. It would be great if we could do something like
027_stream_regress.pl on cross-architecture replication. But just
executing 027_stream_regress.pl on cross-architecture replication
could not be sufficient since we would like to confirm query results
as well. If we could have a reasonable tool or way, it would also help
find other similar bugs related architectural differences.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 17, 2024 at 09:43:41PM -0700, Masahiko Sawada wrote:
On Mon, Sep 16, 2024 at 9:24 AM Noah Misch <noah@leadboat.com> wrote:
On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that?I've attached a PoC patch for this idea. We write the default char
signedness to the control file at initdb time. Then when comparing two
trgms, pg_trgm opclasses use a comparison function based on the char
signedness of the cluster. I've confirmed that the patch fixes the
reported case at least.I agree that proves the concept.
Thanks. I like the simplicity of this approach. If we agree with this
approach, I'd like to proceed with it.
Works for me.
Regardless of what approach we take, I wanted to provide some
regression tests for these changes, but I could not come up with a
reasonable idea. It would be great if we could do something like
027_stream_regress.pl on cross-architecture replication. But just
executing 027_stream_regress.pl on cross-architecture replication
could not be sufficient since we would like to confirm query results
as well. If we could have a reasonable tool or way, it would also help
find other similar bugs related architectural differences.
Perhaps add a regress.c function that changes the control file flag and
flushes the change to disk?
On Wed, Sep 18, 2024 at 6:46 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 17, 2024 at 09:43:41PM -0700, Masahiko Sawada wrote:
On Mon, Sep 16, 2024 at 9:24 AM Noah Misch <noah@leadboat.com> wrote:
On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that?I've attached a PoC patch for this idea. We write the default char
signedness to the control file at initdb time. Then when comparing two
trgms, pg_trgm opclasses use a comparison function based on the char
signedness of the cluster. I've confirmed that the patch fixes the
reported case at least.I agree that proves the concept.
Thanks. I like the simplicity of this approach. If we agree with this
approach, I'd like to proceed with it.Works for me.
Regardless of what approach we take, I wanted to provide some
regression tests for these changes, but I could not come up with a
reasonable idea. It would be great if we could do something like
027_stream_regress.pl on cross-architecture replication. But just
executing 027_stream_regress.pl on cross-architecture replication
could not be sufficient since we would like to confirm query results
as well. If we could have a reasonable tool or way, it would also help
find other similar bugs related architectural differences.Perhaps add a regress.c function that changes the control file flag and
flushes the change to disk?
I've tried this idea but found out that extensions can flush the
controlfile on the disk after flipping the char signedness value, they
cannot update the controlfile data on the shared memory. And, when the
server shuts down, the on-disk controlfile is updated with the
on-memory controlfile data, so the char signedness is reverted.
We can add a function to set the char signedness of on-memory
controlfile data or add a new option to pg_resetwal to change the char
signedness on-disk controlfile data but they seem to be overkill to me
and I'm concerned about misusing these option and function.
I've updated and splitted patches. They don't have regression tests yet.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0002-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchapplication/octet-stream; name=v1-0002-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchDownload
From 04e97f86fd55e3c848c2fc527d2c7de567c08923 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 15:17:02 -0700
Subject: [PATCH v1 2/2] Fix an issue with index scan using pg_trgm due to char
signedness on different architectures.
GIN and GiST indexes utilizing pg_trgm's opclasses store sorted
trigrams within index tuples. When comparing and sorting each trigram,
pg_trgm treats each character as a 'char[3]' type in C. However, the
char type in C can be interpreted as either signed char or unsigned
char, depending on the platform, if the signedness is not explicitly
specified. Consequently, during replication between different CPU
architectures, there was an issue where index scans on standby servers
could not locate matching index tuples due to the differing treatment
of character signedness.
This change introduces comparison functions for trgm that explicitly
handle signed char and unsigned char. The appropriate comparison
function will be dynamically selected based on the character
signedness stored in the control file. Therefore, upgraded clusters
can utilize the indexes without rebuilding, provided the cluster
upgrade occurs on platforms with the same character signedness as the
original cluster initialization.
The default char signedness information was introduced in XXXX, so no
backpatch.
Reviewed-by:
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
contrib/pg_trgm/trgm.h | 4 +--
contrib/pg_trgm/trgm_op.c | 42 ++++++++++++++++++++++++++++
src/backend/storage/lmgr/predicate.c | 5 ++++
3 files changed, 49 insertions(+), 2 deletions(-)
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index afb0adb222..5534abe680 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -42,8 +42,6 @@
typedef char trgm[3];
#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
-#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
#define CPTRGM(a,b) do { \
*(((char*)(a))+0) = *(((char*)(b))+0); \
@@ -51,6 +49,8 @@ typedef char trgm[3];
*(((char*)(a))+2) = *(((char*)(b))+2); \
} while(0)
+extern int (*CMPTRGM) (const void *a, const void *b);
+
#ifdef KEEPONLYALNUM
#define ISWORDCHR(c) (t_isalnum(c))
#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index c509d15ee4..75e670edf6 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -40,6 +40,9 @@ PG_FUNCTION_INFO_V1(strict_word_similarity_commutator_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_commutator_op);
+static inline int CMPTRGM_CHOOSE(const void *a, const void *b);
+int (*CMPTRGM) (const void *a, const void *b) = CMPTRGM_CHOOSE;
+
/* Trigram with position */
typedef struct
{
@@ -105,6 +108,45 @@ _PG_init(void)
MarkGUCPrefixReserved("pg_trgm");
}
+/*
+ * Functions for comparing two trgms while treating each char as "signed char" or
+ * "unsigned char".
+ */
+static inline int
+CMPTRGM_SIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_S(a,b,i) CMPCHAR( *(((const signed char*)(a))+i), *(((const signed char*)(b))+i) )
+
+ return CMPPCHAR_S(a, b, 0) ? CMPPCHAR_S(a, b, 0)
+ : (CMPPCHAR_S(a, b, 1) ? CMPPCHAR_S(a, b, 1)
+ : CMPPCHAR_S(a, b, 2));
+}
+
+static inline int
+CMPTRGM_UNSIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_UNS(a,b,i) CMPCHAR( *(((const unsigned char*)(a))+i), *(((const unsigned char*)(b))+i) )
+
+ return CMPPCHAR_UNS(a, b, 0) ? CMPPCHAR_UNS(a, b, 0)
+ : (CMPPCHAR_UNS(a, b, 1) ? CMPPCHAR_UNS(a, b, 1)
+ : CMPPCHAR_UNS(a, b, 2));
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer so
+ * that subsequent calls are routed directly to the chosen implementation.
+ */
+static inline int
+CMPTRGM_CHOOSE(const void *a, const void *b)
+{
+ if (GetDefaultCharSignedness())
+ CMPTRGM = CMPTRGM_SIGNED;
+ else
+ CMPTRGM = CMPTRGM_UNSIGNED;
+
+ return CMPTRGM(a, b);
+}
+
/*
* Deprecated function.
* Use "pg_trgm.similarity_threshold" GUC variable instead of this function.
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e24a0f2fdb..2648c1e260 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2588,6 +2588,11 @@ PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot)
if (!SerializationNeededForRead(relation, snapshot))
return;
+ ereport(LOG, (errmsg("predicate %s blk %u",
+ RelationGetRelationName(relation),
+ blkno),
+ errbacktrace()));
+
SET_PREDICATELOCKTARGETTAG_PAGE(tag,
relation->rd_locator.dbOid,
relation->rd_id,
--
2.39.3
v1-0001-Add-default_char_signedness-to-ControlFileData.patchapplication/octet-stream; name=v1-0001-Add-default_char_signedness-to-ControlFileData.patchDownload
From 312c9437fd272faff8385617cb057a7f5bf85b6e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 11:08:57 -0700
Subject: [PATCH v1 1/2] Add default_char_signedness to ControlFileData
The signedness of the 'char' type in C is
implementation-dependent. For instance, 'signed char' is used by
default on x86 CPUs, while 'unsigned char' is used on aarch CPUs. This
can lead to inconsistent results when comparing char data across
different platforms.
This commit introduces a new 'default_char_signedness' field in
ControlFileData to store the signedness of the 'char' type on the
platform where the database cluster was initialized.
While this change does not encourage the use of 'char' without
explicitly specifying its signedness, it provides a hint to ensure
consistent behavior for indexes (e.g., GIN and GiST) and tables that
already store data sorted by the 'char' type on disk, especially in
cross-platform replication scenarios.
XXX bump catversion.
Reviewed-by:
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/func.sgml | 5 +++++
src/backend/access/transam/xlog.c | 23 +++++++++++++++++++++++
src/backend/utils/misc/pg_controldata.c | 7 +++++--
src/bin/pg_controldata/pg_controldata.c | 2 ++
src/include/access/xlog.h | 1 +
src/include/catalog/pg_control.h | 6 ++++++
src/include/catalog/pg_proc.dat | 6 +++---
7 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 461fc3f437..e2e9d97e77 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27831,6 +27831,11 @@ acl | {postgres=arwdDxtm/postgres,foo=r/postgres}
<entry><type>integer</type></entry>
</row>
+ <row>
+ <entry><structfield>signed_char</structfield></entry>
+ <entry><type>bool</type></entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e96aa4d1cb..58cec9848d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
#include <math.h>
#include <time.h>
#include <fcntl.h>
+#include <limits.h> /* for CHAR_MIN */
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
@@ -4256,6 +4257,18 @@ WriteControlFile(void)
ControlFile->float8ByVal = FLOAT8PASSBYVAL;
+ /*
+ * The signedness of the char type is implementation-defined. For example
+ * on x86 architecture CPUs, the char data type is typically treated as
+ * signed by default whereas on aarch architecture CPUs it is typically
+ * treated as unsigned by default.
+ */
+#if CHAR_MIN != 0
+ ControlFile->default_char_signedness = true;
+#else
+ ControlFile->default_char_signedness = false;
+#endif
+
/* Contents are protected with a CRC */
INIT_CRC32C(ControlFile->crc);
COMP_CRC32C(ControlFile->crc,
@@ -4584,6 +4597,16 @@ DataChecksumsEnabled(void)
return (ControlFile->data_checksum_version > 0);
}
+/*
+ * Return true if the cluster was initialized on a platform where
+ * the default signedness of char is "signed".
+ */
+bool
+GetDefaultCharSignedness(void)
+{
+ return ControlFile->default_char_signedness;
+}
+
/*
* Returns a fake LSN for unlogged relations.
*
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 98c932dc7b..7c750bf0b9 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -203,8 +203,8 @@ pg_control_recovery(PG_FUNCTION_ARGS)
Datum
pg_control_init(PG_FUNCTION_ARGS)
{
- Datum values[11];
- bool nulls[11];
+ Datum values[12];
+ bool nulls[12];
TupleDesc tupdesc;
HeapTuple htup;
ControlFileData *ControlFile;
@@ -254,6 +254,9 @@ pg_control_init(PG_FUNCTION_ARGS)
values[10] = Int32GetDatum(ControlFile->data_checksum_version);
nulls[10] = false;
+ values[11] = BoolGetDatum(ControlFile->default_char_signedness);
+ nulls[11] = false;
+
htup = heap_form_tuple(tupdesc, values, nulls);
PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..718e7940fd 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -323,6 +323,8 @@ main(int argc, char *argv[])
_("64-bit integers"));
printf(_("Float8 argument passing: %s\n"),
(ControlFile->float8ByVal ? _("by value") : _("by reference")));
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile->default_char_signedness ? _("signed") : _("unsigned")));
printf(_("Data page checksum version: %u\n"),
ControlFile->data_checksum_version);
printf(_("Mock authentication nonce: %s\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 083810f5b4..ef64f67424 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -230,6 +230,7 @@ extern XLogRecPtr GetXLogWriteRecPtr(void);
extern uint64 GetSystemIdentifier(void);
extern char *GetMockAuthenticationNonce(void);
extern bool DataChecksumsEnabled(void);
+extern bool GetDefaultCharSignedness(void);
extern XLogRecPtr GetFakeLSNForUnloggedRel(void);
extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e80ff8e414..3663790589 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -221,6 +221,12 @@ typedef struct ControlFileData
/* Are data pages protected by checksums? Zero if no checksum version */
uint32 data_checksum_version;
+ /*
+ * True if the default signedness of char is "signed" on a platform where
+ * the cluster is initialized.
+ */
+ bool default_char_signedness;
+
/*
* Random nonce, used in authentication requests that need to proceed
* based on values that are cluster-unique, like a SASL exchange that
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ff5436acac..acdf63aa62 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12060,9 +12060,9 @@
descr => 'pg_controldata init state information as a function',
proname => 'pg_control_init', provolatile => 'v', prorettype => 'record',
proargtypes => '',
- proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4}',
- proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version}',
+ proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4,bool}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version,default_char_signedness}',
prosrc => 'pg_control_init' },
# subscripting support for built-in types
--
2.39.3
On Wed, Sep 25, 2024 at 03:46:27PM -0700, Masahiko Sawada wrote:
On Wed, Sep 18, 2024 at 6:46 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 17, 2024 at 09:43:41PM -0700, Masahiko Sawada wrote:
On Mon, Sep 16, 2024 at 9:24 AM Noah Misch <noah@leadboat.com> wrote:
On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that?
Thanks. I like the simplicity of this approach. If we agree with this
approach, I'd like to proceed with it.Works for me.
Regardless of what approach we take, I wanted to provide some
regression tests for these changes, but I could not come up with a
reasonable idea. It would be great if we could do something like
027_stream_regress.pl on cross-architecture replication. But just
executing 027_stream_regress.pl on cross-architecture replication
could not be sufficient since we would like to confirm query results
as well. If we could have a reasonable tool or way, it would also help
find other similar bugs related architectural differences.Perhaps add a regress.c function that changes the control file flag and
flushes the change to disk?I've tried this idea but found out that extensions can flush the
controlfile on the disk after flipping the char signedness value, they
cannot update the controlfile data on the shared memory. And, when the
server shuts down, the on-disk controlfile is updated with the
on-memory controlfile data, so the char signedness is reverted.We can add a function to set the char signedness of on-memory
controlfile data or add a new option to pg_resetwal to change the char
signedness on-disk controlfile data but they seem to be overkill to me
and I'm concerned about misusing these option and function.
Before the project is over, pg_upgrade will have some way to set the control
file signedness to the source cluster signedness. I think pg_upgrade should
also take an option for overriding the source cluster signedness for source
clusters v17 and older. That way, a user knowing they copied the v17 source
cluster from x86 to ARM can pg_upgrade properly without the prerequisite of
acquiring an x86 VM. Once that all exists, perhaps it will be clearer how
best to change signedness for testing.
While this change does not encourage the use of 'char' without
explicitly specifying its signedness, it provides a hint to ensure
consistent behavior for indexes (e.g., GIN and GiST) and tables that
already store data sorted by the 'char' type on disk, especially in
cross-platform replication scenarios.
Let's put that sort of information in the GetDefaultCharSignedness() header
comment. New code should use explicit "signed char" or "unsigned char" when
it matters for data file compatibility. GetDefaultCharSignedness() exists for
code that deals with pre-v18 data files, where we accidentally let C
implementation signedness affect persistent data.
@@ -4256,6 +4257,18 @@ WriteControlFile(void)
ControlFile->float8ByVal = FLOAT8PASSBYVAL;
+ /* + * The signedness of the char type is implementation-defined. For example + * on x86 architecture CPUs, the char data type is typically treated as + * signed by default whereas on aarch architecture CPUs it is typically + * treated as unsigned by default. + */ +#if CHAR_MIN != 0 + ControlFile->default_char_signedness = true; +#else + ControlFile->default_char_signedness = false; +#endif
This has initdb set the field to the current C implementation's signedness. I
wonder if, instead, initdb should set signedness=true unconditionally. Then
the only source of signedness=false will be pg_upgrade.
Advantage: signedness=false will get rarer over time. If we had known about
this problem during the last development cycle that forced initdb (v8.3), we
would have made all clusters signed or all clusters unsigned. Making
pg_upgrade the only source of signedness=false will cause the population of
database clusters to converge toward that retrospective ideal.
Disadvantage: testing signedness=false will require more planning.
What other merits should we consider as part of deciding?
On Mon, Sep 30, 2024 at 11:49 AM Noah Misch <noah@leadboat.com> wrote:
On Wed, Sep 25, 2024 at 03:46:27PM -0700, Masahiko Sawada wrote:
On Wed, Sep 18, 2024 at 6:46 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 17, 2024 at 09:43:41PM -0700, Masahiko Sawada wrote:
On Mon, Sep 16, 2024 at 9:24 AM Noah Misch <noah@leadboat.com> wrote:
On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah@leadboat.com> wrote:
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
Got it. So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages. Could we simply read the (primary's)
signedness out of pg_control and use that?Thanks. I like the simplicity of this approach. If we agree with this
approach, I'd like to proceed with it.Works for me.
Regardless of what approach we take, I wanted to provide some
regression tests for these changes, but I could not come up with a
reasonable idea. It would be great if we could do something like
027_stream_regress.pl on cross-architecture replication. But just
executing 027_stream_regress.pl on cross-architecture replication
could not be sufficient since we would like to confirm query results
as well. If we could have a reasonable tool or way, it would also help
find other similar bugs related architectural differences.Perhaps add a regress.c function that changes the control file flag and
flushes the change to disk?I've tried this idea but found out that extensions can flush the
controlfile on the disk after flipping the char signedness value, they
cannot update the controlfile data on the shared memory. And, when the
server shuts down, the on-disk controlfile is updated with the
on-memory controlfile data, so the char signedness is reverted.We can add a function to set the char signedness of on-memory
controlfile data or add a new option to pg_resetwal to change the char
signedness on-disk controlfile data but they seem to be overkill to me
and I'm concerned about misusing these option and function.Before the project is over, pg_upgrade will have some way to set the control
file signedness to the source cluster signedness. I think pg_upgrade should
also take an option for overriding the source cluster signedness for source
clusters v17 and older. That way, a user knowing they copied the v17 source
cluster from x86 to ARM can pg_upgrade properly without the prerequisite of
acquiring an x86 VM.
Good idea.
Once that all exists, perhaps it will be clearer how
best to change signedness for testing.
Agreed.
While this change does not encourage the use of 'char' without
explicitly specifying its signedness, it provides a hint to ensure
consistent behavior for indexes (e.g., GIN and GiST) and tables that
already store data sorted by the 'char' type on disk, especially in
cross-platform replication scenarios.Let's put that sort of information in the GetDefaultCharSignedness() header
comment. New code should use explicit "signed char" or "unsigned char" when
it matters for data file compatibility. GetDefaultCharSignedness() exists for
code that deals with pre-v18 data files, where we accidentally let C
implementation signedness affect persistent data.
Agreed, will fix.
@@ -4256,6 +4257,18 @@ WriteControlFile(void)
ControlFile->float8ByVal = FLOAT8PASSBYVAL;
+ /* + * The signedness of the char type is implementation-defined. For example + * on x86 architecture CPUs, the char data type is typically treated as + * signed by default whereas on aarch architecture CPUs it is typically + * treated as unsigned by default. + */ +#if CHAR_MIN != 0 + ControlFile->default_char_signedness = true; +#else + ControlFile->default_char_signedness = false; +#endifThis has initdb set the field to the current C implementation's signedness. I
wonder if, instead, initdb should set signedness=true unconditionally. Then
the only source of signedness=false will be pg_upgrade.Advantage: signedness=false will get rarer over time. If we had known about
this problem during the last development cycle that forced initdb (v8.3), we
would have made all clusters signed or all clusters unsigned. Making
pg_upgrade the only source of signedness=false will cause the population of
database clusters to converge toward that retrospective ideal.
I think it's a good idea. Being able to treat one case as a rarer one
rather than treating both cases equally may have various advantages in
the future, for example when developing or investigating a problem.
Disadvantage: testing signedness=false will require more planning.
If testing signedness=false always requires pg_upgrade, there might be
some cumbersomeness. Inventing a testing-purpose-only tool (e.g., a
CLI program) that changes the signedness might make tests easier.
What other merits should we consider as part of deciding?
Considering that the population of database cluster signedness will
converge to signedness=true in the future, we can consider using
-fsigned-char to prevent similar problems for the future. We need to
think about possible side-effects as well, though.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 01, 2024 at 11:55:48AM -0700, Masahiko Sawada wrote:
On Mon, Sep 30, 2024 at 11:49 AM Noah Misch <noah@leadboat.com> wrote:
This has initdb set the field to the current C implementation's signedness. I
wonder if, instead, initdb should set signedness=true unconditionally. Then
the only source of signedness=false will be pg_upgrade.Advantage: signedness=false will get rarer over time. If we had known about
this problem during the last development cycle that forced initdb (v8.3), we
would have made all clusters signed or all clusters unsigned. Making
pg_upgrade the only source of signedness=false will cause the population of
database clusters to converge toward that retrospective ideal.I think it's a good idea. Being able to treat one case as a rarer one
rather than treating both cases equally may have various advantages in
the future, for example when developing or investigating a problem.Disadvantage: testing signedness=false will require more planning.
If testing signedness=false always requires pg_upgrade, there might be
some cumbersomeness. Inventing a testing-purpose-only tool (e.g., a
CLI program) that changes the signedness might make tests easier.What other merits should we consider as part of deciding?
Considering that the population of database cluster signedness will
converge to signedness=true in the future, we can consider using
-fsigned-char to prevent similar problems for the future. We need to
think about possible side-effects as well, though.
It's good to think about -fsigned-char. While I find it tempting, several
things would need to hold for us to benefit from it:
- Every supported compiler has to offer it or an equivalent.
- The non-compiler parts of every supported C implementation need to
cooperate. For example, CHAR_MIN must change in response to the flag. See
the first comment in cash_in().
- Libraries we depend on can't do anything incompatible with it.
Given that, I would lean toward not using -fsigned-char. It's unlikely all
three things will hold. Even if they do, the benefit is not large.
Noah Misch <noah@leadboat.com> writes:
On Tue, Oct 01, 2024 at 11:55:48AM -0700, Masahiko Sawada wrote:
Considering that the population of database cluster signedness will
converge to signedness=true in the future, we can consider using
-fsigned-char to prevent similar problems for the future. We need to
think about possible side-effects as well, though.
It's good to think about -fsigned-char. While I find it tempting, several
things would need to hold for us to benefit from it:
- Every supported compiler has to offer it or an equivalent.
- The non-compiler parts of every supported C implementation need to
cooperate. For example, CHAR_MIN must change in response to the flag. See
the first comment in cash_in().
- Libraries we depend on can't do anything incompatible with it.
Given that, I would lean toward not using -fsigned-char. It's unlikely all
three things will hold. Even if they do, the benefit is not large.
I am very, very strongly against deciding that Postgres will only
support one setting of char signedness. It's a step on the way to
hardware monoculture, and we know where that eventually leads.
(In other words, I categorically reject Sawada-san's assertion
that signed chars will become universal. I'd reject the opposite
assertion as well.)
regards, tom lane
On Tue, Oct 1, 2024 at 8:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
On Tue, Oct 01, 2024 at 11:55:48AM -0700, Masahiko Sawada wrote:
Considering that the population of database cluster signedness will
converge to signedness=true in the future, we can consider using
-fsigned-char to prevent similar problems for the future. We need to
think about possible side-effects as well, though.It's good to think about -fsigned-char. While I find it tempting, several
things would need to hold for us to benefit from it:- Every supported compiler has to offer it or an equivalent.
- The non-compiler parts of every supported C implementation need to
cooperate. For example, CHAR_MIN must change in response to the flag. See
the first comment in cash_in().
- Libraries we depend on can't do anything incompatible with it.Given that, I would lean toward not using -fsigned-char. It's unlikely all
three things will hold. Even if they do, the benefit is not large.I am very, very strongly against deciding that Postgres will only
support one setting of char signedness. It's a step on the way to
hardware monoculture, and we know where that eventually leads.
(In other words, I categorically reject Sawada-san's assertion
that signed chars will become universal. I'd reject the opposite
assertion as well.)
Thank you for pointing this out. I agree with both of you.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Oct 2, 2024 at 10:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Oct 1, 2024 at 8:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
On Tue, Oct 01, 2024 at 11:55:48AM -0700, Masahiko Sawada wrote:
Considering that the population of database cluster signedness will
converge to signedness=true in the future, we can consider using
-fsigned-char to prevent similar problems for the future. We need to
think about possible side-effects as well, though.It's good to think about -fsigned-char. While I find it tempting, several
things would need to hold for us to benefit from it:- Every supported compiler has to offer it or an equivalent.
- The non-compiler parts of every supported C implementation need to
cooperate. For example, CHAR_MIN must change in response to the flag. See
the first comment in cash_in().
- Libraries we depend on can't do anything incompatible with it.Given that, I would lean toward not using -fsigned-char. It's unlikely all
three things will hold. Even if they do, the benefit is not large.I am very, very strongly against deciding that Postgres will only
support one setting of char signedness. It's a step on the way to
hardware monoculture, and we know where that eventually leads.
(In other words, I categorically reject Sawada-san's assertion
that signed chars will become universal. I'd reject the opposite
assertion as well.)Thank you for pointing this out. I agree with both of you.
I've attached PoC patches for the idea Noah proposed. Newly created
clusters unconditionally have default_char_signedness=true, and the
only source of signedness=false is pg_upgrade. To update the
signedness in the controlfile, pg_resetwal now has a new option
--char-signedness, which is used by pg_upgrade internally. Feedback is
very welcome.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0005-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchapplication/octet-stream; name=v2-0005-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchDownload
From 77b80116117a337b7163e65c54085e517b9d0e17 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 15:17:02 -0700
Subject: [PATCH v2 5/5] Fix an issue with index scan using pg_trgm due to char
signedness on different architectures.
GIN and GiST indexes utilizing pg_trgm's opclasses store sorted
trigrams within index tuples. When comparing and sorting each trigram,
pg_trgm treats each character as a 'char[3]' type in C. However, the
char type in C can be interpreted as either signed char or unsigned
char, depending on the platform, if the signedness is not explicitly
specified. Consequently, during replication between different CPU
architectures, there was an issue where index scans on standby servers
could not locate matching index tuples due to the differing treatment
of character signedness.
This change introduces comparison functions for trgm that explicitly
handle signed char and unsigned char. The appropriate comparison
function will be dynamically selected based on the character
signedness stored in the control file. Therefore, upgraded clusters
can utilize the indexes without rebuilding, provided the cluster
upgrade occurs on platforms with the same character signedness as the
original cluster initialization.
The default char signedness information was introduced in XXXX, so no
backpatch.
Reviewed-by:
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
contrib/pg_trgm/trgm.h | 4 +--
contrib/pg_trgm/trgm_op.c | 42 ++++++++++++++++++++++++++++
src/backend/storage/lmgr/predicate.c | 5 ++++
3 files changed, 49 insertions(+), 2 deletions(-)
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index afb0adb222..5534abe680 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -42,8 +42,6 @@
typedef char trgm[3];
#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
-#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
#define CPTRGM(a,b) do { \
*(((char*)(a))+0) = *(((char*)(b))+0); \
@@ -51,6 +49,8 @@ typedef char trgm[3];
*(((char*)(a))+2) = *(((char*)(b))+2); \
} while(0)
+extern int (*CMPTRGM) (const void *a, const void *b);
+
#ifdef KEEPONLYALNUM
#define ISWORDCHR(c) (t_isalnum(c))
#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index c509d15ee4..75e670edf6 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -40,6 +40,9 @@ PG_FUNCTION_INFO_V1(strict_word_similarity_commutator_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_commutator_op);
+static inline int CMPTRGM_CHOOSE(const void *a, const void *b);
+int (*CMPTRGM) (const void *a, const void *b) = CMPTRGM_CHOOSE;
+
/* Trigram with position */
typedef struct
{
@@ -105,6 +108,45 @@ _PG_init(void)
MarkGUCPrefixReserved("pg_trgm");
}
+/*
+ * Functions for comparing two trgms while treating each char as "signed char" or
+ * "unsigned char".
+ */
+static inline int
+CMPTRGM_SIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_S(a,b,i) CMPCHAR( *(((const signed char*)(a))+i), *(((const signed char*)(b))+i) )
+
+ return CMPPCHAR_S(a, b, 0) ? CMPPCHAR_S(a, b, 0)
+ : (CMPPCHAR_S(a, b, 1) ? CMPPCHAR_S(a, b, 1)
+ : CMPPCHAR_S(a, b, 2));
+}
+
+static inline int
+CMPTRGM_UNSIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_UNS(a,b,i) CMPCHAR( *(((const unsigned char*)(a))+i), *(((const unsigned char*)(b))+i) )
+
+ return CMPPCHAR_UNS(a, b, 0) ? CMPPCHAR_UNS(a, b, 0)
+ : (CMPPCHAR_UNS(a, b, 1) ? CMPPCHAR_UNS(a, b, 1)
+ : CMPPCHAR_UNS(a, b, 2));
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer so
+ * that subsequent calls are routed directly to the chosen implementation.
+ */
+static inline int
+CMPTRGM_CHOOSE(const void *a, const void *b)
+{
+ if (GetDefaultCharSignedness())
+ CMPTRGM = CMPTRGM_SIGNED;
+ else
+ CMPTRGM = CMPTRGM_UNSIGNED;
+
+ return CMPTRGM(a, b);
+}
+
/*
* Deprecated function.
* Use "pg_trgm.similarity_threshold" GUC variable instead of this function.
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e24a0f2fdb..2648c1e260 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2588,6 +2588,11 @@ PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot)
if (!SerializationNeededForRead(relation, snapshot))
return;
+ ereport(LOG, (errmsg("predicate %s blk %u",
+ RelationGetRelationName(relation),
+ blkno),
+ errbacktrace()));
+
SET_PREDICATELOCKTARGETTAG_PAGE(tag,
relation->rd_locator.dbOid,
relation->rd_id,
--
2.39.3
v2-0004-pg_upgrade-Add-set-char-signedness-to-set-the-def.patchapplication/octet-stream; name=v2-0004-pg_upgrade-Add-set-char-signedness-to-set-the-def.patchDownload
From a70c3b59afd6715fbbd566c3b66ca2c42dfdd25d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:12:27 -0700
Subject: [PATCH v2 4/5] pg_upgrade: Add --set-char-signedness to set the
default char signedness of new cluster.
This change adds a new option --set-char-signedness to pg_upgrade. It
enables user to set arbitrary signedness during pg_upgrade. This helps
cases where user who knew they copied the v17 source cluster from
x86 (signedness=true) to ARM (signedness=falese) can pg_upgrade
properly without the prerequisite of acquiring an x86 VM.
Reviewed-by:
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/ref/pgupgrade.sgml | 18 ++++++++++++++++++
src/bin/pg_upgrade/option.c | 12 +++++++++++-
src/bin/pg_upgrade/pg_upgrade.c | 7 ++++++-
src/bin/pg_upgrade/pg_upgrade.h | 3 +++
4 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 4777381dac..91c96d70d7 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -276,6 +276,24 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>--set-char-signedness=</option><replaceable>option</replaceable></term>
+ <listitem>
+ <para>
+ Manually set the default char signedness of new clusters. Possible values
+ are <literal>signed</literal> and <literal>unsigned</literal>.
+ </para>
+ <para>
+ The signedness of the 'char' type in C is implementation-dependent. For instance,
+ 'signed char' is used by default on x86 CPUs, while 'unsigned char' is used on aarch
+ CPUs. <application>pg_upgrade</application> automatically inherits the char
+ signedness from the old cluster. This option is useful for upgrading the cluster
+ that user knew they copied it to a platform having different char signedness
+ (e.g. from x86 to aarch).
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-?</option></term>
<term><option>--help</option></term>
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 6f41d63eed..9708e2e745 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
{"copy", no_argument, NULL, 2},
{"copy-file-range", no_argument, NULL, 3},
{"sync-method", required_argument, NULL, 4},
+ {"set-char-signedness", required_argument, NULL, 5},
{NULL, 0, NULL, 0}
};
@@ -70,6 +71,7 @@ parseCommandLine(int argc, char *argv[])
user_opts.do_sync = true;
user_opts.transfer_mode = TRANSFER_MODE_COPY;
+ user_opts.char_signedness = -1;
os_info.progname = get_progname(argv[0]);
@@ -211,7 +213,14 @@ parseCommandLine(int argc, char *argv[])
exit(1);
user_opts.sync_method = pg_strdup(optarg);
break;
-
+ case 5:
+ if (pg_strcasecmp(optarg, "signed") == 0)
+ user_opts.char_signedness = 1;
+ else if (pg_strcasecmp(optarg, "unsigned") == 0)
+ user_opts.char_signedness = 0;
+ else
+ pg_fatal("invalid argument for option %s", "--set-char-signedness");
+ break;
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
os_info.progname);
@@ -307,6 +316,7 @@ usage(void)
printf(_(" --copy copy files to new cluster (default)\n"));
printf(_(" --copy-file-range copy files to new cluster with copy_file_range\n"));
printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" --set-char-signedness=OPTION set new cluster char signedness to \"signed\" or \"unsigned\""));
printf(_(" -?, --help show this help, then exit\n"));
printf(_("\n"
"Before running pg_upgrade you must:\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 68cbac46ce..6be853672c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -398,7 +398,12 @@ set_new_cluster_char_signedness(void)
prep_status("Setting the default char signedness for new cluster");
- if (GET_MAJOR_VERSION(old_cluster.major_version) <= 17)
+ if (user_opts.char_signedness != -1)
+ {
+ /* Use the specified char signedness */
+ new_char_signedness = (user_opts.char_signedness == 1);
+ }
+ else if (GET_MAJOR_VERSION(old_cluster.major_version) <= 17)
{
/*
* Pre-v18 database clusters don't have the default char signedness
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index a2cbf2d6d9..2fbed64d7f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -328,6 +328,9 @@ typedef struct
int jobs; /* number of processes/threads to use */
char *socketdir; /* directory to use for Unix sockets */
char *sync_method;
+ int char_signedness; /* default char signedness: -1 for initial
+ * value, 1 for "signed" and 0 for
+ * "unsigned" */
} UserOpts;
typedef struct
--
2.39.3
v2-0001-Add-default_char_signedness-field-to-ControlFileD.patchapplication/octet-stream; name=v2-0001-Add-default_char_signedness-field-to-ControlFileD.patchDownload
From 5316978caae101be5d6bb4d7043d8e4b22532eb5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 11:08:57 -0700
Subject: [PATCH v2 1/5] Add default_char_signedness field to ControlFileData.
The signedness of the 'char' type in C is
implementation-dependent. For instance, 'signed char' is used by
default on x86 CPUs, while 'unsigned char' is used on aarch
CPUs. Previously, we accidentally let C implementation signedness
affect persistent data. This led to inconsistent results when
comparing char data across different platforms.
This commit introduces a new 'default_char_signedness' field in
ControlFileData to store the signedness of the 'char' type. While this
change does not encourage the use of 'char' without explicitly
specifying its signedness, this field can be used as a hint to ensure
consistent behavior for pre-v18 data files that store data sorted by
the 'char' type on disk (e.g., GIN and GiST indexes), especially in
cross-platform replication scenarios.
Newly created database clusters unconditionally set the default char
signedness to true. pg_upgrade (with an upcoming commit) changes this
flag for clusters if the source database cluster has
signedness=false. As a result, signedness=false setting will become
rare over time. If we had known about the problem during the last
development cycle that forced initdb (v8.3), we would have made all
clusters signed or all clusters unsigned. Making pg_upgrade the only
source of signedness=false will cause the population of database
clusters to converge toward that retrospective ideal.
XXX bump catversion.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/func.sgml | 5 ++++
src/backend/access/transam/xlog.c | 40 +++++++++++++++++++++++++
src/backend/utils/misc/pg_controldata.c | 7 +++--
src/bin/pg_controldata/pg_controldata.c | 2 ++
src/include/access/xlog.h | 1 +
src/include/catalog/pg_control.h | 6 ++++
src/include/catalog/pg_proc.dat | 6 ++--
7 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7b4fbb5047..3730f5190b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27858,6 +27858,11 @@ acl | {postgres=arwdDxtm/postgres,foo=r/postgres}
<entry><type>integer</type></entry>
</row>
+ <row>
+ <entry><structfield>signed_char</structfield></entry>
+ <entry><type>bool</type></entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9102c8d772..50bb9f3514 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4256,6 +4256,33 @@ WriteControlFile(void)
ControlFile->float8ByVal = FLOAT8PASSBYVAL;
+ /*
+ * Initialize the default 'char' signedness.
+ *
+ * The signedness of the char type is implementation-defined. For instance
+ * on x86 architecture CPUs, the char data type is typically treated as
+ * signed by default, whereas on aarch architecture CPUs, it is typically
+ * treated as unsigned by default. In v17 or earlier, we accidentally let
+ * C implementation signedness affect persistent data. This led to
+ * inconsistent results when comparing char data across different
+ * platforms.
+ *
+ * This flag can be used as a hint to ensure consistent behavior for
+ * pre-v18 data files that store data sorted by the 'char' type on disk,
+ * especially in cross-platform replication scenarios.
+ *
+ * Newly created database clusters unconditionally set the default char
+ * signedness to true. pg_upgrade changes this flag for clusters that were
+ * initialized on signedness=false platforms. As a result,
+ * signedness=false setting will become rare over time. will get rarer
+ * over time. If we had known about this problem during the last
+ * development cycle that forced initdb (v8.3), we would have made all
+ * clusters signed or all clusters unsigned. Making pg_upgrade the only
+ * source of signedness=false will cause the population of database
+ * clusters to converge toward that retrospective ideal.
+ */
+ ControlFile->default_char_signedness = true;
+
/* Contents are protected with a CRC */
INIT_CRC32C(ControlFile->crc);
COMP_CRC32C(ControlFile->crc,
@@ -4584,6 +4611,19 @@ DataChecksumsEnabled(void)
return (ControlFile->data_checksum_version > 0);
}
+/*
+ * Return true if the cluster was initialized on a platform where the
+ * default signedness of char is "signed". This function exists for code
+ * that deals with pre-v18 data files that store data sorted by the 'char'
+ * type on disk (e.g., GIN and GiST indexes). See the comments in
+ * WriteControlFile() for details.
+ */
+bool
+GetDefaultCharSignedness(void)
+{
+ return ControlFile->default_char_signedness;
+}
+
/*
* Returns a fake LSN for unlogged relations.
*
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 98c932dc7b..7c750bf0b9 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -203,8 +203,8 @@ pg_control_recovery(PG_FUNCTION_ARGS)
Datum
pg_control_init(PG_FUNCTION_ARGS)
{
- Datum values[11];
- bool nulls[11];
+ Datum values[12];
+ bool nulls[12];
TupleDesc tupdesc;
HeapTuple htup;
ControlFileData *ControlFile;
@@ -254,6 +254,9 @@ pg_control_init(PG_FUNCTION_ARGS)
values[10] = Int32GetDatum(ControlFile->data_checksum_version);
nulls[10] = false;
+ values[11] = BoolGetDatum(ControlFile->default_char_signedness);
+ nulls[11] = false;
+
htup = heap_form_tuple(tupdesc, values, nulls);
PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..ac74c2c5e9 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -325,6 +325,8 @@ main(int argc, char *argv[])
(ControlFile->float8ByVal ? _("by value") : _("by reference")));
printf(_("Data page checksum version: %u\n"),
ControlFile->data_checksum_version);
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile->default_char_signedness ? _("signed") : _("unsigned")));
printf(_("Mock authentication nonce: %s\n"),
mock_auth_nonce_str);
return 0;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067..45c2aef4b6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -231,6 +231,7 @@ extern XLogRecPtr GetXLogWriteRecPtr(void);
extern uint64 GetSystemIdentifier(void);
extern char *GetMockAuthenticationNonce(void);
extern bool DataChecksumsEnabled(void);
+extern bool GetDefaultCharSignedness(void);
extern XLogRecPtr GetFakeLSNForUnloggedRel(void);
extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e80ff8e414..3663790589 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -221,6 +221,12 @@ typedef struct ControlFileData
/* Are data pages protected by checksums? Zero if no checksum version */
uint32 data_checksum_version;
+ /*
+ * True if the default signedness of char is "signed" on a platform where
+ * the cluster is initialized.
+ */
+ bool default_char_signedness;
+
/*
* Random nonce, used in authentication requests that need to proceed
* based on values that are cluster-unique, like a SASL exchange that
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77f54a79e6..3245527598 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12128,9 +12128,9 @@
descr => 'pg_controldata init state information as a function',
proname => 'pg_control_init', provolatile => 'v', prorettype => 'record',
proargtypes => '',
- proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4}',
- proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version}',
+ proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4,bool}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version,default_char_signedness}',
prosrc => 'pg_control_init' },
# subscripting support for built-in types
--
2.39.3
v2-0003-pg_upgrade-Inherit-default-char-signedness-from-o.patchapplication/octet-stream; name=v2-0003-pg_upgrade-Inherit-default-char-signedness-from-o.patchDownload
From 49b24b4c67b43765d06ff699cd481c2abe094c8d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:11:23 -0700
Subject: [PATCH v2 3/5] pg_upgrade: Inherit default char signedness from old
cluster.
Commit XXX introduced the 'default_char_signedness' field in
controlfile. Newly created database clusters always set this field to
'signed'.
This change ensures that pg_upgrade updates the
'default_char_signedness' to 'unsigned' if the source database cluster
has signedness=false. For source clusters from v17 or earlier, which
lack the 'default_char_signedness' information, pg_upgrade assumes the
source cluster was initialized on the same platform where pg_upgrae is
running. It then sets the 'default_char_signedness' value according to
the current platform's default character signedness.
Reviewed-by:
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
src/bin/pg_upgrade/controldata.c | 26 ++++++++++++++++++++-
src/bin/pg_upgrade/pg_upgrade.c | 40 ++++++++++++++++++++++++++++++++
src/bin/pg_upgrade/pg_upgrade.h | 1 +
3 files changed, 66 insertions(+), 1 deletion(-)
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index 854c6887a2..6c63b25647 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -62,6 +62,7 @@ get_control_data(ClusterInfo *cluster)
bool got_date_is_int = false;
bool got_data_checksum_version = false;
bool got_cluster_state = false;
+ bool got_default_char_signedness = false;
char *lc_collate = NULL;
char *lc_ctype = NULL;
char *lc_monetary = NULL;
@@ -206,6 +207,13 @@ get_control_data(ClusterInfo *cluster)
got_data_checksum_version = true;
}
+ /* Only in <= 17 */
+ if (GET_MAJOR_VERSION(cluster->major_version) <= 1700)
+ {
+ cluster->controldata.default_char_signedness = true;
+ got_default_char_signedness = true;
+ }
+
/* we have the result of cmd in "output". so parse it line by line now */
while (fgets(bufin, sizeof(bufin), output))
{
@@ -501,6 +509,17 @@ get_control_data(ClusterInfo *cluster)
cluster->controldata.data_checksum_version = str2uint(p);
got_data_checksum_version = true;
}
+ else if ((p = strstr(bufin, "char data signedness:")) != NULL)
+ {
+ p = strchr(p, ':');
+
+ if (p == NULL || strlen(p) <= 1)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ p++; /* remove ':' char */
+ cluster->controldata.default_char_signedness = strstr(p, "signed") != NULL;
+ got_default_char_signedness = true;
+ }
}
rc = pclose(output);
@@ -572,7 +591,8 @@ get_control_data(ClusterInfo *cluster)
!got_index || !got_toast ||
(!got_large_object &&
cluster->controldata.ctrl_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) ||
- !got_date_is_int || !got_data_checksum_version)
+ !got_date_is_int || !got_data_checksum_version ||
+ !got_default_char_signedness)
{
if (cluster == &old_cluster)
pg_log(PG_REPORT,
@@ -641,6 +661,10 @@ get_control_data(ClusterInfo *cluster)
if (!got_data_checksum_version)
pg_log(PG_REPORT, " data checksum version");
+ /* value added in Postgres 18 */
+ if (!got_default_char_signedness)
+ pg_log(PG_REPORT, " default char signedness");
+
pg_fatal("Cannot continue without required control information, terminating");
}
}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..68cbac46ce 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -39,6 +39,7 @@
#include "postgres_fe.h"
#include <time.h>
+#include <limits.h> /* for CHAR_MIN */
#include "catalog/pg_class_d.h"
#include "common/file_perm.h"
@@ -54,6 +55,7 @@
*/
#define RESTORE_TRANSACTION_SIZE 1000
+static void set_new_cluster_char_signedness(void);
static void set_locale_and_encoding(void);
static void prepare_new_cluster(void);
static void prepare_new_globals(void);
@@ -154,6 +156,7 @@ main(int argc, char **argv)
*/
copy_xact_xlog_xid();
+ set_new_cluster_char_signedness();
/* New now using xids of the old system */
@@ -388,6 +391,43 @@ setup(char *argv0)
}
}
+static void
+set_new_cluster_char_signedness(void)
+{
+ bool new_char_signedness;
+
+ prep_status("Setting the default char signedness for new cluster");
+
+ if (GET_MAJOR_VERSION(old_cluster.major_version) <= 17)
+ {
+ /*
+ * Pre-v18 database clusters don't have the default char signedness
+ * information. We use the char signedness of the platform where
+ * pg_upgrade was built.
+ */
+#if CHAR_MIN != 0
+ new_char_signedness = true;
+#else
+ new_char_signedness = false;
+#endif
+ }
+ else
+ {
+ /* Set the source database signedness */
+ new_char_signedness = old_cluster.controldata.default_char_signedness;
+ }
+
+ /* Change the char signedness of the new cluster, if necessary */
+ if (new_cluster.controldata.default_char_signedness != new_char_signedness)
+ {
+ exec_prog(UTILITY_LOG_FILE, NULL, true, true,
+ "\"%s/pg_resetwal\" --char-signedness %s \"%s\"",
+ new_cluster.bindir,
+ new_char_signedness ? "signed" : "unsigned",
+ new_cluster.pgdata);
+ check_ok();
+ }
+}
/*
* Copy locale and encoding information into the new cluster's template0.
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..a2cbf2d6d9 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -245,6 +245,7 @@ typedef struct
bool date_is_int;
bool float8_pass_by_value;
uint32 data_checksum_version;
+ bool default_char_signedness;
} ControlData;
/*
--
2.39.3
v2-0002-pg_resetwal-Add-char-signedness-option-to-change-.patchapplication/octet-stream; name=v2-0002-pg_resetwal-Add-char-signedness-option-to-change-.patchDownload
From d32cb3c21734cdebe2ea1d41f8fa9999f2fe257c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:08:26 -0700
Subject: [PATCH v2 2/5] pg_resetwal: Add --char-signedness option to change
the default char signedness.
With the newly added option --char-signedness, pg_resetwal updates the
default char signeness flag in the controlfile.
Reviewed-by:
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/ref/pg_resetwal.sgml | 10 ++++++++++
src/bin/pg_resetwal/pg_resetwal.c | 25 +++++++++++++++++++++++++
src/bin/pg_resetwal/t/001_basic.pl | 6 ++++++
3 files changed, 41 insertions(+)
diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml
index cf9c7e70f2..f1a349d364 100644
--- a/doc/src/sgml/ref/pg_resetwal.sgml
+++ b/doc/src/sgml/ref/pg_resetwal.sgml
@@ -171,6 +171,16 @@ PostgreSQL documentation
</para>
<variablelist>
+ <varlistentry>
+ <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term>
+ <listitem>
+ <para>
+ Manually set the default char signedness. Possible values are
+ <literal>signed</literal> and <literal>unsigned</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-c <replaceable class="parameter">xid</replaceable>,<replaceable class="parameter">xid</replaceable></option></term>
<term><option>--commit-timestamp-ids=<replaceable class="parameter">xid</replaceable>,<replaceable class="parameter">xid</replaceable></option></term>
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..55eb584d0a 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -75,6 +75,7 @@ static TimeLineID minXlogTli = 0;
static XLogSegNo minXlogSegNo = 0;
static int WalSegSz;
static int set_wal_segsize;
+static int set_char_signedness = -1;
static void CheckDataVersion(void);
static bool read_controlfile(void);
@@ -106,6 +107,7 @@ main(int argc, char *argv[])
{"oldest-transaction-id", required_argument, NULL, 'u'},
{"next-transaction-id", required_argument, NULL, 'x'},
{"wal-segsize", required_argument, NULL, 1},
+ {"char-signedness", required_argument, NULL, 2},
{NULL, 0, NULL, 0}
};
@@ -302,6 +304,23 @@ main(int argc, char *argv[])
break;
}
+ case 2:
+ {
+ errno = 0;
+
+ if (pg_strcasecmp(optarg, "signed") == 0)
+ set_char_signedness = 1;
+ else if (pg_strcasecmp(optarg, "unsigned") == 0)
+ set_char_signedness = 0;
+ else
+ {
+ pg_log_error("invalid argument for option %s", "--char-signedness");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ break;
+ }
+
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -456,6 +475,9 @@ main(int argc, char *argv[])
if (set_wal_segsize != 0)
ControlFile.xlog_seg_size = WalSegSz;
+ if (set_char_signedness != -1)
+ ControlFile.default_char_signedness = (set_char_signedness == 1);
+
if (minXlogSegNo > newXlogSegNo)
newXlogSegNo = minXlogSegNo;
@@ -779,6 +801,8 @@ PrintControlValues(bool guessed)
(ControlFile.float8ByVal ? _("by value") : _("by reference")));
printf(_("Data page checksum version: %u\n"),
ControlFile.data_checksum_version);
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile.default_char_signedness ? _("signed") : _("unsigned")));
}
@@ -1189,6 +1213,7 @@ usage(void)
printf(_(" -u, --oldest-transaction-id=XID set oldest transaction ID\n"));
printf(_(" -x, --next-transaction-id=XID set next transaction ID\n"));
printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n"));
+ printf(_(" --char-signedness=OPTION set char signedness to \"signed\" or \"unsigned\"\n"));
printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..3ec3302804 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -172,6 +172,12 @@ command_fails_like(
qr/must be greater than/,
'fails with -x value too small');
+# --char-signedness
+command_fails_like(
+ [ 'pg_resetwal', '--char-signedness', 'foo', $node->data_dir ],
+ qr/error: invalid argument for option --char-signedness/,
+ 'fails with incorrect --char-signedness option');
+
# run with control override options
my $out = (run_command([ 'pg_resetwal', '-n', $node->data_dir ]))[0];
--
2.39.3
On Thu, Oct 03, 2024 at 06:55:47AM -0700, Masahiko Sawada wrote:
I've attached PoC patches for the idea Noah proposed. Newly created
clusters unconditionally have default_char_signedness=true, and the
only source of signedness=false is pg_upgrade. To update the
signedness in the controlfile, pg_resetwal now has a new option
--char-signedness, which is used by pg_upgrade internally. Feedback is
very welcome.
Upthread, we discussed testability. Does pg_resetwal facilitate all
appropriate testing, or do testing difficulties remain?
I reviewed these patches, finding only one non-cosmetic review comment. Given
the PoC status, some of the observations below are likely ones you already
know or would have found before exiting PoC.
--- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -27858,6 +27858,11 @@ acl | {postgres=arwdDxtm/postgres,foo=r/postgres} <entry><type>integer</type></entry> </row>+ <row> + <entry><structfield>signed_char</structfield></entry> + <entry><type>bool</type></entry>
s/signed_char/default_char_signedness/ to align with the naming below.
--- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4256,6 +4256,33 @@ WriteControlFile(void)
+ * Newly created database clusters unconditionally set the default char + * signedness to true. pg_upgrade changes this flag for clusters that were + * initialized on signedness=false platforms. As a result, + * signedness=false setting will become rare over time. will get rarer + * over time.
There's a redundant fragment of sentence.
--- a/doc/src/sgml/ref/pg_resetwal.sgml +++ b/doc/src/sgml/ref/pg_resetwal.sgml @@ -171,6 +171,16 @@ PostgreSQL documentation </para><variablelist> + <varlistentry> + <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term> + <listitem> + <para> + Manually set the default char signedness. Possible values are + <literal>signed</literal> and <literal>unsigned</literal>. + </para> + </listitem> + </varlistentry>
This needs more docs, like its <varlistentry> neighbors have. First, see the
point below about the pg_upgrade docs.
--- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -1189,6 +1213,7 @@ usage(void) printf(_(" -u, --oldest-transaction-id=XID set oldest transaction ID\n")); printf(_(" -x, --next-transaction-id=XID set next transaction ID\n")); printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n")); + printf(_(" --char-signedness=OPTION set char signedness to \"signed\" or \"unsigned\"\n"));
This help output alphabetizes by short option, with the one non-short option
at the end. So I'd also alphabetize among the non-short options, specifically
putting the new line before --wal-segsize.
source cluster was initialized on the same platform where pg_upgrae is
Typo "pg_upgrade"
--- a/src/bin/pg_upgrade/controldata.c +++ b/src/bin/pg_upgrade/controldata.c @@ -62,6 +62,7 @@ get_control_data(ClusterInfo *cluster) bool got_date_is_int = false; bool got_data_checksum_version = false; bool got_cluster_state = false; + bool got_default_char_signedness = false; char *lc_collate = NULL; char *lc_ctype = NULL; char *lc_monetary = NULL; @@ -206,6 +207,13 @@ get_control_data(ClusterInfo *cluster) got_data_checksum_version = true; }+ /* Only in <= 17 */ + if (GET_MAJOR_VERSION(cluster->major_version) <= 1700) + { + cluster->controldata.default_char_signedness = true;
If anything reads the "true" stored here, that would be a bug. Is there a
reasonable way do one or two of the following?
1. not initialize it at all
2. store something like -1 instead
3. add a comment that it's never read
+ got_default_char_signedness = true;
+ }
I wouldn't say we "got" this. I'd handle this more like got_oldestmulti,
where we know !got_oldestmulti is okay until MULTIXACT_FORMATCHANGE_CAT_VER.
Maybe also replace the "<= 1700" checks with catversion checks.
+ /* we have the result of cmd in "output". so parse it line by line now */ while (fgets(bufin, sizeof(bufin), output)) { @@ -501,6 +509,17 @@ get_control_data(ClusterInfo *cluster) cluster->controldata.data_checksum_version = str2uint(p); got_data_checksum_version = true; } + else if ((p = strstr(bufin, "char data signedness:")) != NULL) + { + p = strchr(p, ':'); + + if (p == NULL || strlen(p) <= 1) + pg_fatal("%d: controldata retrieval problem", __LINE__); + + p++; /* remove ':' char */ + cluster->controldata.default_char_signedness = strstr(p, "signed") != NULL;
Won't this match "unsigned", too? (This is my only non-cosmetic review
comment.) I'd strcmp for the two expected values. If neither matches, report
the pg_fatal "retrieval problem".
--- a/doc/src/sgml/ref/pgupgrade.sgml +++ b/doc/src/sgml/ref/pgupgrade.sgml @@ -276,6 +276,24 @@ PostgreSQL documentation </listitem> </varlistentry>+ <varlistentry> + <term><option>--set-char-signedness=</option><replaceable>option</replaceable></term> + <listitem> + <para> + Manually set the default char signedness of new clusters. Possible values + are <literal>signed</literal> and <literal>unsigned</literal>. + </para> + <para> + The signedness of the 'char' type in C is implementation-dependent. For instance, + 'signed char' is used by default on x86 CPUs, while 'unsigned char' is used on aarch + CPUs. <application>pg_upgrade</application> automatically inherits the char + signedness from the old cluster. This option is useful for upgrading the cluster + that user knew they copied it to a platform having different char signedness + (e.g. from x86 to aarch).
That last sentence would need some grammar help. Instead, let's rewrite this
paragraph to assume the reader is not a C programmer. Focus on what such a
reader should do to choose what argument, if any, to pass.
Our docs haven't used the term "aarch". This applies to 32-bit ARM in
addition to 64-bit ARM. Hence, I'd write the term "ARM" instead.
--- a/src/bin/pg_upgrade/option.c +++ b/src/bin/pg_upgrade/option.c @@ -307,6 +316,7 @@ usage(void) printf(_(" --copy copy files to new cluster (default)\n")); printf(_(" --copy-file-range copy files to new cluster with copy_file_range\n")); printf(_(" --sync-method=METHOD set method for syncing files to disk\n")); + printf(_(" --set-char-signedness=OPTION set new cluster char signedness to \"signed\" or \"unsigned\""));
Move this up one line, to preserve alphabetization.
--- a/src/backend/storage/lmgr/predicate.c +++ b/src/backend/storage/lmgr/predicate.c @@ -2588,6 +2588,11 @@ PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot) if (!SerializationNeededForRead(relation, snapshot)) return;+ ereport(LOG, (errmsg("predicate %s blk %u", + RelationGetRelationName(relation), + blkno), + errbacktrace()));
Revert this file.
Thanks,
nm
Hi,
I apologize for the late response. I completely missed tracking
updates on this thread.
On Tue, Nov 19, 2024 at 5:28 PM Noah Misch <noah@leadboat.com> wrote:
On Thu, Oct 03, 2024 at 06:55:47AM -0700, Masahiko Sawada wrote:
I've attached PoC patches for the idea Noah proposed. Newly created
clusters unconditionally have default_char_signedness=true, and the
only source of signedness=false is pg_upgrade. To update the
signedness in the controlfile, pg_resetwal now has a new option
--char-signedness, which is used by pg_upgrade internally. Feedback is
very welcome.Upthread, we discussed testability. Does pg_resetwal facilitate all
appropriate testing, or do testing difficulties remain?
I think the new pg_resetwal facility to change the cluster's char
signedness is useful for testing. We can easily test cross
architecture replication from now on.
I reviewed these patches, finding only one non-cosmetic review comment. Given
the PoC status, some of the observations below are likely ones you already
know or would have found before exiting PoC.
Thank you for reviewing the patches. I agree with all the comments you
made. I've addressed them and I've attached new version patches that
now have some regression tests.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0004-pg_upgrade-Add-set-char-signedness-to-set-the-def.patchapplication/octet-stream; name=v3-0004-pg_upgrade-Add-set-char-signedness-to-set-the-def.patchDownload
From 4ca3e79e0712ff0eaf5ce3e2bc36f6fffa45d3f3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:12:27 -0700
Subject: [PATCH v3 4/5] pg_upgrade: Add --set-char-signedness to set the
default char signedness of new cluster.
This change adds a new option --set-char-signedness to pg_upgrade. It
enables user to set arbitrary signedness during pg_upgrade. This helps
cases where user who knew they copied the v17 source cluster from
x86 (signedness=true) to ARM (signedness=falese) can pg_upgrade
properly without the prerequisite of acquiring an x86 VM.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/ref/pgupgrade.sgml | 48 +++++++++++++++++++++
src/bin/pg_upgrade/option.c | 12 ++++++
src/bin/pg_upgrade/pg_upgrade.c | 10 ++++-
src/bin/pg_upgrade/pg_upgrade.h | 3 ++
src/bin/pg_upgrade/t/005_char_signedness.pl | 27 ++++++++++++
5 files changed, 98 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 4777381dac2..262632a9a3e 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -276,6 +276,54 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>--set-char-signedness=</option><replaceable>option</replaceable></term>
+ <listitem>
+ <para>
+ Manually set the default char signedness of new clusters. Possible values
+ are <literal>signed</literal> and <literal>unsigned</literal>.
+ </para>
+ <para>
+ In the C language, the default signedness of the <type>char</type> type
+ (when not explicitly specified) varies across platforms. For example,
+ <type>char</type> defaults to <type>signed char</type> on x86 CPUs but
+ to <type>unsigned char</type> on ARM CPUs. When data stored using the
+ <type>char</type> type is persisted to disk, such as in GIN indexes,
+ this platform-dependent behavior results in incorrect data comparisons
+ in two scenarios:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ When data is moved between platforms with different char signedness.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ When data is replicated using streaming replication across different architectures.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ Starting from <productname>PostgreSQL</productname> 18, database clusters
+ maintain their own default char signedness setting, which can be used as
+ a hint to ensure consistent behavior across platforms with different
+ default char signedness. By default, <application>pg_upgrade</application>
+ preserves the char signedness setting when upgrading from an existing cluster.
+ However, when upgrading from <productname>PostgreSQL</productname> 17 or
+ earlier, <application>pg_upgrade</application> adopts the char signedness
+ of the platform on which it was built.
+ </para>
+ <para>
+ This option allows you to explicitly set the default char signedness for
+ the new cluster, overriding any inherited values. This is particularly useful
+ when you plan to migrate the upgraded cluster to a platform with different
+ char signedness (for example, when moving from an x86-based system to an
+ ARM-based system).
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-?</option></term>
<term><option>--help</option></term>
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 108eb7a1ba4..1a580d656bb 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
{"copy", no_argument, NULL, 2},
{"copy-file-range", no_argument, NULL, 3},
{"sync-method", required_argument, NULL, 4},
+ {"set-char-signedness", required_argument, NULL, 5},
{NULL, 0, NULL, 0}
};
@@ -70,6 +71,7 @@ parseCommandLine(int argc, char *argv[])
user_opts.do_sync = true;
user_opts.transfer_mode = TRANSFER_MODE_COPY;
+ user_opts.char_signedness = -1;
os_info.progname = get_progname(argv[0]);
@@ -212,6 +214,14 @@ parseCommandLine(int argc, char *argv[])
user_opts.sync_method = pg_strdup(optarg);
break;
+ case 5:
+ if (pg_strcasecmp(optarg, "signed") == 0)
+ user_opts.char_signedness = 1;
+ else if (pg_strcasecmp(optarg, "unsigned") == 0)
+ user_opts.char_signedness = 0;
+ else
+ pg_fatal("invalid argument for option %s", "--set-char-signedness");
+ break;
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
os_info.progname);
@@ -306,6 +316,8 @@ usage(void)
printf(_(" --clone clone instead of copying files to new cluster\n"));
printf(_(" --copy copy files to new cluster (default)\n"));
printf(_(" --copy-file-range copy files to new cluster with copy_file_range\n"));
+ printf(_(" --set-char-signedness=OPTION set new cluster char signedness to \"signed\" or\n"));
+ printf(_(" \"unsigned\"\n"));
printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
printf(_(" -?, --help show this help, then exit\n"));
printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index cc7357b5599..cd8e72f7bde 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -399,8 +399,14 @@ set_new_cluster_char_signedness(void)
{
bool new_char_signedness;
- /* Inherit the source database's signedness */
- new_char_signedness = old_cluster.controldata.default_char_signedness;
+ /*
+ * Use the specified char signedness if specifies. Otherwise we inherit
+ * inherit the source database's signedness.
+ */
+ if (user_opts.char_signedness != -1)
+ new_char_signedness = (user_opts.char_signedness == 1);
+ else
+ new_char_signedness = old_cluster.controldata.default_char_signedness;
/* Change the char signedness of the new cluster, if necessary */
if (new_cluster.controldata.default_char_signedness != new_char_signedness)
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index ee65cf795b7..ef86c7f3420 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -334,6 +334,9 @@ typedef struct
int jobs; /* number of processes/threads to use */
char *socketdir; /* directory to use for Unix sockets */
char *sync_method;
+ int char_signedness; /* default char signedness: -1 for initial
+ * value, 1 for "signed" and 0 for
+ * "unsigned" */
} UserOpts;
typedef struct
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
index 780ad71406d..a2236df776e 100644
--- a/src/bin/pg_upgrade/t/005_char_signedness.pl
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -63,4 +63,31 @@ command_like(
qr/Default char data signedness:\s+unsigned/,
'the default char signedness is updated during pg_upgrade');
+# Setup another set of old and new clusters.
+my $old2 = PostgreSQL::Test::Cluster->new('old2');
+my $new2 = PostgreSQL::Test::Cluster->new('new2');
+$old2->init();
+$new2->init();
+
+# pg_upgrade should be successful.
+command_ok(
+ [ 'pg_upgrade', '--no-sync',
+ '-d', $old->data_dir,
+ '-D', $new->data_dir,
+ '-b', $old->config_data('--bindir'),
+ '-B', $new->config_data('--bindir'),
+ '-s', $new->host,
+ '-p', $old->port,
+ '-P', $new->port,
+ '--set-char-signedness', 'unsigned',
+ $mode ],
+ 'run of pg_upgrade with --set-char-signedness option');
+
+# Check if --set-char-signedness successfully sets the new cluster's
+# default char signedness.
+command_like(
+ [ 'pg_controldata', $new->data_dir ],
+ qr/Default char data signedness:\s+unsigned/,
+ '--set-char-signedness sets unsigned');
+
done_testing();
--
2.43.5
v3-0005-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchapplication/octet-stream; name=v3-0005-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchDownload
From 07c2bca594d4a5fe28cf30345a2d65c825e63643 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 15:17:02 -0700
Subject: [PATCH v3 5/5] Fix an issue with index scan using pg_trgm due to char
signedness on different architectures.
GIN and GiST indexes utilizing pg_trgm's opclasses store sorted
trigrams within index tuples. When comparing and sorting each trigram,
pg_trgm treats each character as a 'char[3]' type in C. However, the
char type in C can be interpreted as either signed char or unsigned
char, depending on the platform, if the signedness is not explicitly
specified. Consequently, during replication between different CPU
architectures, there was an issue where index scans on standby servers
could not locate matching index tuples due to the differing treatment
of character signedness.
This change introduces comparison functions for trgm that explicitly
handle signed char and unsigned char. The appropriate comparison
function will be dynamically selected based on the character
signedness stored in the control file. Therefore, upgraded clusters
can utilize the indexes without rebuilding, provided the cluster
upgrade occurs on platforms with the same character signedness as the
original cluster initialization.
The default char signedness information was introduced in XXXX, so no
backpatch.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
contrib/pg_trgm/trgm.h | 4 +---
contrib/pg_trgm/trgm_op.c | 42 +++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index 10827563694..691712ed105 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -41,14 +41,12 @@
typedef char trgm[3];
#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
-#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
-
#define CPTRGM(a,b) do { \
*(((char*)(a))+0) = *(((char*)(b))+0); \
*(((char*)(a))+1) = *(((char*)(b))+1); \
*(((char*)(a))+2) = *(((char*)(b))+2); \
} while(0)
+extern int (*CMPTRGM) (const void *a, const void *b);
#define ISWORDCHR(c) (t_isalnum(c))
#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index d0833b3e4a1..3549045b679 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -42,6 +42,9 @@ PG_FUNCTION_INFO_V1(strict_word_similarity_commutator_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_commutator_op);
+static inline int CMPTRGM_CHOOSE(const void *a, const void *b);
+int (*CMPTRGM) (const void *a, const void *b) = CMPTRGM_CHOOSE;
+
/* Trigram with position */
typedef struct
{
@@ -107,6 +110,45 @@ _PG_init(void)
MarkGUCPrefixReserved("pg_trgm");
}
+/*
+ * Functions for comparing two trgms while treating each char as "signed char" or
+ * "unsigned char".
+ */
+static inline int
+CMPTRGM_SIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_S(a,b,i) CMPCHAR( *(((const signed char*)(a))+i), *(((const signed char*)(b))+i) )
+
+ return CMPPCHAR_S(a, b, 0) ? CMPPCHAR_S(a, b, 0)
+ : (CMPPCHAR_S(a, b, 1) ? CMPPCHAR_S(a, b, 1)
+ : CMPPCHAR_S(a, b, 2));
+}
+
+static inline int
+CMPTRGM_UNSIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_UNS(a,b,i) CMPCHAR( *(((const unsigned char*)(a))+i), *(((const unsigned char*)(b))+i) )
+
+ return CMPPCHAR_UNS(a, b, 0) ? CMPPCHAR_UNS(a, b, 0)
+ : (CMPPCHAR_UNS(a, b, 1) ? CMPPCHAR_UNS(a, b, 1)
+ : CMPPCHAR_UNS(a, b, 2));
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer so
+ * that subsequent calls are routed directly to the chosen implementation.
+ */
+static inline int
+CMPTRGM_CHOOSE(const void *a, const void *b)
+{
+ if (GetDefaultCharSignedness())
+ CMPTRGM = CMPTRGM_SIGNED;
+ else
+ CMPTRGM = CMPTRGM_UNSIGNED;
+
+ return CMPTRGM(a, b);
+}
+
/*
* Deprecated function.
* Use "pg_trgm.similarity_threshold" GUC variable instead of this function.
--
2.43.5
v3-0003-pg_upgrade-Preserve-default-char-signedness-value.patchapplication/octet-stream; name=v3-0003-pg_upgrade-Preserve-default-char-signedness-value.patchDownload
From d98b3d8145fed54312158432ad1a2e953d4c7736 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:11:23 -0700
Subject: [PATCH v3 3/5] pg_upgrade: Preserve default char signedness value
from old cluster.
Commit XXX introduced the 'default_char_signedness' field in
controlfile. Newly created database clusters always set this field to
'signed'.
This change ensures that pg_upgrade updates the
'default_char_signedness' to 'unsigned' if the source database cluster
has signedness=false. For source clusters from v17 or earlier, which
lack the 'default_char_signedness' information, pg_upgrade assumes the
source cluster was initialized on the same platform where pg_upgrade
is running. It then sets the 'default_char_signedness' value according
to the current platform's default character signedness.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
src/bin/pg_upgrade/controldata.c | 44 +++++++++++++-
src/bin/pg_upgrade/pg_upgrade.c | 28 +++++++++
src/bin/pg_upgrade/pg_upgrade.h | 7 +++
src/bin/pg_upgrade/t/005_char_signedness.pl | 66 +++++++++++++++++++++
4 files changed, 144 insertions(+), 1 deletion(-)
create mode 100644 src/bin/pg_upgrade/t/005_char_signedness.pl
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index ab16716f319..0cfc5c2233d 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include <ctype.h>
+#include <limits.h> /* for CHAR_MIN */
#include "common/string.h"
#include "pg_upgrade.h"
@@ -62,6 +63,7 @@ get_control_data(ClusterInfo *cluster)
bool got_date_is_int = false;
bool got_data_checksum_version = false;
bool got_cluster_state = false;
+ bool got_default_char_signedness = false;
char *lc_collate = NULL;
char *lc_ctype = NULL;
char *lc_monetary = NULL;
@@ -501,6 +503,25 @@ get_control_data(ClusterInfo *cluster)
cluster->controldata.data_checksum_version = str2uint(p);
got_data_checksum_version = true;
}
+ else if ((p = strstr(bufin, "Default char data signedness:")) != NULL)
+ {
+ p = strchr(p, ':');
+
+ if (p == NULL || strlen(p) <= 1)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ /* Skip the colon and any whitespace after it */
+ p++;
+ while (isspace((unsigned char) *p))
+ p++;
+
+ /* The value should be either 'signed' or 'unsigned' */
+ if (strcmp(p, "signed") != 0 && strcmp(p, "unsigned") != 0)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ cluster->controldata.default_char_signedness = strcmp(p, "signed") == 0;
+ got_default_char_signedness = true;
+ }
}
rc = pclose(output);
@@ -561,6 +582,21 @@ get_control_data(ClusterInfo *cluster)
}
}
+ /*
+ * Pre-v18 database clusters don't have the default char signedness
+ * information. We use the char signedness of the platform where
+ * pg_upgrade was built.
+ */
+ if (cluster->controldata.cat_ver < DEFAULT_CHAR_SIGNEDNESS_VAT_VER)
+ {
+ Assert(!got_default_char_signedness);
+#if CHAR_MIN != 0
+ cluster->controldata.default_char_signedness = true;
+#else
+ cluster->controldata.default_char_signedness = false;
+#endif
+ }
+
/* verify that we got all the mandatory pg_control data */
if (!got_xid || !got_oid ||
!got_multi || !got_oldestxid ||
@@ -572,7 +608,9 @@ get_control_data(ClusterInfo *cluster)
!got_index || !got_toast ||
(!got_large_object &&
cluster->controldata.ctrl_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) ||
- !got_date_is_int || !got_data_checksum_version)
+ !got_date_is_int || !got_data_checksum_version ||
+ (!got_default_char_signedness &&
+ cluster->controldata.cat_ver >= DEFAULT_CHAR_SIGNEDNESS_VAT_VER))
{
if (cluster == &old_cluster)
pg_log(PG_REPORT,
@@ -641,6 +679,10 @@ get_control_data(ClusterInfo *cluster)
if (!got_data_checksum_version)
pg_log(PG_REPORT, " data checksum version");
+ /* value added in Postgres 18 */
+ if (!got_default_char_signedness)
+ pg_log(PG_REPORT, " default char signedness");
+
pg_fatal("Cannot continue without required control information, terminating");
}
}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 36c7f3879d5..cc7357b5599 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -54,6 +54,7 @@
*/
#define RESTORE_TRANSACTION_SIZE 1000
+static void set_new_cluster_char_signedness(void);
static void set_locale_and_encoding(void);
static void prepare_new_cluster(void);
static void prepare_new_globals(void);
@@ -154,6 +155,7 @@ main(int argc, char **argv)
*/
copy_xact_xlog_xid();
+ set_new_cluster_char_signedness();
/* New now using xids of the old system */
@@ -388,6 +390,32 @@ setup(char *argv0)
}
}
+/*
+ * Set the new cluster's default char signedness using the old cluster's
+ * value.
+ */
+static void
+set_new_cluster_char_signedness(void)
+{
+ bool new_char_signedness;
+
+ /* Inherit the source database's signedness */
+ new_char_signedness = old_cluster.controldata.default_char_signedness;
+
+ /* Change the char signedness of the new cluster, if necessary */
+ if (new_cluster.controldata.default_char_signedness != new_char_signedness)
+ {
+ prep_status("Setting the default char signedness for new cluster");
+
+ exec_prog(UTILITY_LOG_FILE, NULL, true, true,
+ "\"%s/pg_resetwal\" --char-signedness %s \"%s\"",
+ new_cluster.bindir,
+ new_char_signedness ? "signed" : "unsigned",
+ new_cluster.pgdata);
+
+ check_ok();
+ }
+}
/*
* Copy locale and encoding information into the new cluster's template0.
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 0cdd675e4f1..ee65cf795b7 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -125,6 +125,12 @@ extern char *output_files[];
*/
#define JSONB_FORMAT_CHANGE_CAT_VER 201409291
+/*
+ * The control file was changed to have the default char signedness,
+ * commit XXXXX.
+ */
+#define DEFAULT_CHAR_SIGNEDNESS_VAT_VER 202501161
+
/*
* Each relation is represented by a relinfo structure.
@@ -245,6 +251,7 @@ typedef struct
bool date_is_int;
bool float8_pass_by_value;
uint32 data_checksum_version;
+ bool default_char_signedness;
} ControlData;
/*
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
new file mode 100644
index 00000000000..780ad71406d
--- /dev/null
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for handling the default char signedness during upgrade.
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Can be changed to test the other modes
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Initialize old and new clusters
+my $old = PostgreSQL::Test::Cluster->new('old');
+my $new = PostgreSQL::Test::Cluster->new('new');
+$old->init();
+$new->init();
+
+# Check the default char signedness of both the old and the new clusters.
+# Newly created clusters unconditionally use 'signed'.
+command_like(
+ [ 'pg_controldata', $old->data_dir ],
+ qr/Default char data signedness:\s+signed/,
+ 'default char signedness of old cluster is signed in control file');
+command_like(
+ [ 'pg_controldata', $new->data_dir ],
+ qr/Default char data signedness:\s+signed/,
+ 'default char signedness is new cluster signed in control file');
+
+# Set the old cluster's default char signedness to unsigned for test.
+command_ok(
+ [
+ 'pg_resetwal', '--char-signedness', 'unsigned',
+ '-f', $old->data_dir
+ ],
+ "set old cluster's default char signeness to unsigned");
+
+# Check if the value is successfully updated.
+command_like(
+ [ 'pg_controldata', $old->data_dir ],
+ qr/Default char data signedness:\s+unsigned/,
+ 'updated default char signedness is unsigned in control file');
+
+# pg_upgrade should be successful.
+command_ok(
+ [ 'pg_upgrade', '--no-sync',
+ '-d', $old->data_dir,
+ '-D', $new->data_dir,
+ '-b', $old->config_data('--bindir'),
+ '-B', $new->config_data('--bindir'),
+ '-s', $new->host,
+ '-p', $old->port,
+ '-P', $new->port,
+ $mode ],
+ 'run of pg_upgrade');
+
+# Check if the default char signedness of the new cluster inherited
+# the old cluster's value.
+command_like(
+ [ 'pg_controldata', $new->data_dir ],
+ qr/Default char data signedness:\s+unsigned/,
+ 'the default char signedness is updated during pg_upgrade');
+
+done_testing();
--
2.43.5
v3-0002-pg_resetwal-Add-char-signedness-option-to-change-.patchapplication/octet-stream; name=v3-0002-pg_resetwal-Add-char-signedness-option-to-change-.patchDownload
From ae80c3382ffcf6a9b13ccdf67dad972bdbf91996 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:08:26 -0700
Subject: [PATCH v3 2/5] pg_resetwal: Add --char-signedness option to change
the default char signedness.
With the newly added option --char-signedness, pg_resetwal updates the
default char signeness flag in the controlfile.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/ref/pg_resetwal.sgml | 14 ++++++++++++++
src/bin/pg_resetwal/pg_resetwal.c | 25 +++++++++++++++++++++++++
src/bin/pg_resetwal/t/001_basic.pl | 6 ++++++
3 files changed, 45 insertions(+)
diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml
index cf9c7e70f27..5c1d1d6bf6e 100644
--- a/doc/src/sgml/ref/pg_resetwal.sgml
+++ b/doc/src/sgml/ref/pg_resetwal.sgml
@@ -171,6 +171,20 @@ PostgreSQL documentation
</para>
<variablelist>
+ <varlistentry>
+ <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term>
+ <listitem>
+ <para>
+ Manually set the default char signedness. Possible values are
+ <literal>signed</literal> and <literal>unsigned</literal>.
+ </para>
+ <para>
+ This option can also be used to change the default char signedness
+ of an existing database cluster.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-c <replaceable class="parameter">xid</replaceable>,<replaceable class="parameter">xid</replaceable></option></term>
<term><option>--commit-timestamp-ids=<replaceable class="parameter">xid</replaceable>,<replaceable class="parameter">xid</replaceable></option></term>
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index ed73607a46f..eeb5710107d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -75,6 +75,7 @@ static TimeLineID minXlogTli = 0;
static XLogSegNo minXlogSegNo = 0;
static int WalSegSz;
static int set_wal_segsize;
+static int set_char_signedness = -1;
static void CheckDataVersion(void);
static bool read_controlfile(void);
@@ -105,6 +106,7 @@ main(int argc, char *argv[])
{"multixact-offset", required_argument, NULL, 'O'},
{"oldest-transaction-id", required_argument, NULL, 'u'},
{"next-transaction-id", required_argument, NULL, 'x'},
+ {"char-signedness", required_argument, NULL, 2},
{"wal-segsize", required_argument, NULL, 1},
{NULL, 0, NULL, 0}
};
@@ -302,6 +304,23 @@ main(int argc, char *argv[])
break;
}
+ case 2:
+ {
+ errno = 0;
+
+ if (pg_strcasecmp(optarg, "signed") == 0)
+ set_char_signedness = 1;
+ else if (pg_strcasecmp(optarg, "unsigned") == 0)
+ set_char_signedness = 0;
+ else
+ {
+ pg_log_error("invalid argument for option %s", "--char-signedness");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ break;
+ }
+
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -456,6 +475,9 @@ main(int argc, char *argv[])
if (set_wal_segsize != 0)
ControlFile.xlog_seg_size = WalSegSz;
+ if (set_char_signedness != -1)
+ ControlFile.default_char_signedness = (set_char_signedness == 1);
+
if (minXlogSegNo > newXlogSegNo)
newXlogSegNo = minXlogSegNo;
@@ -779,6 +801,8 @@ PrintControlValues(bool guessed)
(ControlFile.float8ByVal ? _("by value") : _("by reference")));
printf(_("Data page checksum version: %u\n"),
ControlFile.data_checksum_version);
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile.default_char_signedness ? _("signed") : _("unsigned")));
}
@@ -1189,6 +1213,7 @@ usage(void)
printf(_(" -u, --oldest-transaction-id=XID set oldest transaction ID\n"));
printf(_(" -x, --next-transaction-id=XID set next transaction ID\n"));
printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n"));
+ printf(_(" --char-signedness=OPTION set char signedness to \"signed\" or \"unsigned\"\n"));
printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index d0bd1f7ace8..ec1d4444ab7 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -172,6 +172,12 @@ command_fails_like(
qr/must be greater than/,
'fails with -x value too small');
+# --char-signedness
+command_fails_like(
+ [ 'pg_resetwal', '--char-signedness', 'foo', $node->data_dir ],
+ qr/error: invalid argument for option --char-signedness/,
+ 'fails with incorrect --char-signedness option');
+
# run with control override options
my $out = (run_command([ 'pg_resetwal', '-n', $node->data_dir ]))[0];
--
2.43.5
v3-0001-Add-default_char_signedness-field-to-ControlFileD.patchapplication/octet-stream; name=v3-0001-Add-default_char_signedness-field-to-ControlFileD.patchDownload
From f354092916a66b64310676fcba6a399bc8a29b35 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 11:08:57 -0700
Subject: [PATCH v3 1/5] Add default_char_signedness field to ControlFileData.
The signedness of the 'char' type in C is
implementation-dependent. For instance, 'signed char' is used by
default on x86 CPUs, while 'unsigned char' is used on aarch
CPUs. Previously, we accidentally let C implementation signedness
affect persistent data. This led to inconsistent results when
comparing char data across different platforms.
This commit introduces a new 'default_char_signedness' field in
ControlFileData to store the signedness of the 'char' type. While this
change does not encourage the use of 'char' without explicitly
specifying its signedness, this field can be used as a hint to ensure
consistent behavior for pre-v18 data files that store data sorted by
the 'char' type on disk (e.g., GIN and GiST indexes), especially in
cross-platform replication scenarios.
Newly created database clusters unconditionally set the default char
signedness to true. pg_upgrade (with an upcoming commit) changes this
flag for clusters if the source database cluster has
signedness=false. As a result, signedness=false setting will become
rare over time. If we had known about the problem during the last
development cycle that forced initdb (v8.3), we would have made all
clusters signed or all clusters unsigned. Making pg_upgrade the only
source of signedness=false will cause the population of database
clusters to converge toward that retrospective ideal.
XXX bump catversion.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/func.sgml | 5 ++++
src/backend/access/transam/xlog.c | 40 +++++++++++++++++++++++++
src/backend/utils/misc/pg_controldata.c | 7 +++--
src/bin/pg_controldata/pg_controldata.c | 2 ++
src/include/access/xlog.h | 1 +
src/include/catalog/pg_control.h | 6 ++++
src/include/catalog/pg_proc.dat | 6 ++--
7 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 47370e581ae..b991d663cb5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27935,6 +27935,11 @@ acl | {postgres=arwdDxtm/postgres,foo=r/postgres}
<entry><type>integer</type></entry>
</row>
+ <row>
+ <entry><structfield>default_char_signedness</structfield></entry>
+ <entry><type>bool</type></entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..83122a4d68b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4263,6 +4263,33 @@ WriteControlFile(void)
ControlFile->float8ByVal = FLOAT8PASSBYVAL;
+ /*
+ * Initialize the default 'char' signedness.
+ *
+ * The signedness of the char type is implementation-defined. For instance
+ * on x86 architecture CPUs, the char data type is typically treated as
+ * signed by default, whereas on aarch architecture CPUs, it is typically
+ * treated as unsigned by default. In v17 or earlier, we accidentally let
+ * C implementation signedness affect persistent data. This led to
+ * inconsistent results when comparing char data across different
+ * platforms.
+ *
+ * This flag can be used as a hint to ensure consistent behavior for
+ * pre-v18 data files that store data sorted by the 'char' type on disk,
+ * especially in cross-platform replication scenarios.
+ *
+ * Newly created database clusters unconditionally set the default char
+ * signedness to true. pg_upgrade changes this flag for clusters that were
+ * initialized on signedness=false platforms. As a result,
+ * signedness=false setting will become rare over time. If we had known
+ * about this problem during the last development cycle that forced initdb
+ * (v8.3), we would have made all clusters signed or all clusters
+ * unsigned. Making pg_upgrade the only source of signedness=false will
+ * cause the population of database clusters to converge toward that
+ * retrospective ideal.
+ */
+ ControlFile->default_char_signedness = true;
+
/* Contents are protected with a CRC */
INIT_CRC32C(ControlFile->crc);
COMP_CRC32C(ControlFile->crc,
@@ -4591,6 +4618,19 @@ DataChecksumsEnabled(void)
return (ControlFile->data_checksum_version > 0);
}
+/*
+ * Return true if the cluster was initialized on a platform where the
+ * default signedness of char is "signed". This function exists for code
+ * that deals with pre-v18 data files that store data sorted by the 'char'
+ * type on disk (e.g., GIN and GiST indexes). See the comments in
+ * WriteControlFile() for details.
+ */
+bool
+GetDefaultCharSignedness(void)
+{
+ return ControlFile->default_char_signedness;
+}
+
/*
* Returns a fake LSN for unlogged relations.
*
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 9dfba499c13..6d036e3bf32 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -203,8 +203,8 @@ pg_control_recovery(PG_FUNCTION_ARGS)
Datum
pg_control_init(PG_FUNCTION_ARGS)
{
- Datum values[11];
- bool nulls[11];
+ Datum values[12];
+ bool nulls[12];
TupleDesc tupdesc;
HeapTuple htup;
ControlFileData *ControlFile;
@@ -254,6 +254,9 @@ pg_control_init(PG_FUNCTION_ARGS)
values[10] = Int32GetDatum(ControlFile->data_checksum_version);
nulls[10] = false;
+ values[11] = BoolGetDatum(ControlFile->default_char_signedness);
+ nulls[11] = false;
+
htup = heap_form_tuple(tupdesc, values, nulls);
PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca7..ac74c2c5e9d 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -325,6 +325,8 @@ main(int argc, char *argv[])
(ControlFile->float8ByVal ? _("by value") : _("by reference")));
printf(_("Data page checksum version: %u\n"),
ControlFile->data_checksum_version);
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile->default_char_signedness ? _("signed") : _("unsigned")));
printf(_("Mock authentication nonce: %s\n"),
mock_auth_nonce_str);
return 0;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4411c1468ac..d313099c027 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -231,6 +231,7 @@ extern XLogRecPtr GetXLogWriteRecPtr(void);
extern uint64 GetSystemIdentifier(void);
extern char *GetMockAuthenticationNonce(void);
extern bool DataChecksumsEnabled(void);
+extern bool GetDefaultCharSignedness(void);
extern XLogRecPtr GetFakeLSNForUnloggedRel(void);
extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3797f25b306..0cb86faa43b 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -221,6 +221,12 @@ typedef struct ControlFileData
/* Are data pages protected by checksums? Zero if no checksum version */
uint32 data_checksum_version;
+ /*
+ * True if the default signedness of char is "signed" on a platform where
+ * the cluster is initialized.
+ */
+ bool default_char_signedness;
+
/*
* Random nonce, used in authentication requests that need to proceed
* based on values that are cluster-unique, like a SASL exchange that
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ba02ba53b29..473e9defefc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12175,9 +12175,9 @@
descr => 'pg_controldata init state information as a function',
proname => 'pg_control_init', provolatile => 'v', prorettype => 'record',
proargtypes => '',
- proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4}',
- proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version}',
+ proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4,bool}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version,default_char_signedness}',
prosrc => 'pg_control_init' },
# subscripting support for built-in types
--
2.43.5
On Fri, Jan 17, 2025 at 05:11:41PM -0800, Masahiko Sawada wrote:
Thank you for reviewing the patches. I agree with all the comments you
made. I've addressed them and I've attached new version patches that
now have some regression tests.
Subject: [PATCH v3 4/5] pg_upgrade: Add --set-char-signedness to set the
default char signedness of new cluster.This change adds a new option --set-char-signedness to pg_upgrade. It
enables user to set arbitrary signedness during pg_upgrade. This helps
cases where user who knew they copied the v17 source cluster from
x86 (signedness=true) to ARM (signedness=falese) can pg_upgrade
s/falese/false/
--- a/doc/src/sgml/ref/pgupgrade.sgml +++ b/doc/src/sgml/ref/pgupgrade.sgml @@ -276,6 +276,54 @@ PostgreSQL documentation </listitem> </varlistentry>+ <varlistentry> + <term><option>--set-char-signedness=</option><replaceable>option</replaceable></term>
If upgrading from v18+, valid use cases for this option are rare. I think
using the option should raise an error for v18+ source. A user who knows what
they're doing can use pg_resetwal manually, before the upgrade. This page
shouldn't refer to the pg_resetwal flag specifically, because that need is so
rare. What do you think?
+ <listitem> + <para> + Manually set the default char signedness of new clusters. Possible values + are <literal>signed</literal> and <literal>unsigned</literal>. + </para> + <para> + In the C language, the default signedness of the <type>char</type> type + (when not explicitly specified) varies across platforms. For example, + <type>char</type> defaults to <type>signed char</type> on x86 CPUs but + to <type>unsigned char</type> on ARM CPUs. When data stored using the + <type>char</type> type is persisted to disk, such as in GIN indexes, + this platform-dependent behavior results in incorrect data comparisons + in two scenarios: + </para> + <itemizedlist> + <listitem> + <para> + When data is moved between platforms with different char signedness. + </para> + </listitem> + <listitem> + <para> + When data is replicated using streaming replication across different architectures. + </para> + </listitem> + </itemizedlist>
After your patch series, core PostgreSQL persistent data doesn't depend on
"char" signedness. Extensions should also start calling
GetDefaultCharSignedness() to become independent of "char" signedness. Hence,
let's remove the part starting at "When data stored using the char type" and
ending here. Future users don't need to understand why pre-v18 had a problem.
They mainly need to know how to set this flag.
If we wanted to say something about the breakage before v18, it might be, "If
a pre-v18 cluster has trgm indexes and ran on different platforms at different
times in the history of those indexes, REINDEX those indexes before or after
running pg_upgrade." I'd put that in the release notes, not here in the
pg_upgrade doc page.
+ <para> + Starting from <productname>PostgreSQL</productname> 18, database clusters + maintain their own default char signedness setting, which can be used as + a hint to ensure consistent behavior across platforms with different
Setting this wrong makes your trgm indexes return wrong answers, so it's not a
"hint". (I think "hint" implies minor consequences for getting it wrong.)
+ default char signedness. By default, <application>pg_upgrade</application> + preserves the char signedness setting when upgrading from an existing cluster. + However, when upgrading from <productname>PostgreSQL</productname> 17 or + earlier, <application>pg_upgrade</application> adopts the char signedness + of the platform on which it was built. + </para> + <para> + This option allows you to explicitly set the default char signedness for + the new cluster, overriding any inherited values. This is particularly useful + when you plan to migrate the upgraded cluster to a platform with different + char signedness (for example, when moving from an x86-based system to an + ARM-based system).
Let's refine the terminology around "plan to migrate". Here's an informal
description (not documentation-quality):
1. If you "plan to migrate" but have not yet done so, don't use this flag.
The default behavior is right. Upgrade on the old platform without using
this flag, and then migrate. This is the most-foolproof approach.
2. If you already migrated the cluster data files to a different platform just
before running pg_upgrade, use this flag. It's essential not to modify any
trgm indexes between migrating the data files and running pg_upgrade. Best
to make pg_upgrade the first action that starts the cluster on the new
platform, not starting the cluster manually. This is trickier than (1).
--- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -399,8 +399,14 @@ set_new_cluster_char_signedness(void) { bool new_char_signedness;- /* Inherit the source database's signedness */ - new_char_signedness = old_cluster.controldata.default_char_signedness; + /* + * Use the specified char signedness if specifies. Otherwise we inherit
s/if specifies/if specified/
--- a/contrib/pg_trgm/trgm.h +++ b/contrib/pg_trgm/trgm.h @@ -41,14 +41,12 @@ typedef char trgm[3];#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
-#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
I recommend moving CMPCHAR into the C file along with the two related macros
that you're moving.
--- a/contrib/pg_trgm/trgm_op.c +++ b/contrib/pg_trgm/trgm_op.c @@ -42,6 +42,9 @@ PG_FUNCTION_INFO_V1(strict_word_similarity_commutator_op); PG_FUNCTION_INFO_V1(strict_word_similarity_dist_op); PG_FUNCTION_INFO_V1(strict_word_similarity_dist_commutator_op);+static inline int CMPTRGM_CHOOSE(const void *a, const void *b);
All instances of the "inline" keyword in this patch series are on functions
called only via function pointers. Hence, those functions are never inlined.
I'd remove the keyword.
--- a/src/bin/pg_upgrade/pg_upgrade.h +++ b/src/bin/pg_upgrade/pg_upgrade.h @@ -125,6 +125,12 @@ extern char *output_files[]; */ #define JSONB_FORMAT_CHANGE_CAT_VER 201409291+/* + * The control file was changed to have the default char signedness, + * commit XXXXX. + */ +#define DEFAULT_CHAR_SIGNEDNESS_VAT_VER 202501161
s/VAT/CAT/
+command_like( + [ 'pg_controldata', $old->data_dir ], + qr/Default char data signedness:\s+signed/, + 'default char signedness of old cluster is signed in control file'); +command_like( + [ 'pg_controldata', $new->data_dir ], + qr/Default char data signedness:\s+signed/, + 'default char signedness is new cluster signed in control file');
s/is new cluster signed/of new cluster is signed/ for consistency with the
previous test name.
+ +# Set the old cluster's default char signedness to unsigned for test. +command_ok( + [ + 'pg_resetwal', '--char-signedness', 'unsigned', + '-f', $old->data_dir + ], + "set old cluster's default char signeness to unsigned");
s/signeness/signedness/
Subject: [PATCH v3 2/5] pg_resetwal: Add --char-signedness option to change
the default char signedness.With the newly added option --char-signedness, pg_resetwal updates the
default char signeness flag in the controlfile.
s/signeness/signedness/
--- a/doc/src/sgml/ref/pg_resetwal.sgml +++ b/doc/src/sgml/ref/pg_resetwal.sgml @@ -171,6 +171,20 @@ PostgreSQL documentation </para><variablelist> + <varlistentry> + <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term> + <listitem> + <para> + Manually set the default char signedness. Possible values are + <literal>signed</literal> and <literal>unsigned</literal>. + </para> + <para> + This option can also be used to change the default char signedness + of an existing database cluster. + </para> + </listitem> + </varlistentry>
Each other list entry discusses how to choose a safe value, so this should say
something like that. For users in doubt, nothing other than pg_upgrade should
use this flag. Theoretically, if you just ran pg_upgrade with the wrong value
and had not yet modified any trgm indexes, you could run this to fix the
problem. That's too rare of a situation and too dangerous to document, so the
documentation for this flag should discourage using the flag.
--- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -105,6 +106,7 @@ main(int argc, char *argv[]) {"multixact-offset", required_argument, NULL, 'O'}, {"oldest-transaction-id", required_argument, NULL, 'u'}, {"next-transaction-id", required_argument, NULL, 'x'}, + {"char-signedness", required_argument, NULL, 2}, {"wal-segsize", required_argument, NULL, 1},
Put the ", 2}" entry after the ", 1}" entry. (We alphabetize in usage(), but
we append to the end in these C arrays.)
@@ -1189,6 +1213,7 @@ usage(void) printf(_(" -u, --oldest-transaction-id=XID set oldest transaction ID\n")); printf(_(" -x, --next-transaction-id=XID set next transaction ID\n")); printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n")); + printf(_(" --char-signedness=OPTION set char signedness to \"signed\" or \"unsigned\"\n"));
This --help output alphabetizes by short option, with the one non-short option
at the end. So I'd also alphabetize among the non-short options, specifically
putting the new line before --wal-segsize.
On Mon, Feb 17, 2025 at 2:57 PM Noah Misch <noah@leadboat.com> wrote:
On Fri, Jan 17, 2025 at 05:11:41PM -0800, Masahiko Sawada wrote:
Thank you for reviewing the patches. I agree with all the comments you
made. I've addressed them and I've attached new version patches that
now have some regression tests.
Thank you for reviewing the patches.
Subject: [PATCH v3 4/5] pg_upgrade: Add --set-char-signedness to set the
default char signedness of new cluster.This change adds a new option --set-char-signedness to pg_upgrade. It
enables user to set arbitrary signedness during pg_upgrade. This helps
cases where user who knew they copied the v17 source cluster from
x86 (signedness=true) to ARM (signedness=falese) can pg_upgrades/falese/false/
Fixed.
--- a/doc/src/sgml/ref/pgupgrade.sgml +++ b/doc/src/sgml/ref/pgupgrade.sgml @@ -276,6 +276,54 @@ PostgreSQL documentation </listitem> </varlistentry>+ <varlistentry> + <term><option>--set-char-signedness=</option><replaceable>option</replaceable></term>If upgrading from v18+, valid use cases for this option are rare. I think
using the option should raise an error for v18+ source. A user who knows what
they're doing can use pg_resetwal manually, before the upgrade. This page
shouldn't refer to the pg_resetwal flag specifically, because that need is so
rare. What do you think?
Makes sense. I've added that check and updated the regression tests accordingly.
+ <listitem> + <para> + Manually set the default char signedness of new clusters. Possible values + are <literal>signed</literal> and <literal>unsigned</literal>. + </para> + <para> + In the C language, the default signedness of the <type>char</type> type + (when not explicitly specified) varies across platforms. For example, + <type>char</type> defaults to <type>signed char</type> on x86 CPUs but + to <type>unsigned char</type> on ARM CPUs. When data stored using the + <type>char</type> type is persisted to disk, such as in GIN indexes, + this platform-dependent behavior results in incorrect data comparisons + in two scenarios: + </para> + <itemizedlist> + <listitem> + <para> + When data is moved between platforms with different char signedness. + </para> + </listitem> + <listitem> + <para> + When data is replicated using streaming replication across different architectures. + </para> + </listitem> + </itemizedlist>After your patch series, core PostgreSQL persistent data doesn't depend on
"char" signedness. Extensions should also start calling
GetDefaultCharSignedness() to become independent of "char" signedness. Hence,
let's remove the part starting at "When data stored using the char type" and
ending here. Future users don't need to understand why pre-v18 had a problem.
They mainly need to know how to set this flag.
Removed.
If we wanted to say something about the breakage before v18, it might be, "If
a pre-v18 cluster has trgm indexes and ran on different platforms at different
times in the history of those indexes, REINDEX those indexes before or after
running pg_upgrade." I'd put that in the release notes, not here in the
pg_upgrade doc page.
Agreed.
+ <para> + Starting from <productname>PostgreSQL</productname> 18, database clusters + maintain their own default char signedness setting, which can be used as + a hint to ensure consistent behavior across platforms with differentSetting this wrong makes your trgm indexes return wrong answers, so it's not a
"hint". (I think "hint" implies minor consequences for getting it wrong.)
Fixed.
+ default char signedness. By default, <application>pg_upgrade</application> + preserves the char signedness setting when upgrading from an existing cluster. + However, when upgrading from <productname>PostgreSQL</productname> 17 or + earlier, <application>pg_upgrade</application> adopts the char signedness + of the platform on which it was built. + </para> + <para> + This option allows you to explicitly set the default char signedness for + the new cluster, overriding any inherited values. This is particularly useful + when you plan to migrate the upgraded cluster to a platform with different + char signedness (for example, when moving from an x86-based system to an + ARM-based system).Let's refine the terminology around "plan to migrate". Here's an informal
description (not documentation-quality):1. If you "plan to migrate" but have not yet done so, don't use this flag.
The default behavior is right. Upgrade on the old platform without using
this flag, and then migrate. This is the most-foolproof approach.2. If you already migrated the cluster data files to a different platform just
before running pg_upgrade, use this flag. It's essential not to modify any
trgm indexes between migrating the data files and running pg_upgrade. Best
to make pg_upgrade the first action that starts the cluster on the new
platform, not starting the cluster manually. This is trickier than (1).
Updated this part.
--- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -399,8 +399,14 @@ set_new_cluster_char_signedness(void) { bool new_char_signedness;- /* Inherit the source database's signedness */ - new_char_signedness = old_cluster.controldata.default_char_signedness; + /* + * Use the specified char signedness if specifies. Otherwise we inherits/if specifies/if specified/
Fixed.
--- a/contrib/pg_trgm/trgm.h +++ b/contrib/pg_trgm/trgm.h @@ -41,14 +41,12 @@ typedef char trgm[3];#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
-#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )I recommend moving CMPCHAR into the C file along with the two related macros
that you're moving.
Moved.
--- a/contrib/pg_trgm/trgm_op.c +++ b/contrib/pg_trgm/trgm_op.c @@ -42,6 +42,9 @@ PG_FUNCTION_INFO_V1(strict_word_similarity_commutator_op); PG_FUNCTION_INFO_V1(strict_word_similarity_dist_op); PG_FUNCTION_INFO_V1(strict_word_similarity_dist_commutator_op);+static inline int CMPTRGM_CHOOSE(const void *a, const void *b);
All instances of the "inline" keyword in this patch series are on functions
called only via function pointers. Hence, those functions are never inlined.
I'd remove the keyword.
Removed.
--- a/src/bin/pg_upgrade/pg_upgrade.h +++ b/src/bin/pg_upgrade/pg_upgrade.h @@ -125,6 +125,12 @@ extern char *output_files[]; */ #define JSONB_FORMAT_CHANGE_CAT_VER 201409291+/* + * The control file was changed to have the default char signedness, + * commit XXXXX. + */ +#define DEFAULT_CHAR_SIGNEDNESS_VAT_VER 202501161s/VAT/CAT/
Fixed.
+command_like( + [ 'pg_controldata', $old->data_dir ], + qr/Default char data signedness:\s+signed/, + 'default char signedness of old cluster is signed in control file'); +command_like( + [ 'pg_controldata', $new->data_dir ], + qr/Default char data signedness:\s+signed/, + 'default char signedness is new cluster signed in control file');s/is new cluster signed/of new cluster is signed/ for consistency with the
previous test name.
Fixed.
+ +# Set the old cluster's default char signedness to unsigned for test. +command_ok( + [ + 'pg_resetwal', '--char-signedness', 'unsigned', + '-f', $old->data_dir + ], + "set old cluster's default char signeness to unsigned");s/signeness/signedness/
Fixed.
Subject: [PATCH v3 2/5] pg_resetwal: Add --char-signedness option to change
the default char signedness.With the newly added option --char-signedness, pg_resetwal updates the
default char signeness flag in the controlfile.s/signeness/signedness/
Fixed.
--- a/doc/src/sgml/ref/pg_resetwal.sgml +++ b/doc/src/sgml/ref/pg_resetwal.sgml @@ -171,6 +171,20 @@ PostgreSQL documentation </para><variablelist> + <varlistentry> + <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term> + <listitem> + <para> + Manually set the default char signedness. Possible values are + <literal>signed</literal> and <literal>unsigned</literal>. + </para> + <para> + This option can also be used to change the default char signedness + of an existing database cluster. + </para> + </listitem> + </varlistentry>Each other list entry discusses how to choose a safe value, so this should say
something like that. For users in doubt, nothing other than pg_upgrade should
use this flag. Theoretically, if you just ran pg_upgrade with the wrong value
and had not yet modified any trgm indexes, you could run this to fix the
problem. That's too rare of a situation and too dangerous to document, so the
documentation for this flag should discourage using the flag.
Updated the documentation.
--- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -105,6 +106,7 @@ main(int argc, char *argv[]) {"multixact-offset", required_argument, NULL, 'O'}, {"oldest-transaction-id", required_argument, NULL, 'u'}, {"next-transaction-id", required_argument, NULL, 'x'}, + {"char-signedness", required_argument, NULL, 2}, {"wal-segsize", required_argument, NULL, 1},Put the ", 2}" entry after the ", 1}" entry. (We alphabetize in usage(), but
we append to the end in these C arrays.)
Fixed.
@@ -1189,6 +1213,7 @@ usage(void) printf(_(" -u, --oldest-transaction-id=XID set oldest transaction ID\n")); printf(_(" -x, --next-transaction-id=XID set next transaction ID\n")); printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n")); + printf(_(" --char-signedness=OPTION set char signedness to \"signed\" or \"unsigned\"\n"));This --help output alphabetizes by short option, with the one non-short option
at the end. So I'd also alphabetize among the non-short options, specifically
putting the new line before --wal-segsize.
Fixed.
I've attached the updated patches.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0003-pg_upgrade-Preserve-default-char-signedness-value.patchapplication/octet-stream; name=v4-0003-pg_upgrade-Preserve-default-char-signedness-value.patchDownload
From 12fae9369a27e692d47a617748d4fed15e09095c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:11:23 -0700
Subject: [PATCH v4 3/5] pg_upgrade: Preserve default char signedness value
from old cluster.
Commit XXX introduced the 'default_char_signedness' field in
controlfile. Newly created database clusters always set this field to
'signed'.
This change ensures that pg_upgrade updates the
'default_char_signedness' to 'unsigned' if the source database cluster
has signedness=false. For source clusters from v17 or earlier, which
lack the 'default_char_signedness' information, pg_upgrade assumes the
source cluster was initialized on the same platform where pg_upgrade
is running. It then sets the 'default_char_signedness' value according
to the current platform's default character signedness.
XXX: need to adjust DEFAULT_CHAR_SIGNEDNESS_CAT_VER.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
src/bin/pg_upgrade/controldata.c | 44 +++++++++++++-
src/bin/pg_upgrade/pg_upgrade.c | 28 +++++++++
src/bin/pg_upgrade/pg_upgrade.h | 7 +++
src/bin/pg_upgrade/t/005_char_signedness.pl | 65 +++++++++++++++++++++
4 files changed, 143 insertions(+), 1 deletion(-)
create mode 100644 src/bin/pg_upgrade/t/005_char_signedness.pl
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index ab16716f319..bd49ea867bf 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include <ctype.h>
+#include <limits.h> /* for CHAR_MIN */
#include "common/string.h"
#include "pg_upgrade.h"
@@ -62,6 +63,7 @@ get_control_data(ClusterInfo *cluster)
bool got_date_is_int = false;
bool got_data_checksum_version = false;
bool got_cluster_state = false;
+ bool got_default_char_signedness = false;
char *lc_collate = NULL;
char *lc_ctype = NULL;
char *lc_monetary = NULL;
@@ -501,6 +503,25 @@ get_control_data(ClusterInfo *cluster)
cluster->controldata.data_checksum_version = str2uint(p);
got_data_checksum_version = true;
}
+ else if ((p = strstr(bufin, "Default char data signedness:")) != NULL)
+ {
+ p = strchr(p, ':');
+
+ if (p == NULL || strlen(p) <= 1)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ /* Skip the colon and any whitespace after it */
+ p++;
+ while (isspace((unsigned char) *p))
+ p++;
+
+ /* The value should be either 'signed' or 'unsigned' */
+ if (strcmp(p, "signed") != 0 && strcmp(p, "unsigned") != 0)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ cluster->controldata.default_char_signedness = strcmp(p, "signed") == 0;
+ got_default_char_signedness = true;
+ }
}
rc = pclose(output);
@@ -561,6 +582,21 @@ get_control_data(ClusterInfo *cluster)
}
}
+ /*
+ * Pre-v18 database clusters don't have the default char signedness
+ * information. We use the char signedness of the platform where
+ * pg_upgrade was built.
+ */
+ if (cluster->controldata.cat_ver < DEFAULT_CHAR_SIGNEDNESS_CAT_VER)
+ {
+ Assert(!got_default_char_signedness);
+#if CHAR_MIN != 0
+ cluster->controldata.default_char_signedness = true;
+#else
+ cluster->controldata.default_char_signedness = false;
+#endif
+ }
+
/* verify that we got all the mandatory pg_control data */
if (!got_xid || !got_oid ||
!got_multi || !got_oldestxid ||
@@ -572,7 +608,9 @@ get_control_data(ClusterInfo *cluster)
!got_index || !got_toast ||
(!got_large_object &&
cluster->controldata.ctrl_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) ||
- !got_date_is_int || !got_data_checksum_version)
+ !got_date_is_int || !got_data_checksum_version ||
+ (!got_default_char_signedness &&
+ cluster->controldata.cat_ver >= DEFAULT_CHAR_SIGNEDNESS_CAT_VER))
{
if (cluster == &old_cluster)
pg_log(PG_REPORT,
@@ -641,6 +679,10 @@ get_control_data(ClusterInfo *cluster)
if (!got_data_checksum_version)
pg_log(PG_REPORT, " data checksum version");
+ /* value added in Postgres 18 */
+ if (!got_default_char_signedness)
+ pg_log(PG_REPORT, " default char signedness");
+
pg_fatal("Cannot continue without required control information, terminating");
}
}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 36c7f3879d5..cc7357b5599 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -54,6 +54,7 @@
*/
#define RESTORE_TRANSACTION_SIZE 1000
+static void set_new_cluster_char_signedness(void);
static void set_locale_and_encoding(void);
static void prepare_new_cluster(void);
static void prepare_new_globals(void);
@@ -154,6 +155,7 @@ main(int argc, char **argv)
*/
copy_xact_xlog_xid();
+ set_new_cluster_char_signedness();
/* New now using xids of the old system */
@@ -388,6 +390,32 @@ setup(char *argv0)
}
}
+/*
+ * Set the new cluster's default char signedness using the old cluster's
+ * value.
+ */
+static void
+set_new_cluster_char_signedness(void)
+{
+ bool new_char_signedness;
+
+ /* Inherit the source database's signedness */
+ new_char_signedness = old_cluster.controldata.default_char_signedness;
+
+ /* Change the char signedness of the new cluster, if necessary */
+ if (new_cluster.controldata.default_char_signedness != new_char_signedness)
+ {
+ prep_status("Setting the default char signedness for new cluster");
+
+ exec_prog(UTILITY_LOG_FILE, NULL, true, true,
+ "\"%s/pg_resetwal\" --char-signedness %s \"%s\"",
+ new_cluster.bindir,
+ new_char_signedness ? "signed" : "unsigned",
+ new_cluster.pgdata);
+
+ check_ok();
+ }
+}
/*
* Copy locale and encoding information into the new cluster's template0.
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 0cdd675e4f1..26991e71009 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -125,6 +125,12 @@ extern char *output_files[];
*/
#define JSONB_FORMAT_CHANGE_CAT_VER 201409291
+/*
+ * The control file was changed to have the default char signedness,
+ * commit XXXXX.
+ */
+#define DEFAULT_CHAR_SIGNEDNESS_CAT_VER 202501161
+
/*
* Each relation is represented by a relinfo structure.
@@ -245,6 +251,7 @@ typedef struct
bool date_is_int;
bool float8_pass_by_value;
uint32 data_checksum_version;
+ bool default_char_signedness;
} ControlData;
/*
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
new file mode 100644
index 00000000000..05c3014a27d
--- /dev/null
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -0,0 +1,65 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for handling the default char signedness during upgrade.
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Can be changed to test the other modes
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Initialize old and new clusters
+my $old = PostgreSQL::Test::Cluster->new('old');
+my $new = PostgreSQL::Test::Cluster->new('new');
+$old->init();
+$new->init();
+
+# Check the default char signedness of both the old and the new clusters.
+# Newly created clusters unconditionally use 'signed'.
+command_like(
+ [ 'pg_controldata', $old->data_dir ],
+ qr/Default char data signedness:\s+signed/,
+ 'default char signedness of old cluster is signed in control file');
+command_like(
+ [ 'pg_controldata', $new->data_dir ],
+ qr/Default char data signedness:\s+signed/,
+ 'default char signedness of new cluster is signed in control file');
+
+# Set the old cluster's default char signedness to unsigned for test.
+command_ok(
+ [ 'pg_resetwal', '--char-signedness', 'unsigned', '-f', $old->data_dir ],
+ "set old cluster's default char signedness to unsigned");
+
+# Check if the value is successfully updated.
+command_like(
+ [ 'pg_controldata', $old->data_dir ],
+ qr/Default char data signedness:\s+unsigned/,
+ 'updated default char signedness is unsigned in control file');
+
+# pg_upgrade should be successful.
+command_ok(
+ [
+ 'pg_upgrade', '--no-sync',
+ '-d', $old->data_dir,
+ '-D', $new->data_dir,
+ '-b', $old->config_data('--bindir'),
+ '-B', $new->config_data('--bindir'),
+ '-s', $new->host,
+ '-p', $old->port,
+ '-P', $new->port,
+ $mode
+ ],
+ 'run of pg_upgrade');
+
+# Check if the default char signedness of the new cluster inherited
+# the old cluster's value.
+command_like(
+ [ 'pg_controldata', $new->data_dir ],
+ qr/Default char data signedness:\s+unsigned/,
+ 'the default char signedness is updated during pg_upgrade');
+
+done_testing();
--
2.43.5
v4-0004-pg_upgrade-Add-set-char-signedness-to-set-the-def.patchapplication/octet-stream; name=v4-0004-pg_upgrade-Add-set-char-signedness-to-set-the-def.patchDownload
From 0385f6d006e189edff93ac434f0b46daa28413a5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:12:27 -0700
Subject: [PATCH v4 4/5] pg_upgrade: Add --set-char-signedness to set the
default char signedness of new cluster.
This change adds a new option --set-char-signedness to pg_upgrade. It
enables user to set arbitrary signedness during pg_upgrade. This helps
cases where user who knew they copied the v17 source cluster from
x86 (signedness=true) to ARM (signedness=false) can pg_upgrade
properly without the prerequisite of acquiring an x86 VM.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/ref/pgupgrade.sgml | 53 +++++++++++++++++++++
src/bin/pg_upgrade/check.c | 12 +++++
src/bin/pg_upgrade/option.c | 12 +++++
src/bin/pg_upgrade/pg_upgrade.c | 10 +++-
src/bin/pg_upgrade/pg_upgrade.h | 3 ++
src/bin/pg_upgrade/t/005_char_signedness.pl | 17 +++++++
6 files changed, 105 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 4777381dac2..cb9cc838b27 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -276,6 +276,59 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>--set-char-signedness=</option><replaceable>option</replaceable></term>
+ <listitem>
+ <para>
+ Manually set the default char signedness of new clusters. Possible values
+ are <literal>signed</literal> and <literal>unsigned</literal>.
+ </para>
+ <para>
+ In the C language, the default signedness of the <type>char</type> type
+ (when not explicitly specified) varies across platforms. For example,
+ <type>char</type> defaults to <type>signed char</type> on x86 CPUs but
+ to <type>unsigned char</type> on ARM CPUs.
+ </para>
+ <para>
+ Starting from <productname>PostgreSQL</productname> 18, database clusters
+ maintain their own default char signedness setting, which can be used to
+ ensure consistent behavior across platforms with different default char
+ signedness. By default, <application>pg_upgrade</application> preserves
+ the char signedness setting when upgrading from an existing cluster.
+ However, when upgrading from <productname>PostgreSQL</productname> 17 or,
+ earlier <application>pg_upgrade</application> adopts the char signedness
+ of the platform on which it was built.
+ </para>
+ <para>
+ This option allows you to explicitly set the default char signedness for
+ the new cluster, overriding any inherited values. There are two specific
+ scenarios where this option is relevant:
+ <itemizedlist>
+ <listitem>
+ <para>
+ If you are planning to migrate to a different platform after the upgrade,
+ you should not use this option. The default behavior is right in this case.
+ Instead, perform the upgrade on the original platform without this flag,
+ and then migrate the cluster afterward. This is the recommended and safest
+ approach.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If you have already migrated the cluster to a platform with different
+ char signedness (for example, from an x86-based system to an ARM-based
+ system), you should use this option to specify the signedness matching
+ the original platform's default char signedness. Additionally, it's
+ essential not to modify any data files between migrating data files and
+ running <command>pg_upgrade</command>. <command>pg_upgrade</command>
+ should be the first operation that starts the cluster on the new platform.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-?</option></term>
<term><option>--help</option></term>
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 7ca1d8fffc9..d6f629dd3a2 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -838,6 +838,18 @@ check_cluster_versions(void)
GET_MAJOR_VERSION(new_cluster.bin_version))
pg_fatal("New cluster data and binary directories are from different major versions.");
+ /*
+ * Since from version 18, newly created database clusters always have
+ * 'signed' default char-signedness, it makes less sense to use
+ * --set-char-signedness option for upgrading from version 18 or later.
+ * Users who want to change the default char signedness of the new
+ * cluster, they can use pg_resetwal manually before the upgrade.
+ */
+ if (GET_MAJOR_VERSION(old_cluster.major_version) >= 1800 &&
+ user_opts.char_signedness != -1)
+ pg_fatal("%s option cannot be used to upgrade from PostgreSQL %s and later.",
+ "--set-char-signedness", "18");
+
check_ok();
}
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 108eb7a1ba4..1a580d656bb 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
{"copy", no_argument, NULL, 2},
{"copy-file-range", no_argument, NULL, 3},
{"sync-method", required_argument, NULL, 4},
+ {"set-char-signedness", required_argument, NULL, 5},
{NULL, 0, NULL, 0}
};
@@ -70,6 +71,7 @@ parseCommandLine(int argc, char *argv[])
user_opts.do_sync = true;
user_opts.transfer_mode = TRANSFER_MODE_COPY;
+ user_opts.char_signedness = -1;
os_info.progname = get_progname(argv[0]);
@@ -212,6 +214,14 @@ parseCommandLine(int argc, char *argv[])
user_opts.sync_method = pg_strdup(optarg);
break;
+ case 5:
+ if (pg_strcasecmp(optarg, "signed") == 0)
+ user_opts.char_signedness = 1;
+ else if (pg_strcasecmp(optarg, "unsigned") == 0)
+ user_opts.char_signedness = 0;
+ else
+ pg_fatal("invalid argument for option %s", "--set-char-signedness");
+ break;
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
os_info.progname);
@@ -306,6 +316,8 @@ usage(void)
printf(_(" --clone clone instead of copying files to new cluster\n"));
printf(_(" --copy copy files to new cluster (default)\n"));
printf(_(" --copy-file-range copy files to new cluster with copy_file_range\n"));
+ printf(_(" --set-char-signedness=OPTION set new cluster char signedness to \"signed\" or\n"));
+ printf(_(" \"unsigned\"\n"));
printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
printf(_(" -?, --help show this help, then exit\n"));
printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index cc7357b5599..e95be8b459d 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -399,8 +399,14 @@ set_new_cluster_char_signedness(void)
{
bool new_char_signedness;
- /* Inherit the source database's signedness */
- new_char_signedness = old_cluster.controldata.default_char_signedness;
+ /*
+ * Use the specified char signedness if specified. Otherwise we inherit
+ * inherit the source database's signedness.
+ */
+ if (user_opts.char_signedness != -1)
+ new_char_signedness = (user_opts.char_signedness == 1);
+ else
+ new_char_signedness = old_cluster.controldata.default_char_signedness;
/* Change the char signedness of the new cluster, if necessary */
if (new_cluster.controldata.default_char_signedness != new_char_signedness)
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 26991e71009..7d50c83d0bf 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -334,6 +334,9 @@ typedef struct
int jobs; /* number of processes/threads to use */
char *socketdir; /* directory to use for Unix sockets */
char *sync_method;
+ int char_signedness; /* default char signedness: -1 for initial
+ * value, 1 for "signed" and 0 for
+ * "unsigned" */
} UserOpts;
typedef struct
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
index 05c3014a27d..c024106863e 100644
--- a/src/bin/pg_upgrade/t/005_char_signedness.pl
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -40,6 +40,23 @@ command_like(
qr/Default char data signedness:\s+unsigned/,
'updated default char signedness is unsigned in control file');
+# Cannot use --set-char-signedness option for upgrading from v18+
+command_fails(
+ [
+ 'pg_upgrade', '--no-sync',
+ '-d', $old->data_dir,
+ '-D', $new->data_dir,
+ '-b', $old->config_data('--bindir'),
+ '-B', $new->config_data('--bindir'),
+ '-s', $new->host,
+ '-p', $old->port,
+ '-P', $new->port,
+ '-set-char-signedness', 'signed',
+ $mode
+ ],
+ '--set-char-signedness option cannot be used for upgrading from v18 or later'
+);
+
# pg_upgrade should be successful.
command_ok(
[
--
2.43.5
v4-0002-pg_resetwal-Add-char-signedness-option-to-change-.patchapplication/octet-stream; name=v4-0002-pg_resetwal-Add-char-signedness-option-to-change-.patchDownload
From f5c324b3f91510a155a9afe3ad3ccfdff466aff8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:08:26 -0700
Subject: [PATCH v4 2/5] pg_resetwal: Add --char-signedness option to change
the default char signedness.
With the newly added option --char-signedness, pg_resetwal updates the
default char signedness flag in the controlfile. This option is
primarily intended for an upcoming patch that pg_upgrade supports
preserving the default char signedness during upgrades, and is not
meant for manual operation.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/ref/pg_resetwal.sgml | 16 ++++++++++++++++
src/bin/pg_resetwal/pg_resetwal.c | 25 +++++++++++++++++++++++++
src/bin/pg_resetwal/t/001_basic.pl | 6 ++++++
3 files changed, 47 insertions(+)
diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml
index cf9c7e70f27..a72678dfef7 100644
--- a/doc/src/sgml/ref/pg_resetwal.sgml
+++ b/doc/src/sgml/ref/pg_resetwal.sgml
@@ -171,6 +171,22 @@ PostgreSQL documentation
</para>
<variablelist>
+ <varlistentry>
+ <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term>
+ <listitem>
+ <para>
+ Manually set the default char signedness. Possible values are
+ <literal>signed</literal> and <literal>unsigned</literal>.
+ </para>
+ <para>
+ A safe value for this option is, if known, the default char signedness
+ of the platform where the database cluster was initialized. However,
+ this option is exclusively for use with <command>pg_upgrade</command>
+ and should not normally be used manually.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-c <replaceable class="parameter">xid</replaceable>,<replaceable class="parameter">xid</replaceable></option></term>
<term><option>--commit-timestamp-ids=<replaceable class="parameter">xid</replaceable>,<replaceable class="parameter">xid</replaceable></option></term>
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index ed73607a46f..31bc0abff16 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -75,6 +75,7 @@ static TimeLineID minXlogTli = 0;
static XLogSegNo minXlogSegNo = 0;
static int WalSegSz;
static int set_wal_segsize;
+static int set_char_signedness = -1;
static void CheckDataVersion(void);
static bool read_controlfile(void);
@@ -106,6 +107,7 @@ main(int argc, char *argv[])
{"oldest-transaction-id", required_argument, NULL, 'u'},
{"next-transaction-id", required_argument, NULL, 'x'},
{"wal-segsize", required_argument, NULL, 1},
+ {"char-signedness", required_argument, NULL, 2},
{NULL, 0, NULL, 0}
};
@@ -302,6 +304,23 @@ main(int argc, char *argv[])
break;
}
+ case 2:
+ {
+ errno = 0;
+
+ if (pg_strcasecmp(optarg, "signed") == 0)
+ set_char_signedness = 1;
+ else if (pg_strcasecmp(optarg, "unsigned") == 0)
+ set_char_signedness = 0;
+ else
+ {
+ pg_log_error("invalid argument for option %s", "--char-signedness");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ break;
+ }
+
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -456,6 +475,9 @@ main(int argc, char *argv[])
if (set_wal_segsize != 0)
ControlFile.xlog_seg_size = WalSegSz;
+ if (set_char_signedness != -1)
+ ControlFile.default_char_signedness = (set_char_signedness == 1);
+
if (minXlogSegNo > newXlogSegNo)
newXlogSegNo = minXlogSegNo;
@@ -779,6 +801,8 @@ PrintControlValues(bool guessed)
(ControlFile.float8ByVal ? _("by value") : _("by reference")));
printf(_("Data page checksum version: %u\n"),
ControlFile.data_checksum_version);
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile.default_char_signedness ? _("signed") : _("unsigned")));
}
@@ -1188,6 +1212,7 @@ usage(void)
printf(_(" -O, --multixact-offset=OFFSET set next multitransaction offset\n"));
printf(_(" -u, --oldest-transaction-id=XID set oldest transaction ID\n"));
printf(_(" -x, --next-transaction-id=XID set next transaction ID\n"));
+ printf(_(" --char-signedness=OPTION set char signedness to \"signed\" or \"unsigned\"\n"));
printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n"));
printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 323cd483cf3..d6bbbd0ceda 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -173,6 +173,12 @@ command_fails_like(
qr/must be greater than/,
'fails with -x value too small');
+# --char-signedness
+command_fails_like(
+ [ 'pg_resetwal', '--char-signedness', 'foo', $node->data_dir ],
+ qr/error: invalid argument for option --char-signedness/,
+ 'fails with incorrect --char-signedness option');
+
# run with control override options
my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
--
2.43.5
v4-0005-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchapplication/octet-stream; name=v4-0005-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchDownload
From fd0d6406d63fcc9bea385ec4479cebc91eff7aae Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 15:17:02 -0700
Subject: [PATCH v4 5/5] Fix an issue with index scan using pg_trgm due to char
signedness on different architectures.
GIN and GiST indexes utilizing pg_trgm's opclasses store sorted
trigrams within index tuples. When comparing and sorting each trigram,
pg_trgm treats each character as a 'char[3]' type in C. However, the
char type in C can be interpreted as either signed char or unsigned
char, depending on the platform, if the signedness is not explicitly
specified. Consequently, during replication between different CPU
architectures, there was an issue where index scans on standby servers
could not locate matching index tuples due to the differing treatment
of character signedness.
This change introduces comparison functions for trgm that explicitly
handle signed char and unsigned char. The appropriate comparison
function will be dynamically selected based on the character
signedness stored in the control file. Therefore, upgraded clusters
can utilize the indexes without rebuilding, provided the cluster
upgrade occurs on platforms with the same character signedness as the
original cluster initialization.
The default char signedness information was introduced in XXXX, so no
backpatch.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
contrib/pg_trgm/trgm.h | 5 +----
contrib/pg_trgm/trgm_op.c | 44 +++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index 10827563694..ca017585369 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -40,15 +40,12 @@
typedef char trgm[3];
-#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
-#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
-
#define CPTRGM(a,b) do { \
*(((char*)(a))+0) = *(((char*)(b))+0); \
*(((char*)(a))+1) = *(((char*)(b))+1); \
*(((char*)(a))+2) = *(((char*)(b))+2); \
} while(0)
+extern int (*CMPTRGM) (const void *a, const void *b);
#define ISWORDCHR(c) (t_isalnum(c))
#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index d0833b3e4a1..94b9015fd67 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -42,6 +42,9 @@ PG_FUNCTION_INFO_V1(strict_word_similarity_commutator_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_commutator_op);
+static int CMPTRGM_CHOOSE(const void *a, const void *b);
+int (*CMPTRGM) (const void *a, const void *b) = CMPTRGM_CHOOSE;
+
/* Trigram with position */
typedef struct
{
@@ -107,6 +110,47 @@ _PG_init(void)
MarkGUCPrefixReserved("pg_trgm");
}
+#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
+
+/*
+ * Functions for comparing two trgms while treating each char as "signed char" or
+ * "unsigned char".
+ */
+static inline int
+CMPTRGM_SIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_S(a,b,i) CMPCHAR( *(((const signed char*)(a))+i), *(((const signed char*)(b))+i) )
+
+ return CMPPCHAR_S(a, b, 0) ? CMPPCHAR_S(a, b, 0)
+ : (CMPPCHAR_S(a, b, 1) ? CMPPCHAR_S(a, b, 1)
+ : CMPPCHAR_S(a, b, 2));
+}
+
+static inline int
+CMPTRGM_UNSIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_UNS(a,b,i) CMPCHAR( *(((const unsigned char*)(a))+i), *(((const unsigned char*)(b))+i) )
+
+ return CMPPCHAR_UNS(a, b, 0) ? CMPPCHAR_UNS(a, b, 0)
+ : (CMPPCHAR_UNS(a, b, 1) ? CMPPCHAR_UNS(a, b, 1)
+ : CMPPCHAR_UNS(a, b, 2));
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer so
+ * that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+CMPTRGM_CHOOSE(const void *a, const void *b)
+{
+ if (GetDefaultCharSignedness())
+ CMPTRGM = CMPTRGM_SIGNED;
+ else
+ CMPTRGM = CMPTRGM_UNSIGNED;
+
+ return CMPTRGM(a, b);
+}
+
/*
* Deprecated function.
* Use "pg_trgm.similarity_threshold" GUC variable instead of this function.
--
2.43.5
v4-0001-Add-default_char_signedness-field-to-ControlFileD.patchapplication/octet-stream; name=v4-0001-Add-default_char_signedness-field-to-ControlFileD.patchDownload
From 4a9655f80de0ecdbc43734204aca35ac6d6ee97a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 11:08:57 -0700
Subject: [PATCH v4 1/5] Add default_char_signedness field to ControlFileData.
The signedness of the 'char' type in C is
implementation-dependent. For instance, 'signed char' is used by
default on x86 CPUs, while 'unsigned char' is used on aarch
CPUs. Previously, we accidentally let C implementation signedness
affect persistent data. This led to inconsistent results when
comparing char data across different platforms.
This commit introduces a new 'default_char_signedness' field in
ControlFileData to store the signedness of the 'char' type. While this
change does not encourage the use of 'char' without explicitly
specifying its signedness, this field can be used as a hint to ensure
consistent behavior for pre-v18 data files that store data sorted by
the 'char' type on disk (e.g., GIN and GiST indexes), especially in
cross-platform replication scenarios.
Newly created database clusters unconditionally set the default char
signedness to true. pg_upgrade (with an upcoming commit) changes this
flag for clusters if the source database cluster has
signedness=false. As a result, signedness=false setting will become
rare over time. If we had known about the problem during the last
development cycle that forced initdb (v8.3), we would have made all
clusters signed or all clusters unsigned. Making pg_upgrade the only
source of signedness=false will cause the population of database
clusters to converge toward that retrospective ideal.
XXX bump catversion.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/func.sgml | 5 ++++
src/backend/access/transam/xlog.c | 40 +++++++++++++++++++++++++
src/backend/utils/misc/pg_controldata.c | 7 +++--
src/bin/pg_controldata/pg_controldata.c | 2 ++
src/include/access/xlog.h | 1 +
src/include/catalog/pg_control.h | 6 ++++
src/include/catalog/pg_proc.dat | 6 ++--
7 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7efc81936ab..4ffd3063a4e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27986,6 +27986,11 @@ acl | {postgres=arwdDxtm/postgres,foo=r/postgres}
<entry><type>integer</type></entry>
</row>
+ <row>
+ <entry><structfield>default_char_signedness</structfield></entry>
+ <entry><type>bool</type></entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 25a5c605404..90832c37a53 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4284,6 +4284,33 @@ WriteControlFile(void)
ControlFile->float8ByVal = FLOAT8PASSBYVAL;
+ /*
+ * Initialize the default 'char' signedness.
+ *
+ * The signedness of the char type is implementation-defined. For instance
+ * on x86 architecture CPUs, the char data type is typically treated as
+ * signed by default, whereas on aarch architecture CPUs, it is typically
+ * treated as unsigned by default. In v17 or earlier, we accidentally let
+ * C implementation signedness affect persistent data. This led to
+ * inconsistent results when comparing char data across different
+ * platforms.
+ *
+ * This flag can be used as a hint to ensure consistent behavior for
+ * pre-v18 data files that store data sorted by the 'char' type on disk,
+ * especially in cross-platform replication scenarios.
+ *
+ * Newly created database clusters unconditionally set the default char
+ * signedness to true. pg_upgrade changes this flag for clusters that were
+ * initialized on signedness=false platforms. As a result,
+ * signedness=false setting will become rare over time. If we had known
+ * about this problem during the last development cycle that forced initdb
+ * (v8.3), we would have made all clusters signed or all clusters
+ * unsigned. Making pg_upgrade the only source of signedness=false will
+ * cause the population of database clusters to converge toward that
+ * retrospective ideal.
+ */
+ ControlFile->default_char_signedness = true;
+
/* Contents are protected with a CRC */
INIT_CRC32C(ControlFile->crc);
COMP_CRC32C(ControlFile->crc,
@@ -4612,6 +4639,19 @@ DataChecksumsEnabled(void)
return (ControlFile->data_checksum_version > 0);
}
+/*
+ * Return true if the cluster was initialized on a platform where the
+ * default signedness of char is "signed". This function exists for code
+ * that deals with pre-v18 data files that store data sorted by the 'char'
+ * type on disk (e.g., GIN and GiST indexes). See the comments in
+ * WriteControlFile() for details.
+ */
+bool
+GetDefaultCharSignedness(void)
+{
+ return ControlFile->default_char_signedness;
+}
+
/*
* Returns a fake LSN for unlogged relations.
*
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 9dfba499c13..6d036e3bf32 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -203,8 +203,8 @@ pg_control_recovery(PG_FUNCTION_ARGS)
Datum
pg_control_init(PG_FUNCTION_ARGS)
{
- Datum values[11];
- bool nulls[11];
+ Datum values[12];
+ bool nulls[12];
TupleDesc tupdesc;
HeapTuple htup;
ControlFileData *ControlFile;
@@ -254,6 +254,9 @@ pg_control_init(PG_FUNCTION_ARGS)
values[10] = Int32GetDatum(ControlFile->data_checksum_version);
nulls[10] = false;
+ values[11] = BoolGetDatum(ControlFile->default_char_signedness);
+ nulls[11] = false;
+
htup = heap_form_tuple(tupdesc, values, nulls);
PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index cf11ab3f2ee..bea779eef94 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -336,6 +336,8 @@ main(int argc, char *argv[])
(ControlFile->float8ByVal ? _("by value") : _("by reference")));
printf(_("Data page checksum version: %u\n"),
ControlFile->data_checksum_version);
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile->default_char_signedness ? _("signed") : _("unsigned")));
printf(_("Mock authentication nonce: %s\n"),
mock_auth_nonce_str);
return 0;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4411c1468ac..d313099c027 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -231,6 +231,7 @@ extern XLogRecPtr GetXLogWriteRecPtr(void);
extern uint64 GetSystemIdentifier(void);
extern char *GetMockAuthenticationNonce(void);
extern bool DataChecksumsEnabled(void);
+extern bool GetDefaultCharSignedness(void);
extern XLogRecPtr GetFakeLSNForUnloggedRel(void);
extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3797f25b306..0cb86faa43b 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -221,6 +221,12 @@ typedef struct ControlFileData
/* Are data pages protected by checksums? Zero if no checksum version */
uint32 data_checksum_version;
+ /*
+ * True if the default signedness of char is "signed" on a platform where
+ * the cluster is initialized.
+ */
+ bool default_char_signedness;
+
/*
* Random nonce, used in authentication requests that need to proceed
* based on values that are cluster-unique, like a SASL exchange that
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9e803d610d7..e2d5c0d0886 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12206,9 +12206,9 @@
descr => 'pg_controldata init state information as a function',
proname => 'pg_control_init', provolatile => 'v', prorettype => 'record',
proargtypes => '',
- proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4}',
- proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version}',
+ proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4,bool}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version,default_char_signedness}',
prosrc => 'pg_control_init' },
# subscripting support for built-in types
--
2.43.5
Apart from two doc issues, this is ready:
On Tue, Feb 18, 2025 at 01:23:20PM -0800, Masahiko Sawada wrote:
On Mon, Feb 17, 2025 at 2:57 PM Noah Misch <noah@leadboat.com> wrote:
On Fri, Jan 17, 2025 at 05:11:41PM -0800, Masahiko Sawada wrote:
+ However, when upgrading from <productname>PostgreSQL</productname> 17 or, + earlier <application>pg_upgrade</application> adopts the char signedness
s/or, earlier/or earlier,/
--- a/doc/src/sgml/ref/pg_resetwal.sgml +++ b/doc/src/sgml/ref/pg_resetwal.sgml @@ -171,6 +171,22 @@ PostgreSQL documentation </para><variablelist> + <varlistentry> + <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term> + <listitem> + <para> + Manually set the default char signedness. Possible values are + <literal>signed</literal> and <literal>unsigned</literal>. + </para> + <para> + A safe value for this option is, if known, the default char signedness + of the platform where the database cluster was initialized. However,
Only if initialized on v17 or earlier. I recommend this edit:
diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml
index a72678d..dd011d2 100644
--- a/doc/src/sgml/ref/pg_resetwal.sgml
+++ b/doc/src/sgml/ref/pg_resetwal.sgml
@@ -179,8 +179,11 @@ PostgreSQL documentation
<literal>signed</literal> and <literal>unsigned</literal>.
</para>
<para>
- A safe value for this option is, if known, the default char signedness
- of the platform where the database cluster was initialized. However,
+ For a database cluster that <command>pg_upgrade</command> upgraded from
+ a <productname>PostgreSQL</productname> version before 18, the safe
+ value would be the default <type>char</type> signedness of the platform
+ that ran the cluster before that upgrade. For all other
+ clusters, <literal>signed</literal> would be the safe value. However,
this option is exclusively for use with <command>pg_upgrade</command>
and should not normally be used manually.
</para>
On Tue, Feb 18, 2025 at 3:23 PM Noah Misch <noah@leadboat.com> wrote:
Apart from two doc issues, this is ready:
On Tue, Feb 18, 2025 at 01:23:20PM -0800, Masahiko Sawada wrote:
On Mon, Feb 17, 2025 at 2:57 PM Noah Misch <noah@leadboat.com> wrote:
On Fri, Jan 17, 2025 at 05:11:41PM -0800, Masahiko Sawada wrote:
+ However, when upgrading from <productname>PostgreSQL</productname> 17 or, + earlier <application>pg_upgrade</application> adopts the char signednesss/or, earlier/or earlier,/
--- a/doc/src/sgml/ref/pg_resetwal.sgml +++ b/doc/src/sgml/ref/pg_resetwal.sgml @@ -171,6 +171,22 @@ PostgreSQL documentation </para><variablelist> + <varlistentry> + <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term> + <listitem> + <para> + Manually set the default char signedness. Possible values are + <literal>signed</literal> and <literal>unsigned</literal>. + </para> + <para> + A safe value for this option is, if known, the default char signedness + of the platform where the database cluster was initialized. However,Only if initialized on v17 or earlier. I recommend this edit:
diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml index a72678d..dd011d2 100644 --- a/doc/src/sgml/ref/pg_resetwal.sgml +++ b/doc/src/sgml/ref/pg_resetwal.sgml @@ -179,8 +179,11 @@ PostgreSQL documentation <literal>signed</literal> and <literal>unsigned</literal>. </para> <para> - A safe value for this option is, if known, the default char signedness - of the platform where the database cluster was initialized. However, + For a database cluster that <command>pg_upgrade</command> upgraded from + a <productname>PostgreSQL</productname> version before 18, the safe + value would be the default <type>char</type> signedness of the platform + that ran the cluster before that upgrade. For all other + clusters, <literal>signed</literal> would be the safe value. However, this option is exclusively for use with <command>pg_upgrade</command> and should not normally be used manually. </para>
Thank you for reviewing the patches. I've fixed these issues and
attached the updated patches.
I have one question about the 0001 patch; since we add
'default_char_signedness' field to ControlFileData do we need to bump
PG_CONTROL_VERSION? We have comments about bumping PG_CONTROL_VERSION
when changing CheckPoint struct or DBState enum so it seems likely but
I'd like to confirm just in case that we need to bump
PG_CONTROL_VERSION also when changing ControlFileData. If we need, can
we bump it to 1800? or 1701?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0002-pg_resetwal-Add-char-signedness-option-to-change-.patchapplication/x-patch; name=v5-0002-pg_resetwal-Add-char-signedness-option-to-change-.patchDownload
From 55d365dd4e1b33e55834b00007b8b580f2db0686 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:08:26 -0700
Subject: [PATCH v5 2/5] pg_resetwal: Add --char-signedness option to change
the default char signedness.
With the newly added option --char-signedness, pg_resetwal updates the
default char signedness flag in the controlfile. This option is
primarily intended for an upcoming patch that pg_upgrade supports
preserving the default char signedness during upgrades, and is not
meant for manual operation.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/ref/pg_resetwal.sgml | 19 +++++++++++++++++++
src/bin/pg_resetwal/pg_resetwal.c | 25 +++++++++++++++++++++++++
src/bin/pg_resetwal/t/001_basic.pl | 6 ++++++
3 files changed, 50 insertions(+)
diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml
index cf9c7e70f27..dd011d246c1 100644
--- a/doc/src/sgml/ref/pg_resetwal.sgml
+++ b/doc/src/sgml/ref/pg_resetwal.sgml
@@ -171,6 +171,25 @@ PostgreSQL documentation
</para>
<variablelist>
+ <varlistentry>
+ <term><option>--char-signedness=<replaceable class="parameter">option</replaceable></option></term>
+ <listitem>
+ <para>
+ Manually set the default char signedness. Possible values are
+ <literal>signed</literal> and <literal>unsigned</literal>.
+ </para>
+ <para>
+ For a database cluster that <command>pg_upgrade</command> upgraded from
+ a <productname>PostgreSQL</productname> version before 18, the safe
+ value would be the default <type>char</type> signedness of the platform
+ that ran the cluster before that upgrade. For all other
+ clusters, <literal>signed</literal> would be the safe value. However,
+ this option is exclusively for use with <command>pg_upgrade</command>
+ and should not normally be used manually.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-c <replaceable class="parameter">xid</replaceable>,<replaceable class="parameter">xid</replaceable></option></term>
<term><option>--commit-timestamp-ids=<replaceable class="parameter">xid</replaceable>,<replaceable class="parameter">xid</replaceable></option></term>
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index ed73607a46f..31bc0abff16 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -75,6 +75,7 @@ static TimeLineID minXlogTli = 0;
static XLogSegNo minXlogSegNo = 0;
static int WalSegSz;
static int set_wal_segsize;
+static int set_char_signedness = -1;
static void CheckDataVersion(void);
static bool read_controlfile(void);
@@ -106,6 +107,7 @@ main(int argc, char *argv[])
{"oldest-transaction-id", required_argument, NULL, 'u'},
{"next-transaction-id", required_argument, NULL, 'x'},
{"wal-segsize", required_argument, NULL, 1},
+ {"char-signedness", required_argument, NULL, 2},
{NULL, 0, NULL, 0}
};
@@ -302,6 +304,23 @@ main(int argc, char *argv[])
break;
}
+ case 2:
+ {
+ errno = 0;
+
+ if (pg_strcasecmp(optarg, "signed") == 0)
+ set_char_signedness = 1;
+ else if (pg_strcasecmp(optarg, "unsigned") == 0)
+ set_char_signedness = 0;
+ else
+ {
+ pg_log_error("invalid argument for option %s", "--char-signedness");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ break;
+ }
+
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -456,6 +475,9 @@ main(int argc, char *argv[])
if (set_wal_segsize != 0)
ControlFile.xlog_seg_size = WalSegSz;
+ if (set_char_signedness != -1)
+ ControlFile.default_char_signedness = (set_char_signedness == 1);
+
if (minXlogSegNo > newXlogSegNo)
newXlogSegNo = minXlogSegNo;
@@ -779,6 +801,8 @@ PrintControlValues(bool guessed)
(ControlFile.float8ByVal ? _("by value") : _("by reference")));
printf(_("Data page checksum version: %u\n"),
ControlFile.data_checksum_version);
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile.default_char_signedness ? _("signed") : _("unsigned")));
}
@@ -1188,6 +1212,7 @@ usage(void)
printf(_(" -O, --multixact-offset=OFFSET set next multitransaction offset\n"));
printf(_(" -u, --oldest-transaction-id=XID set oldest transaction ID\n"));
printf(_(" -x, --next-transaction-id=XID set next transaction ID\n"));
+ printf(_(" --char-signedness=OPTION set char signedness to \"signed\" or \"unsigned\"\n"));
printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n"));
printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 323cd483cf3..d6bbbd0ceda 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl
+++ b/src/bin/pg_resetwal/t/001_basic.pl
@@ -173,6 +173,12 @@ command_fails_like(
qr/must be greater than/,
'fails with -x value too small');
+# --char-signedness
+command_fails_like(
+ [ 'pg_resetwal', '--char-signedness', 'foo', $node->data_dir ],
+ qr/error: invalid argument for option --char-signedness/,
+ 'fails with incorrect --char-signedness option');
+
# run with control override options
my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
--
2.43.5
v5-0005-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchapplication/x-patch; name=v5-0005-Fix-an-issue-with-index-scan-using-pg_trgm-due-to.patchDownload
From fd9d95eac0ed5c64815e07d27ab892be4f0aad11 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 15:17:02 -0700
Subject: [PATCH v5 5/5] Fix an issue with index scan using pg_trgm due to char
signedness on different architectures.
GIN and GiST indexes utilizing pg_trgm's opclasses store sorted
trigrams within index tuples. When comparing and sorting each trigram,
pg_trgm treats each character as a 'char[3]' type in C. However, the
char type in C can be interpreted as either signed char or unsigned
char, depending on the platform, if the signedness is not explicitly
specified. Consequently, during replication between different CPU
architectures, there was an issue where index scans on standby servers
could not locate matching index tuples due to the differing treatment
of character signedness.
This change introduces comparison functions for trgm that explicitly
handle signed char and unsigned char. The appropriate comparison
function will be dynamically selected based on the character
signedness stored in the control file. Therefore, upgraded clusters
can utilize the indexes without rebuilding, provided the cluster
upgrade occurs on platforms with the same character signedness as the
original cluster initialization.
The default char signedness information was introduced in XXXX, so no
backpatch.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
contrib/pg_trgm/trgm.h | 5 +----
contrib/pg_trgm/trgm_op.c | 44 +++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index 10827563694..ca017585369 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -40,15 +40,12 @@
typedef char trgm[3];
-#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
-#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
-
#define CPTRGM(a,b) do { \
*(((char*)(a))+0) = *(((char*)(b))+0); \
*(((char*)(a))+1) = *(((char*)(b))+1); \
*(((char*)(a))+2) = *(((char*)(b))+2); \
} while(0)
+extern int (*CMPTRGM) (const void *a, const void *b);
#define ISWORDCHR(c) (t_isalnum(c))
#define ISPRINTABLECHAR(a) ( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index d0833b3e4a1..94b9015fd67 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -42,6 +42,9 @@ PG_FUNCTION_INFO_V1(strict_word_similarity_commutator_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_op);
PG_FUNCTION_INFO_V1(strict_word_similarity_dist_commutator_op);
+static int CMPTRGM_CHOOSE(const void *a, const void *b);
+int (*CMPTRGM) (const void *a, const void *b) = CMPTRGM_CHOOSE;
+
/* Trigram with position */
typedef struct
{
@@ -107,6 +110,47 @@ _PG_init(void)
MarkGUCPrefixReserved("pg_trgm");
}
+#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
+
+/*
+ * Functions for comparing two trgms while treating each char as "signed char" or
+ * "unsigned char".
+ */
+static inline int
+CMPTRGM_SIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_S(a,b,i) CMPCHAR( *(((const signed char*)(a))+i), *(((const signed char*)(b))+i) )
+
+ return CMPPCHAR_S(a, b, 0) ? CMPPCHAR_S(a, b, 0)
+ : (CMPPCHAR_S(a, b, 1) ? CMPPCHAR_S(a, b, 1)
+ : CMPPCHAR_S(a, b, 2));
+}
+
+static inline int
+CMPTRGM_UNSIGNED(const void *a, const void *b)
+{
+#define CMPPCHAR_UNS(a,b,i) CMPCHAR( *(((const unsigned char*)(a))+i), *(((const unsigned char*)(b))+i) )
+
+ return CMPPCHAR_UNS(a, b, 0) ? CMPPCHAR_UNS(a, b, 0)
+ : (CMPPCHAR_UNS(a, b, 1) ? CMPPCHAR_UNS(a, b, 1)
+ : CMPPCHAR_UNS(a, b, 2));
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer so
+ * that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+CMPTRGM_CHOOSE(const void *a, const void *b)
+{
+ if (GetDefaultCharSignedness())
+ CMPTRGM = CMPTRGM_SIGNED;
+ else
+ CMPTRGM = CMPTRGM_UNSIGNED;
+
+ return CMPTRGM(a, b);
+}
+
/*
* Deprecated function.
* Use "pg_trgm.similarity_threshold" GUC variable instead of this function.
--
2.43.5
v5-0003-pg_upgrade-Preserve-default-char-signedness-value.patchapplication/x-patch; name=v5-0003-pg_upgrade-Preserve-default-char-signedness-value.patchDownload
From 5296822018301c5b6d4d59981cbf4a6bd78b4633 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:11:23 -0700
Subject: [PATCH v5 3/5] pg_upgrade: Preserve default char signedness value
from old cluster.
Commit XXX introduced the 'default_char_signedness' field in
controlfile. Newly created database clusters always set this field to
'signed'.
This change ensures that pg_upgrade updates the
'default_char_signedness' to 'unsigned' if the source database cluster
has signedness=false. For source clusters from v17 or earlier, which
lack the 'default_char_signedness' information, pg_upgrade assumes the
source cluster was initialized on the same platform where pg_upgrade
is running. It then sets the 'default_char_signedness' value according
to the current platform's default character signedness.
XXX: need to adjust DEFAULT_CHAR_SIGNEDNESS_CAT_VER.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
src/bin/pg_upgrade/controldata.c | 44 +++++++++++++-
src/bin/pg_upgrade/pg_upgrade.c | 28 +++++++++
src/bin/pg_upgrade/pg_upgrade.h | 7 +++
src/bin/pg_upgrade/t/005_char_signedness.pl | 65 +++++++++++++++++++++
4 files changed, 143 insertions(+), 1 deletion(-)
create mode 100644 src/bin/pg_upgrade/t/005_char_signedness.pl
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index ab16716f319..bd49ea867bf 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include <ctype.h>
+#include <limits.h> /* for CHAR_MIN */
#include "common/string.h"
#include "pg_upgrade.h"
@@ -62,6 +63,7 @@ get_control_data(ClusterInfo *cluster)
bool got_date_is_int = false;
bool got_data_checksum_version = false;
bool got_cluster_state = false;
+ bool got_default_char_signedness = false;
char *lc_collate = NULL;
char *lc_ctype = NULL;
char *lc_monetary = NULL;
@@ -501,6 +503,25 @@ get_control_data(ClusterInfo *cluster)
cluster->controldata.data_checksum_version = str2uint(p);
got_data_checksum_version = true;
}
+ else if ((p = strstr(bufin, "Default char data signedness:")) != NULL)
+ {
+ p = strchr(p, ':');
+
+ if (p == NULL || strlen(p) <= 1)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ /* Skip the colon and any whitespace after it */
+ p++;
+ while (isspace((unsigned char) *p))
+ p++;
+
+ /* The value should be either 'signed' or 'unsigned' */
+ if (strcmp(p, "signed") != 0 && strcmp(p, "unsigned") != 0)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ cluster->controldata.default_char_signedness = strcmp(p, "signed") == 0;
+ got_default_char_signedness = true;
+ }
}
rc = pclose(output);
@@ -561,6 +582,21 @@ get_control_data(ClusterInfo *cluster)
}
}
+ /*
+ * Pre-v18 database clusters don't have the default char signedness
+ * information. We use the char signedness of the platform where
+ * pg_upgrade was built.
+ */
+ if (cluster->controldata.cat_ver < DEFAULT_CHAR_SIGNEDNESS_CAT_VER)
+ {
+ Assert(!got_default_char_signedness);
+#if CHAR_MIN != 0
+ cluster->controldata.default_char_signedness = true;
+#else
+ cluster->controldata.default_char_signedness = false;
+#endif
+ }
+
/* verify that we got all the mandatory pg_control data */
if (!got_xid || !got_oid ||
!got_multi || !got_oldestxid ||
@@ -572,7 +608,9 @@ get_control_data(ClusterInfo *cluster)
!got_index || !got_toast ||
(!got_large_object &&
cluster->controldata.ctrl_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) ||
- !got_date_is_int || !got_data_checksum_version)
+ !got_date_is_int || !got_data_checksum_version ||
+ (!got_default_char_signedness &&
+ cluster->controldata.cat_ver >= DEFAULT_CHAR_SIGNEDNESS_CAT_VER))
{
if (cluster == &old_cluster)
pg_log(PG_REPORT,
@@ -641,6 +679,10 @@ get_control_data(ClusterInfo *cluster)
if (!got_data_checksum_version)
pg_log(PG_REPORT, " data checksum version");
+ /* value added in Postgres 18 */
+ if (!got_default_char_signedness)
+ pg_log(PG_REPORT, " default char signedness");
+
pg_fatal("Cannot continue without required control information, terminating");
}
}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 36c7f3879d5..cc7357b5599 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -54,6 +54,7 @@
*/
#define RESTORE_TRANSACTION_SIZE 1000
+static void set_new_cluster_char_signedness(void);
static void set_locale_and_encoding(void);
static void prepare_new_cluster(void);
static void prepare_new_globals(void);
@@ -154,6 +155,7 @@ main(int argc, char **argv)
*/
copy_xact_xlog_xid();
+ set_new_cluster_char_signedness();
/* New now using xids of the old system */
@@ -388,6 +390,32 @@ setup(char *argv0)
}
}
+/*
+ * Set the new cluster's default char signedness using the old cluster's
+ * value.
+ */
+static void
+set_new_cluster_char_signedness(void)
+{
+ bool new_char_signedness;
+
+ /* Inherit the source database's signedness */
+ new_char_signedness = old_cluster.controldata.default_char_signedness;
+
+ /* Change the char signedness of the new cluster, if necessary */
+ if (new_cluster.controldata.default_char_signedness != new_char_signedness)
+ {
+ prep_status("Setting the default char signedness for new cluster");
+
+ exec_prog(UTILITY_LOG_FILE, NULL, true, true,
+ "\"%s/pg_resetwal\" --char-signedness %s \"%s\"",
+ new_cluster.bindir,
+ new_char_signedness ? "signed" : "unsigned",
+ new_cluster.pgdata);
+
+ check_ok();
+ }
+}
/*
* Copy locale and encoding information into the new cluster's template0.
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 0cdd675e4f1..26991e71009 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -125,6 +125,12 @@ extern char *output_files[];
*/
#define JSONB_FORMAT_CHANGE_CAT_VER 201409291
+/*
+ * The control file was changed to have the default char signedness,
+ * commit XXXXX.
+ */
+#define DEFAULT_CHAR_SIGNEDNESS_CAT_VER 202501161
+
/*
* Each relation is represented by a relinfo structure.
@@ -245,6 +251,7 @@ typedef struct
bool date_is_int;
bool float8_pass_by_value;
uint32 data_checksum_version;
+ bool default_char_signedness;
} ControlData;
/*
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
new file mode 100644
index 00000000000..05c3014a27d
--- /dev/null
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -0,0 +1,65 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for handling the default char signedness during upgrade.
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Can be changed to test the other modes
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Initialize old and new clusters
+my $old = PostgreSQL::Test::Cluster->new('old');
+my $new = PostgreSQL::Test::Cluster->new('new');
+$old->init();
+$new->init();
+
+# Check the default char signedness of both the old and the new clusters.
+# Newly created clusters unconditionally use 'signed'.
+command_like(
+ [ 'pg_controldata', $old->data_dir ],
+ qr/Default char data signedness:\s+signed/,
+ 'default char signedness of old cluster is signed in control file');
+command_like(
+ [ 'pg_controldata', $new->data_dir ],
+ qr/Default char data signedness:\s+signed/,
+ 'default char signedness of new cluster is signed in control file');
+
+# Set the old cluster's default char signedness to unsigned for test.
+command_ok(
+ [ 'pg_resetwal', '--char-signedness', 'unsigned', '-f', $old->data_dir ],
+ "set old cluster's default char signedness to unsigned");
+
+# Check if the value is successfully updated.
+command_like(
+ [ 'pg_controldata', $old->data_dir ],
+ qr/Default char data signedness:\s+unsigned/,
+ 'updated default char signedness is unsigned in control file');
+
+# pg_upgrade should be successful.
+command_ok(
+ [
+ 'pg_upgrade', '--no-sync',
+ '-d', $old->data_dir,
+ '-D', $new->data_dir,
+ '-b', $old->config_data('--bindir'),
+ '-B', $new->config_data('--bindir'),
+ '-s', $new->host,
+ '-p', $old->port,
+ '-P', $new->port,
+ $mode
+ ],
+ 'run of pg_upgrade');
+
+# Check if the default char signedness of the new cluster inherited
+# the old cluster's value.
+command_like(
+ [ 'pg_controldata', $new->data_dir ],
+ qr/Default char data signedness:\s+unsigned/,
+ 'the default char signedness is updated during pg_upgrade');
+
+done_testing();
--
2.43.5
v5-0001-Add-default_char_signedness-field-to-ControlFileD.patchapplication/x-patch; name=v5-0001-Add-default_char_signedness-field-to-ControlFileD.patchDownload
From a064ae71ee5d4990acff8eef60eb90de80141f9c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Sep 2024 11:08:57 -0700
Subject: [PATCH v5 1/5] Add default_char_signedness field to ControlFileData.
The signedness of the 'char' type in C is
implementation-dependent. For instance, 'signed char' is used by
default on x86 CPUs, while 'unsigned char' is used on aarch
CPUs. Previously, we accidentally let C implementation signedness
affect persistent data. This led to inconsistent results when
comparing char data across different platforms.
This commit introduces a new 'default_char_signedness' field in
ControlFileData to store the signedness of the 'char' type. While this
change does not encourage the use of 'char' without explicitly
specifying its signedness, this field can be used as a hint to ensure
consistent behavior for pre-v18 data files that store data sorted by
the 'char' type on disk (e.g., GIN and GiST indexes), especially in
cross-platform replication scenarios.
Newly created database clusters unconditionally set the default char
signedness to true. pg_upgrade (with an upcoming commit) changes this
flag for clusters if the source database cluster has
signedness=false. As a result, signedness=false setting will become
rare over time. If we had known about the problem during the last
development cycle that forced initdb (v8.3), we would have made all
clusters signed or all clusters unsigned. Making pg_upgrade the only
source of signedness=false will cause the population of database
clusters to converge toward that retrospective ideal.
XXX bump catversion.
XXX bump PG_CONTROL_VERSION too?
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/func.sgml | 5 ++++
src/backend/access/transam/xlog.c | 40 +++++++++++++++++++++++++
src/backend/utils/misc/pg_controldata.c | 7 +++--
src/bin/pg_controldata/pg_controldata.c | 2 ++
src/include/access/xlog.h | 1 +
src/include/catalog/pg_control.h | 6 ++++
src/include/catalog/pg_proc.dat | 6 ++--
7 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7efc81936ab..21c666af2d2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27986,6 +27986,11 @@ acl | {postgres=arwdDxtm/postgres,foo=r/postgres}
<entry><type>integer</type></entry>
</row>
+ <row>
+ <entry><structfield>default_char_signedness</structfield></entry>
+ <entry><type>boolean</type></entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 25a5c605404..90832c37a53 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4284,6 +4284,33 @@ WriteControlFile(void)
ControlFile->float8ByVal = FLOAT8PASSBYVAL;
+ /*
+ * Initialize the default 'char' signedness.
+ *
+ * The signedness of the char type is implementation-defined. For instance
+ * on x86 architecture CPUs, the char data type is typically treated as
+ * signed by default, whereas on aarch architecture CPUs, it is typically
+ * treated as unsigned by default. In v17 or earlier, we accidentally let
+ * C implementation signedness affect persistent data. This led to
+ * inconsistent results when comparing char data across different
+ * platforms.
+ *
+ * This flag can be used as a hint to ensure consistent behavior for
+ * pre-v18 data files that store data sorted by the 'char' type on disk,
+ * especially in cross-platform replication scenarios.
+ *
+ * Newly created database clusters unconditionally set the default char
+ * signedness to true. pg_upgrade changes this flag for clusters that were
+ * initialized on signedness=false platforms. As a result,
+ * signedness=false setting will become rare over time. If we had known
+ * about this problem during the last development cycle that forced initdb
+ * (v8.3), we would have made all clusters signed or all clusters
+ * unsigned. Making pg_upgrade the only source of signedness=false will
+ * cause the population of database clusters to converge toward that
+ * retrospective ideal.
+ */
+ ControlFile->default_char_signedness = true;
+
/* Contents are protected with a CRC */
INIT_CRC32C(ControlFile->crc);
COMP_CRC32C(ControlFile->crc,
@@ -4612,6 +4639,19 @@ DataChecksumsEnabled(void)
return (ControlFile->data_checksum_version > 0);
}
+/*
+ * Return true if the cluster was initialized on a platform where the
+ * default signedness of char is "signed". This function exists for code
+ * that deals with pre-v18 data files that store data sorted by the 'char'
+ * type on disk (e.g., GIN and GiST indexes). See the comments in
+ * WriteControlFile() for details.
+ */
+bool
+GetDefaultCharSignedness(void)
+{
+ return ControlFile->default_char_signedness;
+}
+
/*
* Returns a fake LSN for unlogged relations.
*
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 9dfba499c13..6d036e3bf32 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -203,8 +203,8 @@ pg_control_recovery(PG_FUNCTION_ARGS)
Datum
pg_control_init(PG_FUNCTION_ARGS)
{
- Datum values[11];
- bool nulls[11];
+ Datum values[12];
+ bool nulls[12];
TupleDesc tupdesc;
HeapTuple htup;
ControlFileData *ControlFile;
@@ -254,6 +254,9 @@ pg_control_init(PG_FUNCTION_ARGS)
values[10] = Int32GetDatum(ControlFile->data_checksum_version);
nulls[10] = false;
+ values[11] = BoolGetDatum(ControlFile->default_char_signedness);
+ nulls[11] = false;
+
htup = heap_form_tuple(tupdesc, values, nulls);
PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index cf11ab3f2ee..bea779eef94 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -336,6 +336,8 @@ main(int argc, char *argv[])
(ControlFile->float8ByVal ? _("by value") : _("by reference")));
printf(_("Data page checksum version: %u\n"),
ControlFile->data_checksum_version);
+ printf(_("Default char data signedness: %s\n"),
+ (ControlFile->default_char_signedness ? _("signed") : _("unsigned")));
printf(_("Mock authentication nonce: %s\n"),
mock_auth_nonce_str);
return 0;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4411c1468ac..d313099c027 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -231,6 +231,7 @@ extern XLogRecPtr GetXLogWriteRecPtr(void);
extern uint64 GetSystemIdentifier(void);
extern char *GetMockAuthenticationNonce(void);
extern bool DataChecksumsEnabled(void);
+extern bool GetDefaultCharSignedness(void);
extern XLogRecPtr GetFakeLSNForUnloggedRel(void);
extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3797f25b306..0cb86faa43b 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -221,6 +221,12 @@ typedef struct ControlFileData
/* Are data pages protected by checksums? Zero if no checksum version */
uint32 data_checksum_version;
+ /*
+ * True if the default signedness of char is "signed" on a platform where
+ * the cluster is initialized.
+ */
+ bool default_char_signedness;
+
/*
* Random nonce, used in authentication requests that need to proceed
* based on values that are cluster-unique, like a SASL exchange that
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9e803d610d7..e2d5c0d0886 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12206,9 +12206,9 @@
descr => 'pg_controldata init state information as a function',
proname => 'pg_control_init', provolatile => 'v', prorettype => 'record',
proargtypes => '',
- proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4}',
- proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version}',
+ proallargtypes => '{int4,int4,int4,int4,int4,int4,int4,int4,int4,bool,int4,bool}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{max_data_alignment,database_block_size,blocks_per_segment,wal_block_size,bytes_per_wal_segment,max_identifier_length,max_index_columns,max_toast_chunk_size,large_object_chunk_size,float8_pass_by_value,data_page_checksum_version,default_char_signedness}',
prosrc => 'pg_control_init' },
# subscripting support for built-in types
--
2.43.5
v5-0004-pg_upgrade-Add-set-char-signedness-to-set-the-def.patchapplication/x-patch; name=v5-0004-pg_upgrade-Add-set-char-signedness-to-set-the-def.patchDownload
From 25344fad6f925e328dc80b139e2ca80ae6117dca Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 2 Oct 2024 15:12:27 -0700
Subject: [PATCH v5 4/5] pg_upgrade: Add --set-char-signedness to set the
default char signedness of new cluster.
This change adds a new option --set-char-signedness to pg_upgrade. It
enables user to set arbitrary signedness during pg_upgrade. This helps
cases where user who knew they copied the v17 source cluster from
x86 (signedness=true) to ARM (signedness=false) can pg_upgrade
properly without the prerequisite of acquiring an x86 VM.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
---
doc/src/sgml/ref/pgupgrade.sgml | 53 +++++++++++++++++++++
src/bin/pg_upgrade/check.c | 12 +++++
src/bin/pg_upgrade/option.c | 12 +++++
src/bin/pg_upgrade/pg_upgrade.c | 10 +++-
src/bin/pg_upgrade/pg_upgrade.h | 3 ++
src/bin/pg_upgrade/t/005_char_signedness.pl | 17 +++++++
6 files changed, 105 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 4777381dac2..27d0b668deb 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -276,6 +276,59 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>--set-char-signedness=</option><replaceable>option</replaceable></term>
+ <listitem>
+ <para>
+ Manually set the default char signedness of new clusters. Possible values
+ are <literal>signed</literal> and <literal>unsigned</literal>.
+ </para>
+ <para>
+ In the C language, the default signedness of the <type>char</type> type
+ (when not explicitly specified) varies across platforms. For example,
+ <type>char</type> defaults to <type>signed char</type> on x86 CPUs but
+ to <type>unsigned char</type> on ARM CPUs.
+ </para>
+ <para>
+ Starting from <productname>PostgreSQL</productname> 18, database clusters
+ maintain their own default char signedness setting, which can be used to
+ ensure consistent behavior across platforms with different default char
+ signedness. By default, <application>pg_upgrade</application> preserves
+ the char signedness setting when upgrading from an existing cluster.
+ However, when upgrading from <productname>PostgreSQL</productname> 17 or
+ earlier, <application>pg_upgrade</application> adopts the char signedness
+ of the platform on which it was built.
+ </para>
+ <para>
+ This option allows you to explicitly set the default char signedness for
+ the new cluster, overriding any inherited values. There are two specific
+ scenarios where this option is relevant:
+ <itemizedlist>
+ <listitem>
+ <para>
+ If you are planning to migrate to a different platform after the upgrade,
+ you should not use this option. The default behavior is right in this case.
+ Instead, perform the upgrade on the original platform without this flag,
+ and then migrate the cluster afterward. This is the recommended and safest
+ approach.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If you have already migrated the cluster to a platform with different
+ char signedness (for example, from an x86-based system to an ARM-based
+ system), you should use this option to specify the signedness matching
+ the original platform's default char signedness. Additionally, it's
+ essential not to modify any data files between migrating data files and
+ running <command>pg_upgrade</command>. <command>pg_upgrade</command>
+ should be the first operation that starts the cluster on the new platform.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-?</option></term>
<term><option>--help</option></term>
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 7ca1d8fffc9..d6f629dd3a2 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -838,6 +838,18 @@ check_cluster_versions(void)
GET_MAJOR_VERSION(new_cluster.bin_version))
pg_fatal("New cluster data and binary directories are from different major versions.");
+ /*
+ * Since from version 18, newly created database clusters always have
+ * 'signed' default char-signedness, it makes less sense to use
+ * --set-char-signedness option for upgrading from version 18 or later.
+ * Users who want to change the default char signedness of the new
+ * cluster, they can use pg_resetwal manually before the upgrade.
+ */
+ if (GET_MAJOR_VERSION(old_cluster.major_version) >= 1800 &&
+ user_opts.char_signedness != -1)
+ pg_fatal("%s option cannot be used to upgrade from PostgreSQL %s and later.",
+ "--set-char-signedness", "18");
+
check_ok();
}
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 108eb7a1ba4..1a580d656bb 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
{"copy", no_argument, NULL, 2},
{"copy-file-range", no_argument, NULL, 3},
{"sync-method", required_argument, NULL, 4},
+ {"set-char-signedness", required_argument, NULL, 5},
{NULL, 0, NULL, 0}
};
@@ -70,6 +71,7 @@ parseCommandLine(int argc, char *argv[])
user_opts.do_sync = true;
user_opts.transfer_mode = TRANSFER_MODE_COPY;
+ user_opts.char_signedness = -1;
os_info.progname = get_progname(argv[0]);
@@ -212,6 +214,14 @@ parseCommandLine(int argc, char *argv[])
user_opts.sync_method = pg_strdup(optarg);
break;
+ case 5:
+ if (pg_strcasecmp(optarg, "signed") == 0)
+ user_opts.char_signedness = 1;
+ else if (pg_strcasecmp(optarg, "unsigned") == 0)
+ user_opts.char_signedness = 0;
+ else
+ pg_fatal("invalid argument for option %s", "--set-char-signedness");
+ break;
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
os_info.progname);
@@ -306,6 +316,8 @@ usage(void)
printf(_(" --clone clone instead of copying files to new cluster\n"));
printf(_(" --copy copy files to new cluster (default)\n"));
printf(_(" --copy-file-range copy files to new cluster with copy_file_range\n"));
+ printf(_(" --set-char-signedness=OPTION set new cluster char signedness to \"signed\" or\n"));
+ printf(_(" \"unsigned\"\n"));
printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
printf(_(" -?, --help show this help, then exit\n"));
printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index cc7357b5599..e95be8b459d 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -399,8 +399,14 @@ set_new_cluster_char_signedness(void)
{
bool new_char_signedness;
- /* Inherit the source database's signedness */
- new_char_signedness = old_cluster.controldata.default_char_signedness;
+ /*
+ * Use the specified char signedness if specified. Otherwise we inherit
+ * inherit the source database's signedness.
+ */
+ if (user_opts.char_signedness != -1)
+ new_char_signedness = (user_opts.char_signedness == 1);
+ else
+ new_char_signedness = old_cluster.controldata.default_char_signedness;
/* Change the char signedness of the new cluster, if necessary */
if (new_cluster.controldata.default_char_signedness != new_char_signedness)
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 26991e71009..7d50c83d0bf 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -334,6 +334,9 @@ typedef struct
int jobs; /* number of processes/threads to use */
char *socketdir; /* directory to use for Unix sockets */
char *sync_method;
+ int char_signedness; /* default char signedness: -1 for initial
+ * value, 1 for "signed" and 0 for
+ * "unsigned" */
} UserOpts;
typedef struct
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
index 05c3014a27d..c024106863e 100644
--- a/src/bin/pg_upgrade/t/005_char_signedness.pl
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -40,6 +40,23 @@ command_like(
qr/Default char data signedness:\s+unsigned/,
'updated default char signedness is unsigned in control file');
+# Cannot use --set-char-signedness option for upgrading from v18+
+command_fails(
+ [
+ 'pg_upgrade', '--no-sync',
+ '-d', $old->data_dir,
+ '-D', $new->data_dir,
+ '-b', $old->config_data('--bindir'),
+ '-B', $new->config_data('--bindir'),
+ '-s', $new->host,
+ '-p', $old->port,
+ '-P', $new->port,
+ '-set-char-signedness', 'signed',
+ $mode
+ ],
+ '--set-char-signedness option cannot be used for upgrading from v18 or later'
+);
+
# pg_upgrade should be successful.
command_ok(
[
--
2.43.5
On Wed, Feb 19, 2025 at 10:48:29AM -0800, Masahiko Sawada wrote:
Thank you for reviewing the patches. I've fixed these issues and
attached the updated patches.
Looks good.
I have one question about the 0001 patch; since we add
'default_char_signedness' field to ControlFileData do we need to bump
PG_CONTROL_VERSION? We have comments about bumping PG_CONTROL_VERSION
when changing CheckPoint struct or DBState enum so it seems likely but
I'd like to confirm just in case that we need to bump
PG_CONTROL_VERSION also when changing ControlFileData.
Yes. (I'm not aware of value we get from having distinct control file version
and catalog version, but we do have both.)
If we need, can
we bump it to 1800? or 1701?
I'd do 1800. The pattern seems to be to bump to 1800 for the first pg_control
change of the v18 cycle, then 1801, then 1802 for the third change of the
cycle. That's based on this history:
git log -U0 -p src/include/catalog/pg_control.h | grep -E '^(Date|\+#define PG_CONTROL_VERSION)'
On Thu, Feb 20, 2025 at 2:07 PM Noah Misch <noah@leadboat.com> wrote:
On Wed, Feb 19, 2025 at 10:48:29AM -0800, Masahiko Sawada wrote:
Thank you for reviewing the patches. I've fixed these issues and
attached the updated patches.Looks good.
I have one question about the 0001 patch; since we add
'default_char_signedness' field to ControlFileData do we need to bump
PG_CONTROL_VERSION? We have comments about bumping PG_CONTROL_VERSION
when changing CheckPoint struct or DBState enum so it seems likely but
I'd like to confirm just in case that we need to bump
PG_CONTROL_VERSION also when changing ControlFileData.Yes. (I'm not aware of value we get from having distinct control file version
and catalog version, but we do have both.)If we need, can
we bump it to 1800? or 1701?I'd do 1800. The pattern seems to be to bump to 1800 for the first pg_control
change of the v18 cycle, then 1801, then 1802 for the third change of the
cycle. That's based on this history:git log -U0 -p src/include/catalog/pg_control.h | grep -E '^(Date|\+#define PG_CONTROL_VERSION)'
Thank you for the confirmation. That makes sense to me.
I'll push these patches with version bumps, barring any objections or
further comments.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 20, 2025 at 3:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Feb 20, 2025 at 2:07 PM Noah Misch <noah@leadboat.com> wrote:
On Wed, Feb 19, 2025 at 10:48:29AM -0800, Masahiko Sawada wrote:
Thank you for reviewing the patches. I've fixed these issues and
attached the updated patches.Looks good.
I have one question about the 0001 patch; since we add
'default_char_signedness' field to ControlFileData do we need to bump
PG_CONTROL_VERSION? We have comments about bumping PG_CONTROL_VERSION
when changing CheckPoint struct or DBState enum so it seems likely but
I'd like to confirm just in case that we need to bump
PG_CONTROL_VERSION also when changing ControlFileData.Yes. (I'm not aware of value we get from having distinct control file version
and catalog version, but we do have both.)If we need, can
we bump it to 1800? or 1701?I'd do 1800. The pattern seems to be to bump to 1800 for the first pg_control
change of the v18 cycle, then 1801, then 1802 for the third change of the
cycle. That's based on this history:git log -U0 -p src/include/catalog/pg_control.h | grep -E '^(Date|\+#define PG_CONTROL_VERSION)'
Thank you for the confirmation. That makes sense to me.
I'll push these patches with version bumps, barring any objections or
further comments.
Pushed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
While working on another round of the long option and fat comma style
cleanup, I noticed that the test for pg_upgrade --set-char-signedess
doesn't test what it's supposed to:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl index 05c3014a27d..c024106863e 100644 --- a/src/bin/pg_upgrade/t/005_char_signedness.pl +++ b/src/bin/pg_upgrade/t/005_char_signedness.pl @@ -40,6 +40,23 @@ command_like( qr/Default char data signedness:\s+unsigned/, 'updated default char signedness is unsigned in control file');+# Cannot use --set-char-signedness option for upgrading from v18+ +command_fails( + [ + 'pg_upgrade', '--no-sync', + '-d', $old->data_dir, + '-D', $new->data_dir, + '-b', $old->config_data('--bindir'), + '-B', $new->config_data('--bindir'), + '-s', $new->host, + '-p', $old->port, + '-P', $new->port, + '-set-char-signedness', 'signed',
This is missing a dash, which causes the command to fail, but for the
wrong reason. pg_uprade seems to print all its errors on stdout, which
I guess is why the test use plain command_fails() instead of
command_fails_like(). However, we have another function to deal with
this: command_checks_all(). Attached are patches that fix the above
test, and also convert the other command_fails() calls in the pg_upgrade
tests to test for specific messages.
- ilmari
Attachments:
0001-pg_upgrade-fix-set-char-signedness-failure-test.patchtext/x-diffDownload
From b80c653c6635096345ec453bfe9445a82b7ab049 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= <ilmari@ilmari.org>
Date: Tue, 25 Feb 2025 22:27:36 +0000
Subject: [PATCH 1/2] pg_upgrade: fix --set-char-signedness failure test
The test had the option misspelled with only one dash, which fails, but
not with the expected message. Since pg_upgrade gives its error messages
on stdout(?!), use command_checks_all() instead of command_fails_like().
---
src/bin/pg_upgrade/t/005_char_signedness.pl | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
index c024106863e..d186822ac77 100644
--- a/src/bin/pg_upgrade/t/005_char_signedness.pl
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -41,7 +41,7 @@ command_like(
'updated default char signedness is unsigned in control file');
# Cannot use --set-char-signedness option for upgrading from v18+
-command_fails(
+command_checks_all(
[
'pg_upgrade', '--no-sync',
'-d', $old->data_dir,
@@ -51,9 +51,12 @@ command_fails(
'-s', $new->host,
'-p', $old->port,
'-P', $new->port,
- '-set-char-signedness', 'signed',
+ '--set-char-signedness', 'signed',
$mode
],
+ 1,
+ [qr/--set-char-signedness option cannot be used/],
+ [],
'--set-char-signedness option cannot be used for upgrading from v18 or later'
);
--
2.43.0
0002-pg_upgrade-use-command_checks_all-to-check-for-error.patchtext/x-diffDownload
From 03d25df987029da7511ff5a6f4ce69d1aef16ea9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= <ilmari@ilmari.org>
Date: Tue, 25 Feb 2025 22:56:12 +0000
Subject: [PATCH 2/2] pg_upgrade: use command_checks_all() to check for errors
pg_upgrade prints its error messages on stdout, so we can't use
command_fails_like() to check that it fails for the right reason.
Instead, we have to use command_checks_all(), which this patch does.
---
src/bin/pg_upgrade/t/002_pg_upgrade.pl | 5 ++++-
src/bin/pg_upgrade/t/004_subscription.pl | 8 +++++++-
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 45ea94c84bb..cccba9dc3db 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -396,7 +396,7 @@ $oldnode->stop;
# Cause a failure at the start of pg_upgrade, this should create the logging
# directory pg_upgrade_output.d but leave it around. Keep --check for an
# early exit.
-command_fails(
+command_checks_all(
[
'pg_upgrade', '--no-sync',
'-d', $oldnode->data_dir,
@@ -408,6 +408,9 @@ command_fails(
'-P', $newnode->port,
$mode, '--check',
],
+ 1,
+ [qr{check for ".*?does.not.exist" failed:}],
+ [],
'run of pg_upgrade --check for new instance with incorrect binary path');
ok(-d $newnode->data_dir . "/pg_upgrade_output.d",
"pg_upgrade_output.d/ not removed after pg_upgrade failure");
diff --git a/src/bin/pg_upgrade/t/004_subscription.pl b/src/bin/pg_upgrade/t/004_subscription.pl
index 13773316e1d..3f9734f7b3f 100644
--- a/src/bin/pg_upgrade/t/004_subscription.pl
+++ b/src/bin/pg_upgrade/t/004_subscription.pl
@@ -130,7 +130,7 @@ $old_sub->safe_psql('postgres',
$old_sub->stop;
-command_fails(
+command_checks_all(
[
'pg_upgrade',
'--no-sync',
@@ -144,6 +144,12 @@ command_fails(
$mode,
'--check',
],
+ 1,
+ [
+ qr{\QYour installation contains subscriptions without origin},
+ qr{\Qrelations not in i (initialize) or r (ready) state},
+ ],
+ [],
'run of pg_upgrade --check for old instance with relation in \'d\' datasync(invalid) state and missing replication origin'
);
--
2.43.0
On Tue, Feb 25, 2025 at 3:03 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:
Hi,
While working on another round of the long option and fat comma style
cleanup, I noticed that the test for pg_upgrade --set-char-signedess
doesn't test what it's supposed to:Masahiko Sawada <sawada.mshk@gmail.com> writes:
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl index 05c3014a27d..c024106863e 100644 --- a/src/bin/pg_upgrade/t/005_char_signedness.pl +++ b/src/bin/pg_upgrade/t/005_char_signedness.pl @@ -40,6 +40,23 @@ command_like( qr/Default char data signedness:\s+unsigned/, 'updated default char signedness is unsigned in control file');+# Cannot use --set-char-signedness option for upgrading from v18+ +command_fails( + [ + 'pg_upgrade', '--no-sync', + '-d', $old->data_dir, + '-D', $new->data_dir, + '-b', $old->config_data('--bindir'), + '-B', $new->config_data('--bindir'), + '-s', $new->host, + '-p', $old->port, + '-P', $new->port, + '-set-char-signedness', 'signed',This is missing a dash, which causes the command to fail, but for the
wrong reason. pg_uprade seems to print all its errors on stdout, which
I guess is why the test use plain command_fails() instead of
command_fails_like(). However, we have another function to deal with
this: command_checks_all(). Attached are patches that fix the above
test, and also convert the other command_fails() calls in the pg_upgrade
tests to test for specific messages.
Thank you for the report. I agree with both points.
I believe that replacing command_fails_like() with
command_checks_all() is an improvement whereas adding the missing dash
to --set-char-signendess option is a bug fix. How about reorganizing
the patches as follows?
- 0001 patch just adds the dash to the --set-char-signedness.
- 0002 patch uses command_checks_all() in 002_pg_upgrade.pl,
004_subscription.pl, and 005_char_signedness.pl for checking the
output better.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Tue, Feb 25, 2025 at 3:03 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:Hi,
While working on another round of the long option and fat comma style
cleanup, I noticed that the test for pg_upgrade --set-char-signedess
doesn't test what it's supposed to:Masahiko Sawada <sawada.mshk@gmail.com> writes:
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl index 05c3014a27d..c024106863e 100644 --- a/src/bin/pg_upgrade/t/005_char_signedness.pl +++ b/src/bin/pg_upgrade/t/005_char_signedness.pl @@ -40,6 +40,23 @@ command_like( qr/Default char data signedness:\s+unsigned/, 'updated default char signedness is unsigned in control file');+# Cannot use --set-char-signedness option for upgrading from v18+ +command_fails( + [ + 'pg_upgrade', '--no-sync', + '-d', $old->data_dir, + '-D', $new->data_dir, + '-b', $old->config_data('--bindir'), + '-B', $new->config_data('--bindir'), + '-s', $new->host, + '-p', $old->port, + '-P', $new->port, + '-set-char-signedness', 'signed',This is missing a dash, which causes the command to fail, but for the
wrong reason. pg_uprade seems to print all its errors on stdout, which
I guess is why the test use plain command_fails() instead of
command_fails_like(). However, we have another function to deal with
this: command_checks_all(). Attached are patches that fix the above
test, and also convert the other command_fails() calls in the pg_upgrade
tests to test for specific messages.Thank you for the report. I agree with both points.
I believe that replacing command_fails_like() with
command_checks_all() is an improvement whereas adding the missing dash
to --set-char-signendess option is a bug fix. How about reorganizing
the patches as follows?- 0001 patch just adds the dash to the --set-char-signedness.
- 0002 patch uses command_checks_all() in 002_pg_upgrade.pl,
004_subscription.pl, and 005_char_signedness.pl for checking the
output better.
Yeah, that's a better way to organise it, updated patches attached.
- ilmari
Attachments:
0001-pg_upgrade-fix-spelling-of-set-char-signedness-in-te.patchtext/x-diffDownload
From 0671eb2a6f59926bf94b041a1e20b864ec370a25 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= <ilmari@ilmari.org>
Date: Wed, 26 Feb 2025 10:51:28 +0000
Subject: [PATCH 1/2] pg_upgrade: fix spelling of --set-char-signedness in test
The single dash causes it to fail with a different error, but because
the test isn't checking the error message it was passing anyway.
The next commit will make the test check for the actual error.
---
src/bin/pg_upgrade/t/005_char_signedness.pl | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
index c024106863e..b3092f03bb7 100644
--- a/src/bin/pg_upgrade/t/005_char_signedness.pl
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -51,7 +51,7 @@
'-s', $new->host,
'-p', $old->port,
'-P', $new->port,
- '-set-char-signedness', 'signed',
+ '--set-char-signedness', 'signed',
$mode
],
'--set-char-signedness option cannot be used for upgrading from v18 or later'
--
2.48.1
0002-pg_upgrade-check-for-the-expected-error-message-in-t.patchtext/x-diffDownload
From 371f4fafc93e8520b4cb49a657ca56131da3b3e4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= <ilmari@ilmari.org>
Date: Wed, 26 Feb 2025 10:53:40 +0000
Subject: [PATCH 2/2] pg_upgrade: check for the expected error message in tests
pg_upgrade prints its error messages on stdout, so we can't use
command_fails_like() to check that it fails for the right reason.
Instead use command_checks_all() to check the exit status and stdout.
---
src/bin/pg_upgrade/t/002_pg_upgrade.pl | 5 ++++-
src/bin/pg_upgrade/t/004_subscription.pl | 8 +++++++-
src/bin/pg_upgrade/t/005_char_signedness.pl | 5 ++++-
3 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 45ea94c84bb..cccba9dc3db 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -396,7 +396,7 @@ sub filter_dump
# Cause a failure at the start of pg_upgrade, this should create the logging
# directory pg_upgrade_output.d but leave it around. Keep --check for an
# early exit.
-command_fails(
+command_checks_all(
[
'pg_upgrade', '--no-sync',
'-d', $oldnode->data_dir,
@@ -408,6 +408,9 @@ sub filter_dump
'-P', $newnode->port,
$mode, '--check',
],
+ 1,
+ [qr{check for ".*?does.not.exist" failed:}],
+ [],
'run of pg_upgrade --check for new instance with incorrect binary path');
ok(-d $newnode->data_dir . "/pg_upgrade_output.d",
"pg_upgrade_output.d/ not removed after pg_upgrade failure");
diff --git a/src/bin/pg_upgrade/t/004_subscription.pl b/src/bin/pg_upgrade/t/004_subscription.pl
index 13773316e1d..3f9734f7b3f 100644
--- a/src/bin/pg_upgrade/t/004_subscription.pl
+++ b/src/bin/pg_upgrade/t/004_subscription.pl
@@ -130,7 +130,7 @@
$old_sub->stop;
-command_fails(
+command_checks_all(
[
'pg_upgrade',
'--no-sync',
@@ -144,6 +144,12 @@
$mode,
'--check',
],
+ 1,
+ [
+ qr{\QYour installation contains subscriptions without origin},
+ qr{\Qrelations not in i (initialize) or r (ready) state},
+ ],
+ [],
'run of pg_upgrade --check for old instance with relation in \'d\' datasync(invalid) state and missing replication origin'
);
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
index b3092f03bb7..d186822ac77 100644
--- a/src/bin/pg_upgrade/t/005_char_signedness.pl
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -41,7 +41,7 @@
'updated default char signedness is unsigned in control file');
# Cannot use --set-char-signedness option for upgrading from v18+
-command_fails(
+command_checks_all(
[
'pg_upgrade', '--no-sync',
'-d', $old->data_dir,
@@ -54,6 +54,9 @@
'--set-char-signedness', 'signed',
$mode
],
+ 1,
+ [qr/--set-char-signedness option cannot be used/],
+ [],
'--set-char-signedness option cannot be used for upgrading from v18 or later'
);
--
2.48.1
On Wed, Feb 26, 2025 at 3:12 AM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:
Masahiko Sawada <sawada.mshk@gmail.com> writes:
On Tue, Feb 25, 2025 at 3:03 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:Hi,
While working on another round of the long option and fat comma style
cleanup, I noticed that the test for pg_upgrade --set-char-signedess
doesn't test what it's supposed to:Masahiko Sawada <sawada.mshk@gmail.com> writes:
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl index 05c3014a27d..c024106863e 100644 --- a/src/bin/pg_upgrade/t/005_char_signedness.pl +++ b/src/bin/pg_upgrade/t/005_char_signedness.pl @@ -40,6 +40,23 @@ command_like( qr/Default char data signedness:\s+unsigned/, 'updated default char signedness is unsigned in control file');+# Cannot use --set-char-signedness option for upgrading from v18+ +command_fails( + [ + 'pg_upgrade', '--no-sync', + '-d', $old->data_dir, + '-D', $new->data_dir, + '-b', $old->config_data('--bindir'), + '-B', $new->config_data('--bindir'), + '-s', $new->host, + '-p', $old->port, + '-P', $new->port, + '-set-char-signedness', 'signed',This is missing a dash, which causes the command to fail, but for the
wrong reason. pg_uprade seems to print all its errors on stdout, which
I guess is why the test use plain command_fails() instead of
command_fails_like(). However, we have another function to deal with
this: command_checks_all(). Attached are patches that fix the above
test, and also convert the other command_fails() calls in the pg_upgrade
tests to test for specific messages.Thank you for the report. I agree with both points.
I believe that replacing command_fails_like() with
command_checks_all() is an improvement whereas adding the missing dash
to --set-char-signendess option is a bug fix. How about reorganizing
the patches as follows?- 0001 patch just adds the dash to the --set-char-signedness.
- 0002 patch uses command_checks_all() in 002_pg_upgrade.pl,
004_subscription.pl, and 005_char_signedness.pl for checking the
output better.Yeah, that's a better way to organise it, updated patches attached.
Thank you for the patches.
The 0001 patch already got committed (945a9e38). I've simplified the
tests and updated the commit message for the 0002 patch. I'm going to
push it, barring further comments.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-pg_upgrade-Check-for-the-expected-error-message-i.patchapplication/octet-stream; name=v2-0001-pg_upgrade-Check-for-the-expected-error-message-i.patchDownload
From 39b9d4c7349da5f6d68321aa11c890e10e87f10b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= <ilmari@ilmari.org>
Date: Wed, 26 Feb 2025 10:53:40 +0000
Subject: [PATCH v2] pg_upgrade: Check for the expected error message in TAP
tests.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Since pg_upgrade prints its error messages on stdout, we can't use
command_fails_like() to check if it fails for the right reason. This
commit uses command_checks_all() in pg_upgrade TAP tests to check the
exit status and stdout, enabling proper verification of error
reasons.
Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/87tt8h1vb7.fsf@wibble.ilmari.org
---
src/bin/pg_upgrade/t/002_pg_upgrade.pl | 5 ++++-
src/bin/pg_upgrade/t/004_subscription.pl | 7 ++++++-
src/bin/pg_upgrade/t/005_char_signedness.pl | 5 ++++-
3 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 45ea94c84bb..c00cf68d660 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -396,7 +396,7 @@ $oldnode->stop;
# Cause a failure at the start of pg_upgrade, this should create the logging
# directory pg_upgrade_output.d but leave it around. Keep --check for an
# early exit.
-command_fails(
+command_checks_all(
[
'pg_upgrade', '--no-sync',
'-d', $oldnode->data_dir,
@@ -408,6 +408,9 @@ command_fails(
'-P', $newnode->port,
$mode, '--check',
],
+ 1,
+ [qr{check for ".*?does/not/exist" failed}],
+ [],
'run of pg_upgrade --check for new instance with incorrect binary path');
ok(-d $newnode->data_dir . "/pg_upgrade_output.d",
"pg_upgrade_output.d/ not removed after pg_upgrade failure");
diff --git a/src/bin/pg_upgrade/t/004_subscription.pl b/src/bin/pg_upgrade/t/004_subscription.pl
index 13773316e1d..e3ccff9f270 100644
--- a/src/bin/pg_upgrade/t/004_subscription.pl
+++ b/src/bin/pg_upgrade/t/004_subscription.pl
@@ -130,7 +130,7 @@ $old_sub->safe_psql('postgres',
$old_sub->stop;
-command_fails(
+command_checks_all(
[
'pg_upgrade',
'--no-sync',
@@ -144,6 +144,11 @@ command_fails(
$mode,
'--check',
],
+ 1,
+ [
+ qr/\QYour installation contains subscriptions without origin or having relations not in i (initialize) or r (ready) state\E/
+ ],
+ [],
'run of pg_upgrade --check for old instance with relation in \'d\' datasync(invalid) state and missing replication origin'
);
diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
index b3092f03bb7..d186822ac77 100644
--- a/src/bin/pg_upgrade/t/005_char_signedness.pl
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -41,7 +41,7 @@ command_like(
'updated default char signedness is unsigned in control file');
# Cannot use --set-char-signedness option for upgrading from v18+
-command_fails(
+command_checks_all(
[
'pg_upgrade', '--no-sync',
'-d', $old->data_dir,
@@ -54,6 +54,9 @@ command_fails(
'--set-char-signedness', 'signed',
$mode
],
+ 1,
+ [qr/--set-char-signedness option cannot be used/],
+ [],
'--set-char-signedness option cannot be used for upgrading from v18 or later'
);
--
2.43.5
On 21.02.25 20:39, Masahiko Sawada wrote:
I have one question about the 0001 patch; since we add
'default_char_signedness' field to ControlFileData do we need to bump
PG_CONTROL_VERSION? We have comments about bumping PG_CONTROL_VERSION
when changing CheckPoint struct or DBState enum so it seems likely but
I'd like to confirm just in case that we need to bump
PG_CONTROL_VERSION also when changing ControlFileData.Yes. (I'm not aware of value we get from having distinct control file version
and catalog version, but we do have both.)If we need, can
we bump it to 1800? or 1701?I'd do 1800. The pattern seems to be to bump to 1800 for the first pg_control
change of the v18 cycle, then 1801, then 1802 for the third change of the
cycle. That's based on this history:git log -U0 -p src/include/catalog/pg_control.h | grep -E '^(Date|\+#define PG_CONTROL_VERSION)'
Thank you for the confirmation. That makes sense to me.
I'll push these patches with version bumps, barring any objections or
further comments.Pushed.
Is there a reason why the pg_controldata and pg_resetwal output are
"Default char *data* signedness", while the rest of the patch and
description just says "char signedness"? The word "data" doesn't mean
anything here, does it?
On Wed, Mar 19, 2025 at 2:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
On 21.02.25 20:39, Masahiko Sawada wrote:
I have one question about the 0001 patch; since we add
'default_char_signedness' field to ControlFileData do we need to bump
PG_CONTROL_VERSION? We have comments about bumping PG_CONTROL_VERSION
when changing CheckPoint struct or DBState enum so it seems likely but
I'd like to confirm just in case that we need to bump
PG_CONTROL_VERSION also when changing ControlFileData.Yes. (I'm not aware of value we get from having distinct control file version
and catalog version, but we do have both.)If we need, can
we bump it to 1800? or 1701?I'd do 1800. The pattern seems to be to bump to 1800 for the first pg_control
change of the v18 cycle, then 1801, then 1802 for the third change of the
cycle. That's based on this history:git log -U0 -p src/include/catalog/pg_control.h | grep -E '^(Date|\+#define PG_CONTROL_VERSION)'
Thank you for the confirmation. That makes sense to me.
I'll push these patches with version bumps, barring any objections or
further comments.Pushed.
Is there a reason why the pg_controldata and pg_resetwal output are
"Default char *data* signedness", while the rest of the patch and
description just says "char signedness"? The word "data" doesn't mean
anything here, does it?
I wanted to refer to the word "data" here to the relation data stored
as "char" type (without explicit signedness) into the database
cluster. But I admit some inconsistency. While pg_controldata outputs
"Default char data signedness", pg_control_init() SQL function outputs
it as default_char_signedness.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com