pg_trgm comparison bug on cross-architecture replication due to different char implementation

Started by Guo, Adamalmost 2 years ago60 messages
Jump to latest
#1Guo, Adam
adamguo@amazon.com

Hi all,

I would like to report an issue with the pg_trgm extension on cross-architecture replication scenarios. When an x86_64 standby server is replicating from an aarch64 primary server or vice versa, the gist_trgm_ops opclass returns different results on the primary and standby. Masahiko previously reported a similar issue affecting the pg_bigm extension [1]https://osdn.net/projects/pgbigm/lists/archive/hackers/2024-February/000370.html.

To reproduce, execute the following on the x86_64 primary server:

CREATE EXTENSION pg_trgm;
CREATE TABLE tbl (c text);
CREATE INDEX ON tbl USING gist (c gist_trgm_ops);
INSERT INTO tbl VALUES ('Bóbr');

On the x86_64 primary server:

postgres=> select * from tbl where c like '%Bób%';
c
------
Bóbr
(1 row)

On the aarch64 replica server:

postgres=> select * from tbl where c like '%Bób%';
c
---
(0 rows)

The root cause looks the same as the pg_bigm issue that Masahiko reported. To compare trigrams, pg_trgm uses a numerical comparison of chars [2]https://github.com/postgres/postgres/blob/480bc6e3ed3a5719cdec076d4943b119890e8171/contrib/pg_trgm/trgm.h#L45. On x86_64 a char is signed by default, whereas on aarch64 it is unsigned by default. gist_trgm_ops expects the trigram list to be sorted, but due to the different signedness of chars, the sort order is broken when replicating the values across architectures.

The different sort behaviour can be demonstrated using show_trgm.

On the x86_64 primary server:

postgres=> SELECT show_trgm('Bóbr');
show_trgm
------------------------------------------
{0x89194c," b","br ",0x707c72,0x7f7849}
(1 row)

On the aarch64 replica server:

postgres=> SELECT show_trgm('Bóbr');
show_trgm
------------------------------------------
{" b","br ",0x707c72,0x7f7849,0x89194c}
(1 row)

One simple solution for this specific case is to declare the char signedness in the CMPPCHAR macro.

--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -42,7 +42,7 @@
typedef char trgm[3];
 #define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i)  CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
+#define CMPPCHAR(a,b,i)  CMPCHAR( *(((unsigned char*)(a))+i), *(((unsigned char*)(b))+i) )
#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
 #define CPTRGM(a,b) do {                                                           \

Alternatively, Postgres can be compiled with -funsigned-char or -fsigned-char. I came across a code comment suggesting that this may not be a good idea in general [3]https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/cash.c#L114-L123.

Given that this has problem has come up before and seems likely to come up again, I'm curious what other broad solutions there might be to resolve it? Looking forward to any feedback, thanks!

Best,

Adam Guo
Amazon Web Services: https://aws.amazon.com

[1]: https://osdn.net/projects/pgbigm/lists/archive/hackers/2024-February/000370.html
[2]: https://github.com/postgres/postgres/blob/480bc6e3ed3a5719cdec076d4943b119890e8171/contrib/pg_trgm/trgm.h#L45
[3]: https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/cash.c#L114-L123

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Guo, Adam (#1)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

"Guo, Adam" <adamguo@amazon.com> writes:

I would like to report an issue with the pg_trgm extension on
cross-architecture replication scenarios. When an x86_64 standby
server is replicating from an aarch64 primary server or vice versa,
the gist_trgm_ops opclass returns different results on the primary
and standby.

I do not think that is a supported scenario. Hash functions and
suchlike are not guaranteed to produce the same results on different
CPU architectures. As a quick example, I get

regression=# select hashfloat8(34);
hashfloat8
------------
21570837
(1 row)

on x86_64 but

postgres=# select hashfloat8(34);
hashfloat8
------------
-602898821
(1 row)

on ppc32 thanks to the endianness difference.

Given that this has problem has come up before and seems likely to
come up again, I'm curious what other broad solutions there might be
to resolve it?

Reject as not a bug. Discourage people from thinking that physical
replication will work across architectures.

regards, tom lane

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#2)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Tue, Apr 23, 2024 at 11:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Guo, Adam" <adamguo@amazon.com> writes:

I would like to report an issue with the pg_trgm extension on
cross-architecture replication scenarios. When an x86_64 standby
server is replicating from an aarch64 primary server or vice versa,
the gist_trgm_ops opclass returns different results on the primary
and standby.

I do not think that is a supported scenario. Hash functions and
suchlike are not guaranteed to produce the same results on different
CPU architectures. As a quick example, I get

regression=# select hashfloat8(34);
hashfloat8
------------
21570837
(1 row)

on x86_64 but

postgres=# select hashfloat8(34);
hashfloat8
------------
-602898821
(1 row)

on ppc32 thanks to the endianness difference.

Given that this has problem has come up before and seems likely to
come up again, I'm curious what other broad solutions there might be
to resolve it?

Reject as not a bug. Discourage people from thinking that physical
replication will work across architectures.

While cross-arch physical replication is not supported, I think having
architecture dependent differences is not good and It's legitimate to
fix it. FYI the 'char' data type comparison is done as though char is
unsigned. I've attached a small patch to fix it. What do you think?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

fix_signedness_issue_in_pg_trgm.patchapplication/octet-stream; name=fix_signedness_issue_in_pg_trgm.patchDownload+1-1
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#3)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

Masahiko Sawada <sawada.mshk@gmail.com> writes:

On Tue, Apr 23, 2024 at 11:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Reject as not a bug. Discourage people from thinking that physical
replication will work across architectures.

While cross-arch physical replication is not supported, I think having
architecture dependent differences is not good and It's legitimate to
fix it. FYI the 'char' data type comparison is done as though char is
unsigned. I've attached a small patch to fix it. What do you think?

I think this will break existing indexes that are working fine.
Yeah, it would have been better to avoid the difference, but
it's too late now.

regards, tom lane

#5Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#4)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Tue, Apr 30, 2024 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Masahiko Sawada <sawada.mshk@gmail.com> writes:

On Tue, Apr 23, 2024 at 11:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Reject as not a bug. Discourage people from thinking that physical
replication will work across architectures.

While cross-arch physical replication is not supported, I think having
architecture dependent differences is not good and It's legitimate to
fix it. FYI the 'char' data type comparison is done as though char is
unsigned. I've attached a small patch to fix it. What do you think?

I think this will break existing indexes that are working fine.
Yeah, it would have been better to avoid the difference, but
it's too late now.

True. So it will be a PG18 item.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#5)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

Masahiko Sawada <sawada.mshk@gmail.com> writes:

On Tue, Apr 30, 2024 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think this will break existing indexes that are working fine.
Yeah, it would have been better to avoid the difference, but
it's too late now.

True. So it will be a PG18 item.

How will it be any better in v18? It's still an on-disk
compatibility break for affected platforms.

Now, people could recover by reindexing affected indexes,
but I think we need to have a better justification than this
for making them do so.

regards, tom lane

#7Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#2)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Tue, Apr 23, 2024 at 5:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Guo, Adam" <adamguo@amazon.com> writes:

I would like to report an issue with the pg_trgm extension on
cross-architecture replication scenarios. When an x86_64 standby
server is replicating from an aarch64 primary server or vice versa,
the gist_trgm_ops opclass returns different results on the primary
and standby.

I do not think that is a supported scenario. Hash functions and
suchlike are not guaranteed to produce the same results on different
CPU architectures. As a quick example, I get

regression=# select hashfloat8(34);
hashfloat8
------------
21570837
(1 row)

on x86_64 but

postgres=# select hashfloat8(34);
hashfloat8
------------
-602898821
(1 row)

on ppc32 thanks to the endianness difference.

Given this, should we try to do better with binary compatibility
checks using ControlFileData? AFAICS they are supposed to check if
the database cluster is binary compatible with the running
architecture. But it obviously allows incompatibilities.

------
Regards,
Alexander Korotkov

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#7)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

Alexander Korotkov <aekorotkov@gmail.com> writes:

Given this, should we try to do better with binary compatibility
checks using ControlFileData? AFAICS they are supposed to check if
the database cluster is binary compatible with the running
architecture. But it obviously allows incompatibilities.

Perhaps. pg_control already covers endianness, which I think
is the root of the hashing differences I showed. Adding a field
for char signedness feels a little weird, since it's not directly
a property of the bits-on-disk, but maybe we should.

regards, tom lane

#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#8)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Tue, Apr 30, 2024 at 7:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:

Given this, should we try to do better with binary compatibility
checks using ControlFileData? AFAICS they are supposed to check if
the database cluster is binary compatible with the running
architecture. But it obviously allows incompatibilities.

Perhaps. pg_control already covers endianness, which I think
is the root of the hashing differences I showed. Adding a field
for char signedness feels a little weird, since it's not directly
a property of the bits-on-disk, but maybe we should.

I agree that storing char signedness might seem weird. But it appears
that we already store indexes that depend on char signedness. So,
it's effectively property of bits-on-disk even though it affects
indirectly. Then I see two options to make the picture consistent.
1) Assume that char signedness is somehow a property of bits-on-disk
even though it's weird. Then pg_trgm indexes are correct, but we need
to store char signedness in pg_control.
2) Assume that char signedness is not a property of bits-on-disk.
Then pg_trgm indexes are buggy and need to be fixed.
What do you think?

------
Regards,
Alexander Korotkov

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#9)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

Alexander Korotkov <aekorotkov@gmail.com> writes:

I agree that storing char signedness might seem weird. But it appears
that we already store indexes that depend on char signedness. So,
it's effectively property of bits-on-disk even though it affects
indirectly. Then I see two options to make the picture consistent.
1) Assume that char signedness is somehow a property of bits-on-disk
even though it's weird. Then pg_trgm indexes are correct, but we need
to store char signedness in pg_control.
2) Assume that char signedness is not a property of bits-on-disk.
Then pg_trgm indexes are buggy and need to be fixed.
What do you think?

The problem with option (2) is the assumption that pg_trgm's behavior
is the only bug of this kind, either now or in the future. I think
that's just about an impossible standard to meet, because there's no
realistic way to test whether char signedness is affecting things.
(Sure, you can compare results across platforms, but maybe you
just didn't test the right case.)

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.
Option (2) ... not so much.

regards, tom lane

#11Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#10)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Wed, May 1, 2024 at 2:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:

I agree that storing char signedness might seem weird. But it appears
that we already store indexes that depend on char signedness. So,
it's effectively property of bits-on-disk even though it affects
indirectly. Then I see two options to make the picture consistent.
1) Assume that char signedness is somehow a property of bits-on-disk
even though it's weird. Then pg_trgm indexes are correct, but we need
to store char signedness in pg_control.
2) Assume that char signedness is not a property of bits-on-disk.
Then pg_trgm indexes are buggy and need to be fixed.
What do you think?

The problem with option (2) is the assumption that pg_trgm's behavior
is the only bug of this kind, either now or in the future. I think
that's just about an impossible standard to meet, because there's no
realistic way to test whether char signedness is affecting things.
(Sure, you can compare results across platforms, but maybe you
just didn't test the right case.)

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue.

I think that the char signedness issue is an issue also for developers
(and extension authors) since it could lead to confusion and potential
bugs in the future due to that. x86 developers would think of char as
always being signed and write code that will misbehave on arm
machines. For example, since logical replication should behave
correctly even in cross-arch replication all developers need to be
aware of that. I thought of using the -funsigned-char (or
-fsigned-char) compiler flag to avoid that but it would have a broader
impact not only on indexes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#12Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#10)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On 30.04.24 19:29, Tom Lane wrote:

1) Assume that char signedness is somehow a property of bits-on-disk
even though it's weird. Then pg_trgm indexes are correct, but we need
to store char signedness in pg_control.
2) Assume that char signedness is not a property of bits-on-disk.
Then pg_trgm indexes are buggy and need to be fixed.
What do you think?

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.

But note that option 1 would prevent some replication that is currently
working.

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#12)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

Peter Eisentraut <peter@eisentraut.org> writes:

On 30.04.24 19:29, Tom Lane wrote:

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.

But note that option 1 would prevent some replication that is currently
working.

The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.

Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.

If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.

regards, tom lane

#14Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#13)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On 03.05.24 16:13, Tom Lane wrote:

Peter Eisentraut <peter@eisentraut.org> writes:

On 30.04.24 19:29, Tom Lane wrote:

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.

But note that option 1 would prevent some replication that is currently
working.

The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.

Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.

If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.

Yes, that is understood. But anecdotally, replicating between x86-64
arm64 is occasionally used for upgrades or migrations. In practice,
this appears to have mostly worked. If we now discover that it won't
work with certain index extension modules, it's usable for most users.
Even if we say, you have to reindex everything afterwards, it's probably
still useful for these scenarios.

The way I understand the original report, the issue has to do
specifically with how signed and unsigned chars compare differently. I
don't imagine this is used anywhere in the table/heap code. So it's
plausible that this issue is really contained to indexes.

On the other hand, if we put in a check against this, then at least we
can answer any user questions about this with more certainty: No, won't
work, here is why.

#15Joe Conway
mail@joeconway.com
In reply to: Peter Eisentraut (#14)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On 5/3/24 11:44, Peter Eisentraut wrote:

On 03.05.24 16:13, Tom Lane wrote:

Peter Eisentraut <peter@eisentraut.org> writes:

On 30.04.24 19:29, Tom Lane wrote:

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm".  I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness.  That being the case, I don't want to invest a
lot of effort in the signedness issue.  Option (1) is clearly
a small change with little if any risk of future breakage.

But note that option 1 would prevent some replication that is currently
working.

The point of this thread though is that it's working only for small
values of "work".  People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.

Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.

If we do not do that, it seems to me we have to buy into making
it work.  That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.

Yes, that is understood.  But anecdotally, replicating between x86-64 arm64 is
occasionally used for upgrades or migrations.  In practice, this appears to have
mostly worked.  If we now discover that it won't work with certain index
extension modules, it's usable for most users. Even if we say, you have to
reindex everything afterwards, it's probably still useful for these scenarios.

+1

I have heard similar anecdotes, and the reported experience goes even further --
many such upgrade/migration uses, with exceedingly rare reported failures.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#16Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Joe Conway (#15)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Sat, May 4, 2024 at 7:36 AM Joe Conway <mail@joeconway.com> wrote:

On 5/3/24 11:44, Peter Eisentraut wrote:

On 03.05.24 16:13, Tom Lane wrote:

Peter Eisentraut <peter@eisentraut.org> writes:

On 30.04.24 19:29, Tom Lane wrote:

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.

But note that option 1 would prevent some replication that is currently
working.

The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.

Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.

If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.

Yes, that is understood. But anecdotally, replicating between x86-64 arm64 is
occasionally used for upgrades or migrations. In practice, this appears to have
mostly worked. If we now discover that it won't work with certain index
extension modules, it's usable for most users. Even if we say, you have to
reindex everything afterwards, it's probably still useful for these scenarios.

+1

+1

How about extending amcheck to support GIN and GIst indexes so that it
can detect potential data incompatibility due to changing 'char' to
'unsigned char'? I think these new tests would be useful also for
users to check if they really need to reindex indexes due to such
changes. Also we fix pg_trgm so that it uses 'unsigned char' in PG18.
Users who upgraded to PG18 can run the new amcheck tests on the
primary as well as the standby.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#17Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#16)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Wed, May 15, 2024 at 02:56:54PM +0900, Masahiko Sawada wrote:

On Sat, May 4, 2024 at 7:36 AM Joe Conway <mail@joeconway.com> wrote:

On 5/3/24 11:44, Peter Eisentraut wrote:

On 03.05.24 16:13, Tom Lane wrote:

Peter Eisentraut <peter@eisentraut.org> writes:

On 30.04.24 19:29, Tom Lane wrote:

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm". I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness. That being the case, I don't want to invest a
lot of effort in the signedness issue. Option (1) is clearly
a small change with little if any risk of future breakage.

But note that option 1 would prevent some replication that is currently
working.

The point of this thread though is that it's working only for small
values of "work". People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.

Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.

If we do not do that, it seems to me we have to buy into making
it work. That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.

Yes, that is understood. But anecdotally, replicating between x86-64 arm64 is
occasionally used for upgrades or migrations. In practice, this appears to have
mostly worked. If we now discover that it won't work with certain index
extension modules, it's usable for most users. Even if we say, you have to
reindex everything afterwards, it's probably still useful for these scenarios.

Like you, I would not introduce a ControlFileData block for sign-of-char,
given the signs of breakage in extension indexing only. That would lose much
useful migration capability. I'm sympathetic to Tom's point, which I'll
attempt to summarize as: a ControlFileData block is a promise we know how to
keep, whereas we should expect further bug discoveries without it. At the
same time, I put more weight on the apparently-wide span of functionality that
works fine. (I wonder whether any static analyzer does or practically could
detect char sign dependence with a decent false positive rate.)

+1

How about extending amcheck to support GIN and GIst indexes so that it
can detect potential data incompatibility due to changing 'char' to
'unsigned char'? I think these new tests would be useful also for
users to check if they really need to reindex indexes due to such
changes. Also we fix pg_trgm so that it uses 'unsigned char' in PG18.
Users who upgraded to PG18 can run the new amcheck tests on the
primary as well as the standby.

If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:

1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.
3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.

Independently, having amcheck for GIN and/or GiST would be great.

#18Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#17)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Sun, May 19, 2024 at 6:46 AM Noah Misch <noah@leadboat.com> wrote:

On Wed, May 15, 2024 at 02:56:54PM +0900, Masahiko Sawada wrote:

How about extending amcheck to support GIN and GIst indexes so that it
can detect potential data incompatibility due to changing 'char' to
'unsigned char'? I think these new tests would be useful also for
users to check if they really need to reindex indexes due to such
changes. Also we fix pg_trgm so that it uses 'unsigned char' in PG18.
Users who upgraded to PG18 can run the new amcheck tests on the
primary as well as the standby.

If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:

1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.

Even though it's true that signed char systems are the majority, it
would not be acceptable to force the need to scan pg_trgm indexes on
unsigned char systems.

3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.

Very interesting idea. How can new v18 users use the correct operator
class? I don't want to require users to specify the correct signed or
unsigned operator classes when creating a GIN index. Maybe we need to
dynamically use the correct compare function for the same operator
class depending on the char signedness. But is it possible to do it on
the extension (e.g. pg_trgm) side?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#19Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#18)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Thu, Aug 29, 2024 at 03:48:53PM -0500, Masahiko Sawada wrote:

On Sun, May 19, 2024 at 6:46 AM Noah Misch <noah@leadboat.com> wrote:

If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:

1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.

Even though it's true that signed char systems are the majority, it
would not be acceptable to force the need to scan pg_trgm indexes on
unsigned char systems.

3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.

Very interesting idea. How can new v18 users use the correct operator
class? I don't want to require users to specify the correct signed or
unsigned operator classes when creating a GIN index. Maybe we need to

In brief, it wouldn't matter which operator class new v18 indexes use. The
documentation would focus on gin_trgm_ops and also say something like:

There's an additional operator class, gin_trgm_ops_unsigned. It behaves
exactly like gin_trgm_ops, but it uses a deprecated on-disk representation.
Use gin_trgm_ops in new indexes, but there's no disadvantage from continuing
to use gin_trgm_ops_unsigned. Before PostgreSQL 18, gin_trgm_ops used a
platform-dependent representation. pg_upgrade automatically uses
gin_trgm_ops_unsigned when upgrading from source data that used the
deprecated representation.

What concerns might users have, then? (Neither operator class would use plain
"char" in a context that affects on-disk state. They'll use "signed char" and
"unsigned char".)

dynamically use the correct compare function for the same operator
class depending on the char signedness. But is it possible to do it on
the extension (e.g. pg_trgm) side?

No, I don't think the extension can do that cleanly. It would need to store
the signedness in the index somehow, and GIN doesn't call the opclass at a
time facilitating that. That would need help from the core server.

#20Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#19)
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

On Fri, Aug 30, 2024 at 8:10 PM Noah Misch <noah@leadboat.com> wrote:

On Thu, Aug 29, 2024 at 03:48:53PM -0500, Masahiko Sawada wrote:

On Sun, May 19, 2024 at 6:46 AM Noah Misch <noah@leadboat.com> wrote:

If I were standardizing pg_trgm on one or the other notion of "char", I would
choose signed char, since I think it's still the majority. More broadly, I
see these options to fix pg_trgm:

1. Change to signed char. Every arm64 system needs to scan pg_trgm indexes.
2. Change to unsigned char. Every x86 system needs to scan pg_trgm indexes.

Even though it's true that signed char systems are the majority, it
would not be acceptable to force the need to scan pg_trgm indexes on
unsigned char systems.

3. Offer both, as an upgrade path. For example, pg_trgm could have separate
operator classes gin_trgm_ops and gin_trgm_ops_unsigned. Running
pg_upgrade on an unsigned-char system would automatically map v17
gin_trgm_ops to v18 gin_trgm_ops_unsigned. This avoids penalizing any
architecture with upgrade-time scans.

Very interesting idea. How can new v18 users use the correct operator
class? I don't want to require users to specify the correct signed or
unsigned operator classes when creating a GIN index. Maybe we need to

In brief, it wouldn't matter which operator class new v18 indexes use. The
documentation would focus on gin_trgm_ops and also say something like:

There's an additional operator class, gin_trgm_ops_unsigned. It behaves
exactly like gin_trgm_ops, but it uses a deprecated on-disk representation.
Use gin_trgm_ops in new indexes, but there's no disadvantage from continuing
to use gin_trgm_ops_unsigned. Before PostgreSQL 18, gin_trgm_ops used a
platform-dependent representation. pg_upgrade automatically uses
gin_trgm_ops_unsigned when upgrading from source data that used the
deprecated representation.

What concerns might users have, then? (Neither operator class would use plain
"char" in a context that affects on-disk state. They'll use "signed char" and
"unsigned char".)

I think I understand your idea now. Since gin_trgm_ops will use
"signed char", there is no impact for v18 users -- they can continue
using gin_trgm_ops.

But how does pg_upgrade use gin_trgm_ops_unsigned? This opclass will
be created by executing the pg_trgm script for v18, but it isn't
executed during pg_upgrade. Another way would be to do these opclass
replacement when executing the pg_trgm's update script (i.e., 1.6 to
1.7).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#21Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#21)
#23Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#23)
#25Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#24)
#26Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#22)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#26)
#28Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#28)
#30Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#30)
#32Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#32)
#34Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#33)
#35Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#34)
#36Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#35)
#37Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#36)
#38Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#37)
#39Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#38)
#40Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#39)
#41Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#40)
#42Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#41)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#42)
#44Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#43)
#45Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#44)
#46Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#45)
#47Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#46)
#48Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#47)
#49Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#48)
#50Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#49)
#51Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#50)
#52Noah Misch
noah@leadboat.com
In reply to: Masahiko Sawada (#51)
#53Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Noah Misch (#52)
#54Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#53)
In reply to: Masahiko Sawada (#49)
#56Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Dagfinn Ilmari Mannsåker (#55)
In reply to: Masahiko Sawada (#56)
#58Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Dagfinn Ilmari Mannsåker (#57)
#59Peter Eisentraut
peter_e@gmx.net
In reply to: Masahiko Sawada (#54)
#60Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Eisentraut (#59)