Upgrading locale issues

Started by rihadabout 7 years ago12 messagesgeneral

rihad@mail.ru

about 7 years ago

Hi. Today we run pg_ctl promote on a slave server (10.7) and started
using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
11.2. And you guessed it, most varchar indexes got corrupted because
system local changed in subtle ways. So I created the extension amcheck
and reindexed all bad indexes one by one. Is there any way to prevent
such things in the future? Will switching to ICU fix all such issues?
The problem with it is that ICU collations are absent in pg_collation,
initdb should be run to create them, but pg_basebackup only runs on an
empty base directory, so I couldn't run initdb + pg_basebackup to
prepare the replica server. I believe I can run the create collation
command manually, but what would it look like for en-x-icu?

CREATE COLLATION "en-ix-icu" (provider = icu, locale = 'en-x-icu');

is that it? But what about version etc?

en-x-icu | 11 | 10 | i
| -1 | en | en | 153.88

Thanks.

Peter Geoghegan

pg@bowt.ie

about 7 years ago

In reply to: rihad (#1)

Re: Upgrading locale issues

On Mon, Apr 29, 2019 at 7:45 AM rihad <rihad@mail.ru> wrote:

Hi. Today we run pg_ctl promote on a slave server (10.7) and started
using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
11.2. And you guessed it, most varchar indexes got corrupted because
system local changed in subtle ways. So I created the extension amcheck
and reindexed all bad indexes one by one. Is there any way to prevent
such things in the future? Will switching to ICU fix all such issues?

Not necessarily, but it will detect the incompatibility more or less
automatically, making it far more likely that the problem will be
caught before it does any harm. ICU versions collations, giving
Postgres a way to reason about their compatibility over time. The libc
collations are not versioned, though (at least not in any standard way
that Postgres can take advantage of).

The problem with it is that ICU collations are absent in pg_collation,
initdb should be run to create them, but pg_basebackup only runs on an
empty base directory, so I couldn't run initdb + pg_basebackup to
prepare the replica server. I believe I can run the create collation
command manually, but what would it look like for en-x-icu?

It is safe to call pg_import_system_collations() directly, which is
all that initdb does. This is documented, so you wouldn't be relying
on a hack.

--
Peter Geoghegan

Thomas Munro

thomas.munro@gmail.com

about 7 years ago

In reply to: Peter Geoghegan (#2)

Re: Upgrading locale issues

On Thu, May 2, 2019 at 8:26 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Apr 29, 2019 at 7:45 AM rihad <rihad@mail.ru> wrote:

Hi. Today we run pg_ctl promote on a slave server (10.7) and started
using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
11.2. And you guessed it, most varchar indexes got corrupted because
system local changed in subtle ways. So I created the extension amcheck
and reindexed all bad indexes one by one. Is there any way to prevent
such things in the future? Will switching to ICU fix all such issues?

Not necessarily, but it will detect the incompatibility more or less
automatically, making it far more likely that the problem will be
caught before it does any harm. ICU versions collations, giving
Postgres a way to reason about their compatibility over time. The libc
collations are not versioned, though (at least not in any standard way
that Postgres can take advantage of).

As discussed over on -hackers[1]/messages/by-id/CAEepm=0uEQCpfq_+LYFBdArCe4Ot98t1aR4eYiYTe=yavQygiQ@mail.gmail.com, I think it's worth pursuing that
though. FWIW I've proposed locale versioning for FreeBSD's libc[2]https://reviews.freebsd.org/D17166.
The reason I haven't gone further with that yet even though the code
change has been accepted in principle by FreeBSD reviewers is because
I got stuck on the question of how exactly to model the versions. If,
say, just Turkish changes, I don't want to be rebuilding my French
indexes, which means that I don't think you can use the CLDR version
string. Frustratingly, you could probably do a good job of that by
just checksumming the collation definition files, but that seems a bit
too crude.

There is also the question of how PostgreSQL should model versions,
and as I've argued in [1]/messages/by-id/CAEepm=0uEQCpfq_+LYFBdArCe4Ot98t1aR4eYiYTe=yavQygiQ@mail.gmail.com, I think we should track them at the level
of database object dependencies.

I'm hoping to reopen this can of worms for PostgreSQL 13 (and the
corresponding support could in theory be in FreeBSD 13... coincidence,
or a sign!?)

The problem with it is that ICU collations are absent in pg_collation,
initdb should be run to create them, but pg_basebackup only runs on an
empty base directory, so I couldn't run initdb + pg_basebackup to
prepare the replica server. I believe I can run the create collation
command manually, but what would it look like for en-x-icu?

It is safe to call pg_import_system_collations() directly, which is
all that initdb does. This is documented, so you wouldn't be relying
on a hack.

Unfortunately you can't use ICU collations as a database default yet
(though there was some WIP code[3]/messages/by-id/3366.1498183854@sss.pgh.pa.us), so ICU only saves you from
versioning problems if you explicitly set collations for columns or
expressions, and even then the version tracking is currently just a
warning that you clear manually with a command, not a mechanism that
really tracks which database objects were last rebuilt/validated with
a given version.

[1]: /messages/by-id/CAEepm=0uEQCpfq_+LYFBdArCe4Ot98t1aR4eYiYTe=yavQygiQ@mail.gmail.com
[2]: https://reviews.freebsd.org/D17166
[3]: /messages/by-id/3366.1498183854@sss.pgh.pa.us

--
Thomas Munro
https://enterprisedb.com

rihad

rihad@mail.ru

about 7 years ago

In reply to: Peter Geoghegan (#2)

Re: Upgrading locale issues

On 05/02/2019 12:26 AM, Peter Geoghegan wrote:

On Mon, Apr 29, 2019 at 7:45 AM rihad <rihad@mail.ru> wrote:

Hi. Today we run pg_ctl promote on a slave server (10.7) and started
using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
11.2. And you guessed it, most varchar indexes got corrupted because
system local changed in subtle ways. So I created the extension amcheck
and reindexed all bad indexes one by one. Is there any way to prevent
such things in the future? Will switching to ICU fix all such issues?

Not necessarily, but it will detect the incompatibility more or less
automatically, making it far more likely that the problem will be
caught before it does any harm. ICU versions collations, giving
Postgres a way to reason about their compatibility over time. The libc
collations are not versioned, though (at least not in any standard way
that Postgres can take advantage of).

The problem with it is that ICU collations are absent in pg_collation,
initdb should be run to create them, but pg_basebackup only runs on an
empty base directory, so I couldn't run initdb + pg_basebackup to
prepare the replica server. I believe I can run the create collation
command manually, but what would it look like for en-x-icu?

It is safe to call pg_import_system_collations() directly, which is
all that initdb does. This is documented, so you wouldn't be relying
on a hack.

Thanks for the reply. Do you know what would a "decent" ICU collation be
to bind to a field's schema definition so it would mimic a UTF-8
encoding for a multilingual column? Maybe und-x-icu? We aren't as much
concerned about their sortability in most cases, we just want indexes to
better handle future PG/ICU upgrades. But what does und(efined) even
mean with respect to collations? With UTF-8 at least some default
collation is specified, like en_US.UTF-8. Will results be in a
completely undefined order as a result of ORDER BY "icu_und_column"?

Daniel Verite

daniel@manitou-mail.org

about 7 years ago

In reply to: rihad (#4)

Re: Upgrading locale issues

rihad wrote:

Thanks for the reply. Do you know what would a "decent" ICU collation be
to bind to a field's schema definition so it would mimic a UTF-8
encoding for a multilingual column? Maybe und-x-icu? We aren't as much
concerned about their sortability in most cases, we just want indexes to
better handle future PG/ICU upgrades. But what does und(efined) even
mean with respect to collations?

"undefined" in this context means unspecified language and
unspecified country or region. It implies that no language-specific
nor regional rule will be applied to compare strings.

Using C.UTF-8 as the collation for text fields to index may be the
best trade-off in your case. It should be immune to libc and ICU
upgrades.

With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation
and accents will also sort differently than with a linguistic-aware
collation.
If your applications care about that, it can be fixed by simply
adding COLLATE "default" to the ORDER BY clause of the queries that
are meant to present data to users.
COLLATE "default" means the collation of the database, which
presumably would be something like "language_REGION.UTF-8" in your
case. If you never specified it explicitly, it came from initdb which
itself got it from the environment of the server.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

rihad

rihad@mail.ru

about 7 years ago

In reply to: Daniel Verite (#5)

Re: Upgrading locale issues

On 05/02/2019 05:36 PM, Daniel Verite wrote:

rihad wrote:

Thanks for the reply. Do you know what would a "decent" ICU collation be
to bind to a field's schema definition so it would mimic a UTF-8
encoding for a multilingual column? Maybe und-x-icu? We aren't as much
concerned about their sortability in most cases, we just want indexes to
better handle future PG/ICU upgrades. But what does und(efined) even
mean with respect to collations?

"undefined" in this context means unspecified language and
unspecified country or region. It implies that no language-specific
nor regional rule will be applied to compare strings.

Using C.UTF-8 as the collation for text fields to index may be the
best trade-off in your case. It should be immune to libc and ICU
upgrades.

With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation
and accents will also sort differently than with a linguistic-aware
collation.

Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
things: for speed and for skipping the ORDER BY step (since btree
indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
still work correctly for table lookups? And can the existingï¿œ
en_US.UTF-8 fields' definition be altered in place, without a
dump+restore? en_US.UTF-8 is the default encoding+locale+collation, it
isn't set explicitly for any of our string columns. I assume there's
some "catch-all" ordering taking place even for the C locale, so there
won't be any bizarre things like b coming before a, or generally for any
language, the second letter of its alphabet coming before the first?

Show quoted text

If your applications care about that, it can be fixed by simply
adding COLLATE "default" to the ORDER BY clause of the queries that
are meant to present data to users.
COLLATE "default" means the collation of the database, which
presumably would be something like "language_REGION.UTF-8" in your
case. If you never specified it explicitly, it came from initdb which
itself got it from the environment of the server.

Best regards,

Daniel Verite

daniel@manitou-mail.org

about 7 years ago

In reply to: rihad (#6)

Re: Upgrading locale issues

rihad wrote:

Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
things: for speed and for skipping the ORDER BY step (since btree
indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
still work correctly for table lookups?

If the lookup is based on a equality test or a pattern match with LIKE
or a regexp, it makes no difference. But the locale makes a
difference with inequality tests, such as < > or BETWEEN.

Around version 9.1 and in the pre-ICU days, Robert Haas wrote
this post that explained it pretty well, I think:
http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

Quote:

If you happen to need the particular sorting behavior that
collation-aware sorting and comparisons provide, then you may find
this price worth paying, but I suspect there are a lot of people out
there who are paying it more or less accidentally and don't really
care very much about the underlying sorting behavior. If, for
example, all of your queries are based on equality, and you don't
use greater-than or less-than tests, then it doesn't matter what
collation is in use. You might as well use "C" instead of whatever
your local default may be, because it's faster.

For non-English text, I would recommend C.UTF-8 over "C" because of
its better handling of Unicode characters. For instance:

=# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
upper | upper
-------+-------
éTé | ÉTÉ

The "price" of linguistic comparisons that Robert mentioned was about
performance, but the troubles we have with the lack of long-term
immutability of these locales are worth being added to that.

And can the existing en_US.UTF-8 fields' definition be altered in
place, without a dump+restore?

Changing the COLLATE clause of a column with ALTER TABLE does
implicitly rebuild an index on this column if there is one,
A dump+restore is not needed, nor an explicit REINDEX.

The dump+restore is needed in another scenario, where you would
decide to change the LC_COLLATE and LC_CTYPE of the database,
instead of doing it only for some columns.
This scenario makes perfect sense if the locale of the database
has been set implicitly and it uses linguistic sorts without
really needing them ("accidentally" as said in the post).

en_US.UTF-8 is the default encoding+locale+collation, it
isn't set explicitly for any of our string columns. I assume there's
some "catch-all" ordering taking place even for the C locale, so there
won't be any bizarre things like b coming before a, or generally for any
language, the second letter of its alphabet coming before the first?

'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
true for some locales such as C or C.UTF-8.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

rihad

rihad@mail.ru

about 7 years ago

In reply to: Daniel Verite (#7)

Re: Upgrading locale issues

On 05/03/2019 05:35 PM, Daniel Verite wrote:

rihad wrote:

Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
things: for speed and for skipping the ORDER BY step (since btree
indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
still work correctly for table lookups?

If the lookup is based on a equality test or a pattern match with LIKE
or a regexp, it makes no difference. But the locale makes a
difference with inequality tests, such as < > or BETWEEN.

Around version 9.1 and in the pre-ICU days, Robert Haas wrote
this post that explained it pretty well, I think:
http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

Quote:

If you happen to need the particular sorting behavior that
collation-aware sorting and comparisons provide, then you may find
this price worth paying, but I suspect there are a lot of people out
there who are paying it more or less accidentally and don't really
care very much about the underlying sorting behavior. If, for
example, all of your queries are based on equality, and you don't
use greater-than or less-than tests, then it doesn't matter what
collation is in use. You might as well use "C" instead of whatever
your local default may be, because it's faster.

For non-English text, I would recommend C.UTF-8 over "C" because of
its better handling of Unicode characters. For instance:

=# select upper('ï¿œtï¿œ' collate "C"), upper('ï¿œtï¿œ' collate "C.UTF-8");
upper | upper
-------+-------
ï¿œTï¿œ | ï¿œTï¿œ

The "price" of linguistic comparisons that Robert mentioned was about
performance, but the troubles we have with the lack of long-term
immutability of these locales are worth being added to that.

And can the existing en_US.UTF-8 fields' definition be altered in
place, without a dump+restore?

Changing the COLLATE clause of a column with ALTER TABLE does
implicitly rebuild an index on this column if there is one,
A dump+restore is not needed, nor an explicit REINDEX.

The dump+restore is needed in another scenario, where you would
decide to change the LC_COLLATE and LC_CTYPE of the database,
instead of doing it only for some columns.
This scenario makes perfect sense if the locale of the database
has been set implicitly and it uses linguistic sorts without
really needing them ("accidentally" as said in the post).

en_US.UTF-8 is the default encoding+locale+collation, it
isn't set explicitly for any of our string columns. I assume there's
some "catch-all" ordering taking place even for the C locale, so there
won't be any bizarre things like b coming before a, or generally for any
language, the second letter of its alphabet coming before the first?

'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
true for some locales such as C or C.UTF-8.

Best regards,

Thanks a lot for sharing your insights!

rihad

rihad@mail.ru

about 7 years ago

In reply to: Daniel Verite (#7)

Re: Upgrading locale issues

On 05/03/2019 05:35 PM, Daniel Verite wrote:

For non-English text, I would recommend C.UTF-8 over "C" because of

BTW, there's no C.UTF-8 inside pg_collation, and running select
pg_import_system_collations('pg_catalog') doesn't bring it in, at least
not on FreeBSD 11.2.

#10

Daniel Verite

daniel@manitou-mail.org

about 7 years ago

In reply to: rihad (#9)

Re: Upgrading locale issues

rihad wrote:

On 05/03/2019 05:35 PM, Daniel Verite wrote:

For non-English text, I would recommend C.UTF-8 over "C" because of

BTW, there's no C.UTF-8 inside pg_collation, and running select
pg_import_system_collations('pg_catalog') doesn't bring it in, at least
not on FreeBSD 11.2.

Yes, aside from "C", locales are quite system-dependent, unfortunately.

It looks like FreeBSD 13 does provide C.UTF-8:

https://unix.stackexchange.com/questions/485073/how-to-backport-freebsd-13-current-c-utf-8-locale-to-11-2-release/485077#485077

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#11

Peter Geoghegan

pg@bowt.ie

about 7 years ago

In reply to: Thomas Munro (#3)

Re: Upgrading locale issues

On Wed, May 1, 2019 at 3:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:

As discussed over on -hackers[1], I think it's worth pursuing that
though. FWIW I've proposed locale versioning for FreeBSD's libc[2].
The reason I haven't gone further with that yet even though the code
change has been accepted in principle by FreeBSD reviewers is because
I got stuck on the question of how exactly to model the versions. If,
say, just Turkish changes, I don't want to be rebuilding my French
indexes, which means that I don't think you can use the CLDR version
string.

The ICU versions can handle that, though. Importantly, ICU decouples
implementation details from actual versioning.

I must say that I am not enthused about the idea of trying to get libc
people of any variety on board. I don't have an objection to it if it
can work for FreeBSD, but I don't think it can scale. ICU is built
around a culture that takes our concerns seriously already, which is
what it boils down to. Also, we can imagine a package manager taking
it upon themselves to vendor their own ICU, with platform-specific
guarantees around stability. That seems like a nice option to have, at
least.

There is also the question of how PostgreSQL should model versions,
and as I've argued in [1], I think we should track them at the level
of database object dependencies.

I think you're probably right about that.

I'm hoping to reopen this can of worms for PostgreSQL 13 (and the
corresponding support could in theory be in FreeBSD 13... coincidence,
or a sign!?)

Maybe we should do what Oracle did, and call it PostgreSQL 18c
instead. Actually, any number that isn't of interest to numerologists
will do.

Unfortunately you can't use ICU collations as a database default yet
(though there was some WIP code[3]), so ICU only saves you from
versioning problems if you explicitly set collations for columns or
expressions, and even then the version tracking is currently just a
warning that you clear manually with a command, not a mechanism that
really tracks which database objects were last rebuilt/validated with
a given version.

Yes, that does seem like a big remaining weakness.

--
Peter Geoghegan

#12

Peter Eisentraut

peter_e@gmx.net

about 7 years ago

In reply to: Daniel Verite (#7)

Re: Upgrading locale issues

On 2019-05-03 15:35, Daniel Verite wrote:

'b' < 'a' is never true for any locale AFAIK,

But there is 'bb' < 'aa' in Danish. :-)

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services