initdb initialization failure for collation "ja_JP"
Building the latest 10 beta1 or head source on Cygwin,
make check fails as:
---------------------initdb.log -----------------------------
The database cluster will be initialized with locales
COLLATE: en_US.UTF-8
CTYPE: en_US.UTF-8
MESSAGES: C
MONETARY: en_US.UTF-8
NUMERIC: en_US.UTF-8
TIME: en_US.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory
/cygdrive/e/cyg_pub/devel/postgresql/prova/postgresql-10.0-0.1.x86_64/build/src/test/regress/./tmp_check/data
... ok
creating subdirectories ... ok
selecting default max_connections ... 30
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... sysv
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
CEST [16860] FATAL: collation "ja_JP" for encoding "EUC_JP" already exists
-------------------------------------
This does not happen on 9.6.x.
Any idea where to look?
Regards
Marco
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Marco Atzeri <marco.atzeri@gmail.com> writes:
Building the latest 10 beta1 or head source on Cygwin,
make check fails as:
...
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
CEST [16860] FATAL: collation "ja_JP" for encoding "EUC_JP" already exists
Hmph. Could we see the results of "locale -a | grep ja_JP" ?
regards, tom lane
I wrote:
Marco Atzeri <marco.atzeri@gmail.com> writes:
Building the latest 10 beta1 or head source on Cygwin,
make check fails as:
...
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
CEST [16860] FATAL: collation "ja_JP" for encoding "EUC_JP" already exists
Hmph. Could we see the results of "locale -a | grep ja_JP" ?
Despite the lack of followup from the OP, I'm pretty troubled by this
report. It shows that the reimplementation of OS collation data import
as pg_import_system_collations() is a whole lot more fragile than the
original coding. We have never before trusted "locale -a" to not produce
duplicate outputs, not since the very beginning in 414c5a2e. AFAICS,
the current coding has also lost the protections we added very shortly
after that in 853c1750f; and it has also lost the admittedly rather
arbitrary, but at least deterministic, preference order for conflicting
short aliases that was in the original initdb code.
I suppose the idea was to see whether we actually needed those defenses,
but since we have here a failure report after less than a month of beta,
it seems clear to me that we do. I think we need to upgrade
pg_import_system_collations to have all the same logic that was there
before.
Now the hard part of that is that because pg_import_system_collations
isn't using a temporary staging table, but is just inserting directly
into pg_collation, there isn't any way for it to eliminate duplicates
unless it uses if_not_exists behavior all the time. So there seem to
be two ways to proceed:
1. Drop pg_import_system_collations' if_not_exists argument and just
define it as adding any collations not already known in pg_collation.
2. Significantly rewrite it so that it de-dups the collation set by
hand before trying to insert into pg_collation.
#2 seems like a lot more work, but on the other hand, we might need
most of that logic anyway to get back deterministic alias handling.
However, since I cannot see any real-world use case at all for
if_not_exists = false, I figure we might as well do #1 and take
whatever simplification we can get that way.
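A minimal sketch of approach #1's semantics (illustrative Python, not the actual C code in pg_import_system_collations): every insert behaves as if_not_exists, so a duplicate "locale -a" line is skipped quietly instead of raising "collation ... already exists".

```python
# Illustrative sketch of approach #1: inserts always use if_not_exists
# semantics, so duplicate "locale -a" entries are silently skipped.
def import_collations(locale_list, pg_collation):
    added = 0
    for name in locale_list:
        if name in pg_collation:   # already present: skip quietly, no error
            continue
        pg_collation.add(name)
        added += 1
    return added
```

With Cygwin's duplicated listing, the second "ja_JP" is simply ignored rather than aborting initdb.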
I'm willing to do the legwork on this, but before I start, does
anyone have any ideas or objections?
regards, tom lane
On 18/06/2017 16:48, Tom Lane wrote:
Marco Atzeri <marco.atzeri@gmail.com> writes:
Building the latest 10 beta1 or head source on Cygwin,
make check fails as:
...
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
CEST [16860] FATAL: collation "ja_JP" for encoding "EUC_JP" already exists
Hmph. Could we see the results of "locale -a | grep ja_JP" ?
regards, tom lane
$ locale -a |grep -i ja
ja_JP
ja_JP
ja_JP.utf8
ja_JP.ujis
ja_JP@cjknarrow
ja_JP.utf8@cjknarrow
japanese
japanese.euc
japanese.sjis
$ locale -a |grep -i "euc"
japanese.euc
korean.euc
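(The listing above contains "ja_JP" twice; an illustrative sketch of why a plain insert loop trips over it — the second copy is a duplicate, and inserting it verbatim is exactly the FATAL from initdb.log:)

```python
# Illustrative: Cygwin's "locale -a" output, as shown above, lists
# "ja_JP" twice; a verbatim insert loop fails on the second copy.
locales = ["ja_JP", "ja_JP", "ja_JP.utf8", "ja_JP.ujis",
           "ja_JP@cjknarrow", "ja_JP.utf8@cjknarrow"]
seen, dups = set(), []
for name in locales:
    if name in seen:
        dups.append(name)   # this copy would collide in pg_collation
    else:
        seen.add(name)
```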
Regards
Marco
On 20/06/2017 17:37, Tom Lane wrote:
I wrote:
Marco Atzeri <marco.atzeri@gmail.com> writes:
Building the latest 10 beta1 or head source on Cygwin,
make check fails as:
...
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
CEST [16860] FATAL: collation "ja_JP" for encoding "EUC_JP" already exists
Hmph. Could we see the results of "locale -a | grep ja_JP" ?
Despite the lack of followup from the OP, I'm pretty troubled by this
report. It shows that the reimplementation of OS collation data import
as pg_import_system_collations() is a whole lot more fragile than the
original coding. We have never before trusted "locale -a" to not produce
duplicate outputs, not since the very beginning in 414c5a2e. AFAICS,
the current coding has also lost the protections we added very shortly
after that in 853c1750f; and it has also lost the admittedly rather
arbitrary, but at least deterministic, preference order for conflicting
short aliases that was in the original initdb code.
Hi Tom,
I raised the duplication issue on the cygwin mailing list,
and one of the core developer reports that
they saw the same issues on Linux in the past.
https://cygwin.com/ml/cygwin/2017-06/msg00253.html
Regards
Marco
I wrote:
Now the hard part of that is that because pg_import_system_collations
isn't using a temporary staging table, but is just inserting directly
into pg_collation, there isn't any way for it to eliminate duplicates
unless it uses if_not_exists behavior all the time. So there seem to
be two ways to proceed:
1. Drop pg_import_system_collations' if_not_exists argument and just
define it as adding any collations not already known in pg_collation.
2. Significantly rewrite it so that it de-dups the collation set by
hand before trying to insert into pg_collation.
After further thought, I realized that there's another argument for
making pg_import_system_collations() always do if-not-exists, which
is that as it stands it's inconsistent anyway because it silently
uses if-not-exists behavior for aliases.
So attached is a draft patch that drops if_not_exists and tweaks the
alias insertion code to guarantee deterministic behavior. I also
corrected failure to use CommandCounterIncrement() in the ICU part
of the code, which would cause problems if ICU can ever report
duplicate names; something I don't especially care to assume is
impossible. Also, I fixed things so that pg_import_system_collations()
doesn't emit any chatter about duplicate existing names, because it
didn't take long to realize that that behavior was useless and
irritating. (Perhaps we should change the function to return the
number of entries successfully added? But I didn't do that here.)
I've tested this with a faked version of "locale -a" that generates
duplicated entries, and it does what I think it should.
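A sketch of the deterministic alias handling described above (illustrative Python, not the patch itself): when several full locale names compete for the same short alias such as "ja_JP", always take the first in sorted order, so the result never depends on the order "locale -a" happens to emit.

```python
# Illustrative sketch: pick a deterministic winner for each short alias,
# independent of the order in which "locale -a" reports the full names.
def choose_aliases(locale_names):
    by_alias = {}
    for name in sorted(locale_names):
        alias = name.split(".")[0].split("@")[0]  # strip encoding/modifier
        by_alias.setdefault(alias, name)          # first sorted name wins
    return by_alias
```

Reordering the input leaves the chosen alias targets unchanged, which is the property the original initdb code had and the new code lost.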
One question that I've got is why the ICU portion refuses to load
any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
Surely this is completely wrong? I should think that what we load into
pg_collation ought to be independent of template1's encoding, the same
as it is for libc collations, and the right place to be making a test
like that is where somebody attempts to use an ICU collation. But
I've not tried to test it.
Also, I'm a bit tempted to remove setup_collation()'s manual insertion of
'ucs_basic' in favor of adding a DATA() line for that to pg_collation.h.
As things stand, if for some reason "locale -a" were to report a locale
named that, initdb would fail. In the old code, it would not have failed
--- it's uncertain whether you would have gotten the forced UTF8/C
definition or libc's definition, but you would have gotten only one.
However, that approach would result in 'ucs_basic' being treated as
pinned, which perhaps we don't want. If we don't want it to be pinned,
another idea is just to make setup_collation() insert it first not last,
thereby ensuring that the SQL definition wins over any libc entries.
Comments?
regards, tom lane
Attachments:
pg_import_system_collations-fixes.patch (text/x-diff, +170/-122)
I wrote:
One question that I've got is why the ICU portion refuses to load
any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
Surely this is completely wrong? I should think that what we load into
pg_collation ought to be independent of template1's encoding, the same
as it is for libc collations, and the right place to be making a test
like that is where somebody attempts to use an ICU collation. But
I've not tried to test it.
So I did test that, and found out the presumable reason why that's there:
icu_from_uchar() falls over if the database encoding is unsupported, and
we use that to convert ICU "display names" for use as comments for the
ICU collations. But that's not very much less wrongheaded, because it will
allow non-ASCII characters into the initial database contents, which is
absolutely not acceptable. We assume we can bit-copy the contents of
template0 and it will be valid in any encoding.
Therefore, I think the right thing to do is remove that test and change
get_icu_locale_comment() so that it rejects non-ASCII text, making the
encoding conversion trivial, as in the attached patch.
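The idea can be sketched like so (illustrative Python; the real code works on UChar strings in C): keep an ICU display name as a collation comment only if it is pure ASCII, so that template0's contents remain valid when bit-copied into a database of any encoding.

```python
# Illustrative sketch: accept an ICU display name as a comment only if it
# is pure ASCII; otherwise the collation simply gets no comment.
def comment_if_ascii(display_name):
    if all(ord(ch) < 128 for ch in display_name):
        return display_name
    return None   # e.g. a non-ASCII name like "norsk bokmål" is dropped
```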
On my Fedora 25 laptop, the only collations that go without a comment
in this approach are the "nb" ones (Norwegian Bokmål). As I recall,
that locale is a second-class citizen for other reasons already,
precisely because of its loony insistence on a non-ASCII name even
when we're asking for an Anglicized version.
I'm inclined to add a test to reject non-ASCII in the ICU locale names as
well as the comments. We've had to do that for libc locale names, and
this experience shows that the ICU locale maintainers don't have their
heads screwed on any straighter. But this patch doesn't do that.
regards, tom lane
Attachments:
fix-ICU-collation-setup.patch (text/x-diff, +35/-41)
I wrote:
One question that I've got is why the ICU portion refuses to load
any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
Surely this is completely wrong? I should think that what we load into
pg_collation ought to be independent of template1's encoding, the same
as it is for libc collations, and the right place to be making a test
like that is where somebody attempts to use an ICU collation.
Pursuant to the second part of that: I checked on what happens if you
try to use an ICU collation in a database with a not-supported-by-ICU
encoding. We have to cope with that scenario even with the current
(broken IMO) initdb behavior, because even if template1 has a supported
encoding, it's possible to create another database that doesn't.
It does fail more or less cleanly; you get an "encoding "foo" not
supported by ICU" message at runtime (out of get_encoding_name_for_icu).
But that's quite a bit unlike the behavior for libc collations: with
those, you get an error in collation name lookup, along the lines of
collation "en_DK.utf8" for encoding "SQL_ASCII" does not exist
The attached proposed patch makes the behavior for ICU collations the
same, by dint of injecting the is_encoding_supported_by_icu() check
into collation name lookup.
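A hedged sketch of that behavior (names and the supported-encoding set are illustrative, not PostgreSQL's actual API): ICU collation lookup fails up front, in the same style as the libc "does not exist" error, when the database encoding is not ICU-supported.

```python
# Illustrative sketch: reject ICU collations at name-lookup time when the
# database encoding isn't ICU-supported, mirroring the libc error style.
ICU_SUPPORTED = {"UTF8"}   # placeholder set for the sketch

def lookup_collation(name, provider, db_encoding):
    if provider == "icu" and db_encoding not in ICU_SUPPORTED:
        raise LookupError('collation "%s" for encoding "%s" does not exist'
                          % (name, db_encoding))
    return name
```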
regards, tom lane