Pre-proposal: unicode normalized text

Started by Jeff Davis, over 2 years ago · 77 messages · pgsql-hackers
#1 Jeff Davis
pgsql@j-davis.com

One of the frustrations with using the "C" locale (or any deterministic
locale) is that the following returns false:

SELECT 'á' = 'á'; -- false

because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
respectively, so memcmp() returns non-zero. But it's really the same
character with just a different representation, and if you normalize
them they are equal:

SELECT normalize('á') = normalize('á'); -- true

The idea is to have a new data type, say "UTEXT", that normalizes the
input so that it can have an improved notion of equality while still
using memcmp().
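A rough Python sketch of the idea (illustrative only: `UText` is a hypothetical wrapper, and the stdlib unicodedata module stands in for the Postgres Unicode tables):

```python
import unicodedata

class UText:
    """Hypothetical text wrapper that normalizes to NFC on input,
    so that equality reduces to a bytewise (memcmp-style) compare."""

    def __init__(self, s: str):
        # Normalize once, at ingestion time.
        self.data = unicodedata.normalize("NFC", s)

    def __eq__(self, other: "UText") -> bool:
        # Both sides are already NFC, so comparing the raw UTF-8
        # bytes is enough -- no per-comparison normalization.
        return self.data.encode("utf-8") == other.data.encode("utf-8")

# U&'\0061\0301' and U&'\00E1' compare equal once normalized:
assert UText("\u0061\u0301") == UText("\u00e1")
```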

Unicode guarantees that "the results of normalizing a string on one
version will always be the same as normalizing it on any other version,
as long as the string contains only assigned characters according to
both versions"[1]. It also guarantees that it "will not reallocate,
remove, or reassign" characters[2]. That means that we can normalize in
a forward-compatible way as long as we don't allow the use of
unassigned code points.
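Such a check can be sketched with Python's stdlib unicodedata, whose tables are likewise pinned to a single Unicode version (the function name here is made up for illustration):

```python
import unicodedata

def contains_unassigned(s: str) -> bool:
    """True if any code point has general category Cn ("unassigned")
    in the Unicode version this Python's tables were generated from."""
    return any(unicodedata.category(ch) == "Cn" for ch in s)

assert not contains_unassigned("h\u00e9llo")   # all assigned
assert contains_unassigned("\u0378")           # U+0378 is unassigned
```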

I looked at the standard to see what it had to say, and it discusses
normalization, but a standard UCS string with an unassigned code point
is not an error. Without a data type to enforce the constraint that
there are no unassigned code points, we can't guarantee forward
compatibility. Some other systems support NVARCHAR, but I didn't see
any guarantee of normalization or blocking unassigned code points
there, either.

UTEXT benefits:
* slightly better natural language semantics than TEXT with
deterministic collation
* still deterministic=true
* fast memcmp()-based comparisons
* no breaking semantic changes as unicode evolves

TEXT allows unassigned code points, and generally returns the same byte
sequences that were originally entered; therefore UTEXT is not a
replacement for TEXT.

UTEXT could be built-in or it could be an extension or in contrib. If
an extension, we'd probably want to at least expose a function that can
detect unassigned code points, so that it's easy to be consistent with
the auto-generated unicode tables. I also notice that there already is
an unassigned code points table in saslprep.c, but it seems to be
frozen as of Unicode 3.2, and I'm not sure why.

Questions:

* Would this be useful enough to justify a new data type? Would it be
confusing about when to choose one versus the other?
* Would cross-type comparisons between TEXT and UTEXT become a major
problem that would reduce the utility?
* Should "some_utext_value = some_text_value" coerce the LHS to TEXT
or the RHS to UTEXT?
* Other comments or am I missing something?

Regards,
Jeff Davis

[1]: https://unicode.org/reports/tr15/
[2]: https://www.unicode.org/policies/stability_policy.html

#2 Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#1)
Re: Pre-proposal: unicode normalized text

On 13.09.23 00:47, Jeff Davis wrote:

The idea is to have a new data type, say "UTEXT", that normalizes the
input so that it can have an improved notion of equality while still
using memcmp().

I think a new type like this would obviously be suboptimal because it's
nonstandard and most people wouldn't use it.

I think a better direction here would be to work toward making
nondeterministic collations usable on the global/database level and then
encouraging users to use those.

It's also not clear which way the performance tradeoffs would fall.

Nondeterministic collations are obviously going to be slower, but by how
much? People have accepted moving from C locale to "real" locales
because they needed those semantics. Would it be any worse moving from
real locales to "even realer" locales?

On the other hand, a utext type would either require a large set of its
own functions and operators, or you would have to inject text-to-utext
casts in places, which would also introduce overhead.

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#2)
Re: Pre-proposal: unicode normalized text

On Mon, Oct 2, 2023 at 3:42 PM Peter Eisentraut <peter@eisentraut.org> wrote:

I think a better direction here would be to work toward making
nondeterministic collations usable on the global/database level and then
encouraging users to use those.

It seems to me that this overlooks one of the major points of Jeff's
proposal, which is that we don't reject text input that contains
unassigned code points. That decision turns out to be really painful.
Here, Jeff mentions normalization, but I think it's a major issue with
collation support. If new code points are added, users can put them
into the database before they are known to the collation library, and
then when they become known to the collation library the sort order
changes and indexes break. Would we endorse a proposal to make
pg_catalog.text with encoding UTF-8 reject code points that aren't yet
known to the collation library? To do so would tighten things up
considerably from where they stand today, and the way things stand
today is already rigid enough to cause problems for some users. But if
we're not willing to do that then I find it easy to understand why
Jeff wants an alternative type that does.

Now, there is still the question of whether such a data type would
properly belong in core or even contrib rather than being an
out-of-core project. It's not obvious to me that such a data type
would get enough traction that we'd want it to be part of PostgreSQL
itself. But at the same time I can certainly understand why Jeff finds
the status quo problematic.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#1)
Re: Pre-proposal: unicode normalized text

On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:

One of the frustrations with using the "C" locale (or any deterministic
locale) is that the following returns false:

SELECT 'á' = 'á'; -- false

because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
respectively, so memcmp() returns non-zero. But it's really the same
character with just a different representation, and if you normalize
them they are equal:

SELECT normalize('á') = normalize('á'); -- true

I think you misunderstand Unicode normalization and equivalence. There
is no standard Unicode `normalize()` that would cause the above equality
predicate to be true. If you normalize to NFD (normal form decomposed)
then a _prefix_ of those two strings will be equal, but that's clearly
not what you're looking for.

PostgreSQL already has Unicode normalization support, though it would be
nice to also have form-insensitive indexing and equality predicates.

There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints in this specific
case), and it would be nice to be able to preserve input form when
storing strings but then still be able to index and match them
form-insensitively (in the case of 'á' both equivalent representations
should be considered equal, and for UNIQUE indexes they should be
considered the same).

You could also have functions that perform lossy normalization in the
sort of way that soundex does, such as first normalizing to NFD then
dropping all combining codepoints which then could allow 'á' to be eq to
'a'. But this would not be a Unicode normalization function.
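Such a lossy fold might look like this in Python (a sketch of the idea just described, not a standard Unicode operation: decompose to NFD, then drop combining marks):

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Lossy fold: NFD-decompose, then drop combining marks (Mn).
    Not a Unicode normalization form -- 'á' folds all the way to 'a'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

assert strip_marks("\u00e1") == "a"        # 'á' -> 'a'
assert strip_marks("caf\u00e9") == "cafe"
```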

Nico
--

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Nico Williams (#4)
Re: Pre-proposal: unicode normalized text

On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:

I think you misunderstand Unicode normalization and equivalence. There
is no standard Unicode `normalize()` that would cause the above
equality predicate to be true. If you normalize to NFD (normal form
decomposed) then a _prefix_ of those two strings will be equal, but
that's clearly not what you're looking for.

From [1]:

"Unicode Normalization Forms are formally defined normalizations of
Unicode strings which make it possible to determine whether any two
Unicode strings are equivalent to each other. Depending on the
particular Unicode Normalization Form, that equivalence can either be a
canonical equivalence or a compatibility equivalence... A binary
comparison of the transformed strings will then determine equivalence."

NFC and NFD are based on Canonical Equivalence.

"Canonical equivalence is a fundamental equivalency between characters
or sequences of characters which represent the same abstract character,
and which when correctly displayed should always have the same visual
appearance and behavior."

Can you explain why NFC (the default form of normalization used by the
postgres normalize() function), followed by memcmp(), is not the right
thing to use to determine Canonical Equivalence?

Or are you saying that Canonical Equivalence is not a useful thing to
test?

What do you mean about the "prefix"?

In Postgres today:

SELECT normalize(U&'\0061\0301', nfc)::bytea; -- \xc3a1
SELECT normalize(U&'\00E1', nfc)::bytea; -- \xc3a1

SELECT normalize(U&'\0061\0301', nfd)::bytea; -- \x61cc81
SELECT normalize(U&'\00E1', nfd)::bytea; -- \x61cc81

which looks useful to me, but I assume you are saying that it doesn't
generalize well to other cases?
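The same behavior can be reproduced outside the database with Python's stdlib unicodedata (an independent illustration, not the Postgres code path):

```python
import unicodedata

def nfc_hex(s: str) -> str:
    return unicodedata.normalize("NFC", s).encode("utf-8").hex()

def nfd_hex(s: str) -> str:
    return unicodedata.normalize("NFD", s).encode("utf-8").hex()

# Both spellings collapse to identical bytes under either form:
assert nfc_hex("\u0061\u0301") == nfc_hex("\u00e1") == "c3a1"
assert nfd_hex("\u0061\u0301") == nfd_hex("\u00e1") == "61cc81"
```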

[1]: https://unicode.org/reports/tr15/

There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints in this
specific case), and it would be nice to be able to preserve input
form when storing strings but then still be able to index and match
them form-insensitively (in the case of 'á' both equivalent
representations should be considered equal, and for UNIQUE indexes
they should be considered the same).

Sometimes preserving input differences is a good thing, other times
it's not, depending on the context. Almost any data type has some
aspects of the input that might not be preserved -- leading zeros in a
number, or whitespace in jsonb, etc.

If text is stored as normalized with NFC, it could be frustrating if
the retrieved string has a different binary representation than the
source data. But it could also be frustrating to look at two strings
made up of ordinary characters that look identical and for the database
to consider them unequal.

Regards,
Jeff Davis

#6 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#3)
Re: Pre-proposal: unicode normalized text

On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:

It seems to me that this overlooks one of the major points of Jeff's
proposal, which is that we don't reject text input that contains
unassigned code points. That decision turns out to be really painful.

Yeah, because we lose forward-compatibility of some useful operations.

Here, Jeff mentions normalization, but I think it's a major issue
with collation support. If new code points are added, users can put
them into the database before they are known to the collation
library, and then when they become known to the collation library the
sort order changes and indexes break.

The collation version number may reflect the change in understanding
about assigned code points that may affect collation -- though I'd like
to understand whether this is guaranteed or not.

Regardless, given that (a) we don't have a good story for migrating to
new collation versions; and (b) it would be painful to rebuild indexes
even if we did; then you are right that it's a problem.

Would we endorse a proposal to make pg_catalog.text with encoding
UTF-8 reject code points that aren't yet known to the collation
library? To do so would tighten things up considerably from where
they stand today, and the way things stand today is already rigid
enough to cause problems for some users.

What problems exist today due to the rigidity of text?

I assume you mean because we reject invalid byte sequences? Yeah, I'm
sure that causes a problem for some (especially migrations), but it's
difficult for me to imagine a database working well with no rules at
all for the basic data types.

Now, there is still the question of whether such a data type would
properly belong in core or even contrib rather than being an
out-of-core project. It's not obvious to me that such a data type
would get enough traction that we'd want it to be part of PostgreSQL
itself.

At minimum I think we need to have some internal functions to check for
unassigned code points. That belongs in core, because we generate the
unicode tables from a specific version.

I also think we should expose some SQL functions to check for
unassigned code points. That sounds useful, especially since we already
expose normalization functions.

One could easily imagine a domain with CHECK(NOT
contains_unassigned(a)). Or an extension with a data type that uses the
internal functions.

Whether we ever get to a core data type -- and more importantly,
whether anyone uses it -- I'm not sure.

But at the same time I can certainly understand why Jeff finds
the status quo problematic.

Yeah, I am looking for a better compromise between:

* everything is memcmp() and 'á' sometimes doesn't equal 'á'
(depending on code point sequence)
* everything is constantly changing, indexes break, and text
comparisons are slow

A stable idea of unicode normalization based on using only assigned
code points is very tempting.

Regards,
Jeff Davis

#7 Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#5)
Re: Pre-proposal: unicode normalized text

On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote:

On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:

I think you misunderstand Unicode normalization and equivalence. 
There is no standard Unicode `normalize()` that would cause the
above equality predicate to be true.  If you normalize to NFD
(normal form decomposed) then a _prefix_ of those two strings will
be equal, but that's clearly not what you're looking for.

Ugh, my client is not displaying 'á' correctly, thus I misunderstood
your post.

From [1]:

Here's what you wrote in your post:

| [...] But it's really the same
| character with just a different representation, and if you normalize
| them they are equal:
|
| SELECT normalize('á') = normalize('á'); -- true

but my client is not displaying 'á' correctly! (It displays like 'a'
but it should display like 'á'.)

Bah. So I'd (mis)interpreted you as saying that normalize('a') should
equal normalize('á'). Please disregard that part of my reply.

There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints in this
specific case), and it would be nice to be able to preserve input
form when storing strings but then still be able to index and match
them form-insensitively (in the case of 'á' both equivalent
representations should be considered equal, and for UNIQUE indexes
they should be considered the same).

Sometimes preserving input differences is a good thing, other times
it's not, depending on the context. Almost any data type has some
aspects of the input that might not be preserved -- leading zeros in a
number, or whitespace in jsonb, etc.

Almost every Latin input mode out there produces precomposed characters
and so they effectively produce NFC. I'm not sure if the same is true
for, e.g., Hangul (Korean) and various other scripts.

But there are things out there that produce NFD. Famously Apple's HFS+
uses NFD (or something very close to NFD). So if you cut-n-paste things
that got normalized to NFD and paste them into contexts where
normalization isn't done, then you might start wanting to alter those
contexts to either normalize or be form-preserving/form-insensitive.
Sometimes you don't get to normalize, so you have to pick form-
preserving/form-insensitive behavior.

If text is stored as normalized with NFC, it could be frustrating if
the retrieved string has a different binary representation than the
source data. But it could also be frustrating to look at two strings
made up of ordinary characters that look identical and for the database
to consider them unequal.

Exactly. If you have such a case you might like the option to make your
database form-preserving and form-insensitive. That means that indices
need to normalize strings, but tables need to store unnormalized
strings.

ZFS (filesystems are a bit like databases) does just that!

Nico
--

#8 Jeff Davis
pgsql@j-davis.com
In reply to: Nico Williams (#7)
Re: Pre-proposal: unicode normalized text

On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:

Ugh, my client is not displaying 'á' correctly

Ugh. Is that an argument in favor of normalization or against?

I've also noticed that some fonts render the same character a bit
differently depending on the constituent code points. For instance, if
the accent is its own code point, it seems to be more prominent than if
a single code point represents both the base character and the accent.
That seems to be a violation (canonically equivalent sequences should
render the same), but I can understand why that might be useful.

Almost every Latin input mode out there produces precomposed
characters and so they effectively produce NFC.

The problem is not the normal case, the problem will be things like
obscure input methods, some kind of software that's being too clever,
or some kind of malicious user trying to confuse the database.

That means that indices
need to normalize strings, but tables need to store unnormalized
strings.

That's an interesting idea. Would the equality operator normalize
first, or are you saying that the index would need to recheck the
results?

Regards,
Jeff Davis

#9 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#2)
Re: Pre-proposal: unicode normalized text

On Mon, 2023-10-02 at 10:47 +0200, Peter Eisentraut wrote:

I think a better direction here would be to work toward making
nondeterministic collations usable on the global/database level and
then encouraging users to use those.

It's also not clear which way the performance tradeoffs would fall.

Nondeterministic collations are obviously going to be slower, but by
how much?  People have accepted moving from C locale to "real"
locales because they needed those semantics.  Would it be any worse
moving from real locales to "even realer" locales?

If you normalize first, then you can get some semantic improvements
without giving up on the stability and performance of memcmp(). That
seems like a win with zero costs in terms of stability or performance
(except perhaps some extra text->utext casts).

Going to a "real" locale gives more semantic benefits but at a very
high cost: depending on a collation provider library, dealing with
collation changes, and performance costs. While supporting the use of
nondeterministic collations at the database level may be a good idea,
it's not helping to reach the compromise that I'm trying to reach in
this thread.

Regards,
Jeff Davis

#10 Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#8)
Re: Pre-proposal: unicode normalized text

On Tue, Oct 03, 2023 at 03:34:44PM -0700, Jeff Davis wrote:

On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:

Ugh, my client is not displaying 'á' correctly

Ugh. Is that an argument in favor of normalization or against?

Heheh, well, it's an argument in favor of more software getting this
right (darn it).

It's also an argument for building a time machine so HFS+ can just
always have used NFC. But the existence of UTF-16 is proof that time
machines don't exist (or that only bad actors have them).

I've also noticed that some fonts render the same character a bit
differently depending on the constituent code points. For instance, if
the accent is its own code point, it seems to be more prominent than if
a single code point represents both the base character and the accent.
That seems to be a violation, but I can understand why that might be
useful.

Yes, that happens. Did you know that the ASCII character set was
designed with overstrike in mind for typing of accented Latin
characters? Unicode combining sequences are kinda like that, but more
complex.

Yes, the idea really was that you could write a<BS>' (or '<BS>a) to get 'á'.
That's how people did it with typewriters anyways.

Almost every Latin input mode out there produces precomposed
characters and so they effectively produce NFC.

The problem is not the normal case, the problem will be things like
obscure input methods, some kind of software that's being too clever,
or some kind of malicious user trying to confuse the database.

_HFS+ enters the chat_

That means that indices
need to normalize strings, but tables need to store unnormalized
strings.

That's an interesting idea. Would the equality operator normalize
first, or are you saying that the index would need to recheck the
results?

You can optimize this to avoid having to normalize first. Most strings
are not equal, and they tend to differ early. And most strings will
likely be ASCII-mostly or in the same form anyways. So you can just
walk a cursor down each string looking at two bytes, and if they are
both ASCII then you move each cursor forward by one byte, and if they
are not both ASCII then you take a slow path where you normalize one
grapheme cluster at each cursor (if necessary) and compare that. (ZFS
does this.)
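A much-simplified Python sketch of that fast path (walking code points rather than bytes, and normalizing the whole tail instead of one grapheme cluster at a time, so this only gestures at the real scheme):

```python
import unicodedata

def fi_equal(a: str, b: str) -> bool:
    """Form-insensitive equality with a cheap common-prefix fast path."""
    i = 0
    # Fast path: advance while both sides hold the same ASCII character.
    # ASCII characters are starters and unchanged by normalization, so
    # the strings can only diverge (canonically) at or after index i.
    while i < len(a) and i < len(b):
        if a[i] == b[i] and ord(a[i]) < 128:
            i += 1
        else:
            break
    # Slow path: normalize only the remaining tails and compare.
    return (unicodedata.normalize("NFC", a[i:]) ==
            unicodedata.normalize("NFC", b[i:]))

assert fi_equal("caf\u00e9", "cafe\u0301")   # NFC vs NFD spelling
assert not fi_equal("abc", "abd")
```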

You can also assume ASCII-mostly, load as many bits of each string
(padding as needed) as will fit in SIMD registers, compare and check
that they're all ASCII, and if not then jump to the slow path.

You can also normalize one grapheme cluster at a time when hashing
(e.g., for hash indices), thus avoiding a large allocation if the string
is large.

Nico
--

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#6)
Re: Pre-proposal: unicode normalized text

On Tue, Oct 3, 2023 at 3:54 PM Jeff Davis <pgsql@j-davis.com> wrote:

I assume you mean because we reject invalid byte sequences? Yeah, I'm
sure that causes a problem for some (especially migrations), but it's
difficult for me to imagine a database working well with no rules at
all for the basic data types.

There's a very popular commercial database where, or so I have been
led to believe, any byte sequence at all is accepted when you try to
put values into the database. The rumors I've heard -- I have not
played with it myself -- are that when you try to do anything, byte
sequences that are not valid in the configured encoding are treated as
single-byte characters or something of that sort. So like if you had
UTF-8 as the encoding and the first byte of the string is something
that can only appear as a continuation byte in UTF-8, I think that
byte is just treated as a separate character. I don't quite know how
you make all of the operations work that way, but it seems like
they've come up with a somewhat-consistent set of principles that are
applied across the board. Very different from the PG philosophy, of
course. And I'm not saying it's better. But it does eliminate the
problem of being unable to load data into the database, because in
such a model there's no such thing as invalidly-encoded data. Instead,
an encoding like UTF-8 is effectively extended so that every byte
sequence represents *something*. Whether that something is what you
wanted is another story.
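For a flavor of "every byte sequence represents something", Python's surrogateescape error handler behaves analogously (an analogy only, not a claim about what that database actually does): invalid bytes are smuggled through as lone surrogates and round-trip back unchanged.

```python
# 0xE9 followed by a space, and a bare 0x80, are invalid as UTF-8 ...
raw = b"caf\xe9 \x80"

# ... yet decoding never fails: each bad byte becomes a lone surrogate.
s = raw.decode("utf-8", errors="surrogateescape")

# And the original bytes are recoverable exactly.
assert s.encode("utf-8", errors="surrogateescape") == raw
```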

At any rate, if we were to go in the direction of rejecting code
points that aren't yet assigned, or aren't yet known to the collation
library, that's another way for data loading to fail. Which feels like
very defensible behavior, but not what everyone wants, or is used to.

At minimum I think we need to have some internal functions to check for
unassigned code points. That belongs in core, because we generate the
unicode tables from a specific version.

That's a good idea.

I also think we should expose some SQL functions to check for
unassigned code points. That sounds useful, especially since we already
expose normalization functions.

That's a good idea, too.

One could easily imagine a domain with CHECK(NOT
contains_unassigned(a)). Or an extension with a data type that uses the
internal functions.

Yeah.

Whether we ever get to a core data type -- and more importantly,
whether anyone uses it -- I'm not sure.

Same here.

Yeah, I am looking for a better compromise between:

* everything is memcmp() and 'á' sometimes doesn't equal 'á'
(depending on code point sequence)
* everything is constantly changing, indexes break, and text
comparisons are slow

A stable idea of unicode normalization based on using only assigned
code points is very tempting.

The fact that there are multiple types of normalization and multiple
notions of equality doesn't make this easier.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12 Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#1)
Re: Pre-proposal: unicode normalized text

On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:

The idea is to have a new data type, say "UTEXT", that normalizes the
input so that it can have an improved notion of equality while still
using memcmp().

A UTEXT type would be helpful for specifying that the text must be
Unicode (in which transform?) even if the character data encoding for
the database is not UTF-8.

UTF8 might be a better name for the new type, since it would
denote the transform (and would allow for UTF16 and UTF32 some day,
though it's doubtful those would ever happen).

But it's one thing to specify Unicode (and transform) in the type and
another to specify an NF to normalize to on insert or on lookup.

How about new column constraint keywords, such as NORMALIZE (meaning
normalize on insert) and NORMALIZED (meaning reject non-canonical form
text), with an optional parenthetical by which to specify a non-default
form? (These would apply to TEXT as well when the default encoding for
the DB is UTF-8.)

One could then ALTER TABLE to add this to existing tables.

This would also make it easier to add a form-preserving/form-insensitive
mode later if it turns out to be useful or necessary, maybe making it
the default for Unicode text in new tables.

Questions:

* Would this be useful enough to justify a new data type? Would it be
confusing about when to choose one versus the other?

Yes. See above. I think I'd rather have it be called UTF8, and the
normalization properties of it to be specified as column constraints.

* Would cross-type comparisons between TEXT and UTEXT become a major
problem that would reduce the utility?

Maybe when the database's encoding is UTF8 then UTEXT (or UTF8) can
be an alias of TEXT.

* Should "some_utext_value = some_text_value" coerce the LHS to TEXT
or the RHS to UTEXT?

Ooh, this is nice! If the TEXT is _not_ UTF-8 then it could be
converted to UTF-8. So I think which is RHS and which is LHS doesn't
matter -- it's which is UTF-8, and if both are then the only thing left
to do is normalize, and for that I'd take the LHS' form if the LHS is
UTF-8, else the RHS'.

Nico
--

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Nico Williams (#12)
Re: Pre-proposal: unicode normalized text

On Wed, Oct 4, 2023 at 1:27 PM Nico Williams <nico@cryptonector.com> wrote:

A UTEXT type would be helpful for specifying that the text must be
Unicode (in which transform?) even if the character data encoding for
the database is not UTF-8.

That's actually pretty thorny ... because right now client_encoding
specifies the encoding to be used for all data sent to the client. So
would we convert the data from UTF8 to the selected client encoding?
Or what?

--
Robert Haas
EDB: http://www.enterprisedb.com

#14 Chapman Flack
chap@anastigmatix.net
In reply to: Robert Haas (#13)
Re: Pre-proposal: unicode normalized text

On 2023-10-04 13:47, Robert Haas wrote:

On Wed, Oct 4, 2023 at 1:27 PM Nico Williams <nico@cryptonector.com>
wrote:

A UTEXT type would be helpful for specifying that the text must be
Unicode (in which transform?) even if the character data encoding for
the database is not UTF-8.

That's actually pretty thorny ... because right now client_encoding
specifies the encoding to be used for all data sent to the client. So
would we convert the data from UTF8 to the selected client encoding?

The SQL standard would have me able to:

CREATE TABLE foo (
a CHARACTER VARYING CHARACTER SET UTF8,
b CHARACTER VARYING CHARACTER SET LATIN1
)

and so on, and write character literals like

_UTF8'Hello, world!' and _LATIN1'Hello, world!'

and have those columns and data types independently contain what
they can contain, without constraints imposed by one overall
database encoding.

Obviously, we're far from being able to do that. But should it
become desirable to get closer, would it be worthwhile to also
try to follow how the standard would have it look?

Clearly, part of the job would involve making the wire protocol
able to transmit binary values and identify their encodings.

Regards,
-Chap

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Chapman Flack (#14)
Re: Pre-proposal: unicode normalized text

On Wed, Oct 4, 2023 at 2:02 PM Chapman Flack <chap@anastigmatix.net> wrote:

Clearly, part of the job would involve making the wire protocol
able to transmit binary values and identify their encodings.

Right. Which unfortunately is moving the goal posts into the
stratosphere compared to any other work mentioned so far. I agree it
would be great. But not if you want concrete progress any time soon.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16 Isaac Morland
isaac.morland@gmail.com
In reply to: Chapman Flack (#14)
Re: Pre-proposal: unicode normalized text

On Wed, 4 Oct 2023 at 14:05, Chapman Flack <chap@anastigmatix.net> wrote:

On 2023-10-04 13:47, Robert Haas wrote:

The SQL standard would have me able to:

CREATE TABLE foo (
a CHARACTER VARYING CHARACTER SET UTF8,
b CHARACTER VARYING CHARACTER SET LATIN1
)

and so on, and write character literals like

_UTF8'Hello, world!' and _LATIN1'Hello, world!'

and have those columns and data types independently contain what
they can contain, without constraints imposed by one overall
database encoding.

Obviously, we're far from being able to do that. But should it
become desirable to get closer, would it be worthwhile to also
try to follow how the standard would have it look?

Clearly, part of the job would involve making the wire protocol
able to transmit binary values and identify their encodings.

I would go in the other direction (note: I’m ignoring all backward
compatibility considerations related to the current design of Postgres).

Always store only UTF-8 in the database, and send only UTF-8 on the wire
protocol. If we still want to have a concept of "client encoding", have the
client libpq take care of translating the bytes between the bytes used by
the caller and the bytes sent on the wire.

Note that you could still define columns as you say, but the character set
specification would effectively act simply as a CHECK constraint on the
characters allowed, essentially CHECK (column_name ~ '^[...all characters
in encoding...]*$'). We don't allow different on-disk representations of
dates or other data types; except when we really need to, and then we have
multiple data types (e.g. int vs. float) rather than different ways of
storing the same datatype.

What about characters not in UTF-8? If a character is important enough for
us to worry about in Postgres, it’s important enough to get a U+ number
from the Unicode Consortium, which automatically puts it in UTF-8. In the
modern context, "plain text" means "UTF-8 encoded text", as far as I'm
concerned.

#17 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#11)
Re: Pre-proposal: unicode normalized text

On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:

any byte sequence at all is accepted when you try to
put values into the database.

We support SQL_ASCII, which allows something similar.

At any rate, if we were to go in the direction of rejecting code
points that aren't yet assigned, or aren't yet known to the collation
library, that's another way for data loading to fail.

A failure during data loading is either a feature or a bug, depending
on whether you are the one loading the data or the one trying to make
sense of it later ;-)

Which feels like
very defensible behavior, but not what everyone wants, or is used to.

Yeah, there are many reasons someone might want to accept unassigned
code points. An obvious one is if their application is on a newer
version of Unicode where the code point *is* assigned.
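Whether a code point is assigned can be checked from the Unicode
Character Database; unassigned code points carry the general category
"Cn". A small sketch using Python's unicodedata module (U+0378 is
unassigned as of current Unicode versions, though that could change, which
is exactly the version-skew problem being discussed):

```python
import unicodedata

def is_assigned(ch: str) -> bool:
    # Unassigned code points have general category 'Cn' in the UCD.
    return unicodedata.category(ch) != "Cn"

print(unicodedata.unidata_version)  # Unicode version this check is relative to
print(is_assigned("a"))             # True
print(is_assigned("\u0378"))        # False today: reserved in the Greek block
```

A database built against one Unicode version could reject text that a
client built against a newer version considers perfectly valid.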

The fact that there are multiple types of normalization and multiple
notions of equality doesn't make this easier.

NFC is really the only one that makes sense.

NFD is semantically the same as NFC, but expanded into a larger
representation. NFKC/NFKD are based on a more relaxed notion of
equality -- kind of like non-deterministic collations. These other
forms might make sense in certain cases, but not general use.

I believe that having a kind of text data type where it's stored in NFC
and compared with memcmp() would be a good place for many users to be --
probably most users. It's got all the performance and stability
benefits of memcmp(), with slightly richer semantics. It's less likely
that someone malicious can confuse the database by using different
representations of the same character.
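The store-as-NFC, compare-with-memcmp() behavior can be demonstrated with
Python's unicodedata module (a sketch of the semantics, not of any
proposed Postgres implementation):

```python
import unicodedata

decomposed = "a\u0301"    # 'a' + COMBINING ACUTE ACCENT
precomposed = "\u00e1"    # LATIN SMALL LETTER A WITH ACUTE

# Raw byte comparison -- what memcmp() sees -- says they differ ...
print(decomposed.encode("utf-8") == precomposed.encode("utf-8"))  # False

# ... but after NFC normalization the byte sequences are identical.
nfc_a = unicodedata.normalize("NFC", decomposed).encode("utf-8")
nfc_b = unicodedata.normalize("NFC", precomposed).encode("utf-8")
print(nfc_a == nfc_b)  # True

# NFKC applies the looser "compatibility" equivalence: the ligature
# 'ﬁ' (U+FB01) is untouched by NFC but folds to "fi" under NFKC.
print(unicodedata.normalize("NFC", "\ufb01"))   # 'ﬁ'
print(unicodedata.normalize("NFKC", "\ufb01"))  # 'fi'
```

The last two lines show why NFKC is more like a non-deterministic
collation: it equates strings that most users would still consider
distinct text.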

The problem is that it's not universally better for everyone: there are
certainly users who would prefer that the codepoints they send to the
database are preserved exactly, and also users who would like to be
able to use unassigned code points.

Regards,
Jeff Davis

#18Jeff Davis
pgsql@j-davis.com
In reply to: Chapman Flack (#14)
Re: Pre-proposal: unicode normalized text

On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:

The SQL standard would have me able to:

CREATE TABLE foo (
   a CHARACTER VARYING CHARACTER SET UTF8,
   b CHARACTER VARYING CHARACTER SET LATIN1
)

and so on, and write character literals like

_UTF8'Hello, world!' and _LATIN1'Hello, world!'

Is there a use case for that? UTF-8 is able to encode any unicode code
point, it's relatively compact, and it's backwards-compatible with 7-
bit ASCII. If you have a variety of text data in your system (and in
many cases even if not), then UTF-8 seems like the right solution.

Text data encoded 17 different ways requires a lot of bookkeeping in
the type system, and it also requires injecting a bunch of fallible
transcoding operators around just to compare strings.
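The "fallible transcoding" point is easy to see in Python: the same text
in two encodings cannot be compared byte-for-byte, and decoding to a
common form can itself fail when bytes aren't valid in the assumed
encoding (an illustrative sketch):

```python
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'

# Same text, two encodings: byte comparison is meaningless across them.
print(latin1_bytes == utf8_bytes)  # False

# Comparing requires transcoding both to a common form first ...
print(latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8"))  # True

# ... and that step can fail: 0xE9 alone is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("transcoding failed:", e.reason)
```

Every comparison operator between differently-encoded string types would
carry this failure mode with it.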

Regards,
Jeff Davis

#19Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#18)
Re: Pre-proposal: unicode normalized text

On Wed, Oct 04, 2023 at 01:38:15PM -0700, Jeff Davis wrote:

On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:

The SQL standard would have me able to:

[...]
_UTF8'Hello, world!' and _LATIN1'Hello, world!'

Is there a use case for that? UTF-8 is able to encode any unicode code
point, it's relatively compact, and it's backwards-compatible with 7-
bit ASCII. If you have a variety of text data in your system (and in
many cases even if not), then UTF-8 seems like the right solution.

Text data encoded 17 different ways requires a lot of bookkeeping in
the type system, and it also requires injecting a bunch of fallible
transcoding operators around just to compare strings.

Better that than TEXT blobs w/ the encoding given by the `CREATE
DATABASE` or `initdb` default!

It'd be a lot _less_ fragile to have all text tagged with an encoding
(indirectly, via its type which then denotes the encoding).

That would be a lot of work, but starting with just a UTF-8 text type
would be an improvement.

Nico
--

#20Chapman Flack
chap@anastigmatix.net
In reply to: Jeff Davis (#18)
Re: Pre-proposal: unicode normalized text

On 2023-10-04 16:38, Jeff Davis wrote:

On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:

The SQL standard would have me able to:

CREATE TABLE foo (
   a CHARACTER VARYING CHARACTER SET UTF8,
   b CHARACTER VARYING CHARACTER SET LATIN1
)

and so on

Is there a use case for that? UTF-8 is able to encode any unicode code
point, it's relatively compact, and it's backwards-compatible with 7-
bit ASCII. If you have a variety of text data in your system (and in
many cases even if not), then UTF-8 seems like the right solution.

Well, for what reason does anybody run PG now with the encoding set
to anything besides UTF-8? I don't really have my finger on that pulse.
Could it be that it bloats common strings in their local script, and
with enough of those to store, it could matter to use the local
encoding that stores them more economically?

Also, while any Unicode transfer format can encode any Unicode code
point, I'm unsure whether it's yet the case that {any Unicode code
point} is a superset of every character repertoire associated with
every non-Unicode encoding.

The cheap glaring counterexample is SQL_ASCII. Half those code points
are *nobody knows what Unicode character* (or even *whether*). I'm not
insisting that's a good thing, but it is a thing.

It might be a very tidy future to say all text is Unicode and all
server encodings are UTF-8, but I'm not sure it wouldn't still
be a good step on the way to be able to store some things in
their own encodings. We have JSON and XML now, two data types
that are *formally defined* to accept any Unicode content, and
we hedge and mumble and say (well, as long as it goes in the
server encoding) and that makes me sad. Things like that should
be easy to handle even without declaring UTF-8 as a server-wide
encoding ... they already are their own distinct data types, and
could conceivably know their own encodings.

But there again, it's possible that going with unconditional
UTF-8 for JSON or XML documents could, in some regions, bloat them.

Regards,
-Chap

#21Jeff Davis
pgsql@j-davis.com
In reply to: Isaac Morland (#16)
#22Nico Williams
nico@cryptonector.com
In reply to: Chapman Flack (#20)
#23Jeff Davis
pgsql@j-davis.com
In reply to: Nico Williams (#19)
#24Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#23)
#25Isaac Morland
isaac.morland@gmail.com
In reply to: Jeff Davis (#21)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Isaac Morland (#25)
#27Isaac Morland
isaac.morland@gmail.com
In reply to: Robert Haas (#26)
#28Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#26)
#29Nico Williams
nico@cryptonector.com
In reply to: Robert Haas (#26)
#30Jeff Davis
pgsql@j-davis.com
In reply to: Isaac Morland (#27)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nico Williams (#29)
#32Nico Williams
nico@cryptonector.com
In reply to: Tom Lane (#31)
#33Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#6)
#34Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#28)
#35Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#33)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Nico Williams (#29)
#37Nico Williams
nico@cryptonector.com
In reply to: Robert Haas (#36)
#38Jeff Davis
pgsql@j-davis.com
In reply to: Nico Williams (#32)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Nico Williams (#37)
#40Nico Williams
nico@cryptonector.com
In reply to: Robert Haas (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Nico Williams (#40)
#42Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#36)
#43Isaac Morland
isaac.morland@gmail.com
In reply to: Jeff Davis (#42)
#44Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Jeff Davis (#42)
#45Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#11)
#46Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#42)
#47Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#45)
#48Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#35)
#49Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#47)
#50Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#49)
#51Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#49)
#52Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#50)
#53Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#52)
#54Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#51)
#55Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#52)
#56Daniel Verite
daniel@manitou-mail.org
In reply to: Jeff Davis (#55)
#57Robert Haas
robertmhaas@gmail.com
In reply to: Daniel Verite (#56)
#58Isaac Morland
isaac.morland@gmail.com
In reply to: Robert Haas (#57)
#59Robert Haas
robertmhaas@gmail.com
In reply to: Isaac Morland (#58)
#60Jeff Davis
pgsql@j-davis.com
In reply to: Daniel Verite (#56)
#61Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#55)
#62Thomas Munro
thomas.munro@gmail.com
In reply to: Jeff Davis (#61)
#63Nico Williams
nico@cryptonector.com
In reply to: Robert Haas (#41)
#64Nico Williams
nico@cryptonector.com
In reply to: Robert Haas (#11)
#65Nico Williams
nico@cryptonector.com
In reply to: Daniel Verite (#56)
#66Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#17)
#67Jeff Davis
pgsql@j-davis.com
In reply to: Thomas Munro (#62)
#68David Rowley
dgrowleyml@gmail.com
In reply to: Jeff Davis (#67)
#69John Naylor
john.naylor@enterprisedb.com
In reply to: Jeff Davis (#61)
#70Jeff Davis
pgsql@j-davis.com
In reply to: David Rowley (#68)
#71Jeff Davis
pgsql@j-davis.com
In reply to: John Naylor (#69)
#72Phil Krylov
phil@krylov.eu
In reply to: Chapman Flack (#20)
#73Thomas Munro
thomas.munro@gmail.com
In reply to: David Rowley (#68)
#74David Rowley
dgrowleyml@gmail.com
In reply to: Thomas Munro (#73)
#75Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#3)
#76Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#75)
#77Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#75)