UTF8 national character data type support WIP patch and list of open issues.

Started by Boguk, Maksym · over 12 years ago · 58 messages · pgsql-hackers
#1Boguk, Maksym
maksymb@fast.au.fujitsu.com

Hi,

As part of my job I have started developing in-core support for UTF8
NATIONAL character types (NATIONAL CHARACTER / NATIONAL CHARACTER
VARYING).
I have attached the current WIP patch (against HEAD) for community review.

Target usage: the ability to store UTF8 national characters in selected
fields inside a single-byte encoded database.
For example, if I have a ru-RU.koi8r encoded database with mostly Russian
text inside, it would be nice to be able to store Japanese text in one
field without converting the whole database to UTF8 (converting such a
database to UTF8 could easily almost double its size, even if only one
field in the whole database uses any characters outside the ru-RU.koi8r
encoding).
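The size difference described above is easy to see by encoding the same Russian text both ways (an illustrative sketch using Python's codecs, where `koi8_r` is the codec name for koi8-r; not part of the patch):

```python
# Compare the byte size of Russian text in koi8-r vs. UTF-8.
text = "Привет, мир"  # 11 characters, mostly Cyrillic

koi8_bytes = text.encode("koi8_r")  # 1 byte per character
utf8_bytes = text.encode("utf-8")   # 2 bytes per Cyrillic character

print(len(koi8_bytes))  # 11
print(len(utf8_bytes))  # 20 -- nearly double
```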

What has been done:

1) Addition of the new string data types NATIONAL CHARACTER and NATIONAL
CHARACTER VARYING.
These types differ from the char/varchar data types in one important
respect: NATIONAL string types always have UTF8 encoding, independent of
the database encoding.
Of course, that leads to encoding-conversion overhead when comparing
NATIONAL string types with the common string types (that is expected and
unavoidable).
2) Some ECPG support for these types.
3) Some documentation (not finished).

What still needs to be done:

1) A full set of string functions and operators for the NATIONAL types
(we cannot use the generic text functions because they assume that the
strings have the database encoding).
Currently only a basic set is implemented.
2) Some way to define a default collation for the NATIONAL types.
3) Some way to input UTF8 characters into NATIONAL types via SQL (there
is a serious open problem here, described below).

The most serious open problem is that the patch in its current state does
not allow input/output of UTF8 characters into NATIONAL fields if those
characters cannot be represented in the database encoding.
This happens because the encoding conversion from client_encoding to the
database encoding takes place before the syntax analysis/parse stage and
throws an error for characters that cannot be represented.
I don't see any good solution to this problem other than making the whole
codebase use UTF8 for all internal operations, with a huge performance
hit.
Maybe someone has a good idea how to deal with this issue.

This is very much a WIP patch (with lots of things on the todo list and
polish still required).

Kindly tell me what you think about this idea/patch in general.

PS: This is my first patch to PostgreSQL, so there is certainly a lot of
room for improvement in style.

Kind Regards,
Maksym

Attachments:

pgHEAD_UTF8NCHAR.patch (application/octet-stream, +3339/-255)
#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Boguk, Maksym (#1)
Re: UTF8 national character data type support WIP patch and list of open issues.

On 03.09.2013 05:28, Boguk, Maksym wrote:

Target usage: the ability to store UTF8 national characters in selected
fields inside a single-byte encoded database.
For example, if I have a ru-RU.koi8r encoded database with mostly Russian
text inside, it would be nice to be able to store Japanese text in one
field without converting the whole database to UTF8 (converting such a
database to UTF8 could easily almost double its size, even if only one
field in the whole database uses any characters outside the ru-RU.koi8r
encoding).

Ok.

What has been done:

1) Addition of the new string data types NATIONAL CHARACTER and NATIONAL
CHARACTER VARYING.
These types differ from the char/varchar data types in one important
respect: NATIONAL string types always have UTF8 encoding, independent of
the database encoding.

I don't like the approach of adding a new data type for this. The
encoding used for a text field should be an implementation detail, not
something that's exposed to users at the schema-level. A separate data
type makes an nvarchar field behave slightly differently from text, for
example when it's passed to and from functions. It will also require
drivers and client applications to know about it.

What need to be done:

1) A full set of string functions and operators for the NATIONAL types
(we cannot use the generic text functions because they assume that the
strings have the database encoding).
Currently only a basic set is implemented.
2) Some way to define a default collation for the NATIONAL types.
3) Some way to input UTF8 characters into NATIONAL types via SQL (there
is a serious open problem here, described below).

Yeah, all of these issues stem from the fact that the NATIONAL types are
separate from text.

I think we should take a completely different approach to this. Two
alternatives spring to mind:

1. Implement a new encoding. The new encoding would be some variant of
UTF-8 that encodes languages like Russian more efficiently. Then just
use that in the whole database. Something like SCSU
(http://www.unicode.org/reports/tr6/) should do the trick, although I'm
not sure if SCSU can be used as a server-encoding. A lot of code relies
on the fact that a server encoding must have the high bit set in all
bytes that are part of a multi-byte character. That's why SJIS for
example can only be used as a client-encoding. But surely you could come
up with some subset or variant of SCSU which satisfies that requirement.
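The server-encoding rule Heikki mentions can be demonstrated directly: every byte of a multi-byte UTF-8 character has the high bit set, while Shift-JIS can embed ASCII-range bytes inside a multi-byte character (a small Python illustration, not part of any patch):

```python
# UTF-8: all bytes of a multi-byte character have the high bit (0x80) set.
utf8 = "日".encode("utf-8")       # b'\xe6\x97\xa5'
assert all(b & 0x80 for b in utf8)

# Shift-JIS: katakana 'ソ' encodes as 0x83 0x5C -- the trail byte 0x5C is
# ASCII backslash, which confuses byte scanners that assume the rule above.
sjis = "ソ".encode("shift_jis")   # b'\x83\x5c'
assert any(not (b & 0x80) for b in sjis)
print(sjis)  # b'\x83\\'
```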

2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar SET
STORAGE MAIN". That will make Postgres compress that field. That might
not be very efficient for compressing short Cyrillic text encoded in
UTF-8 today, but that could be improved. There has been discussion on
supporting more compression algorithms in the past, and one such
algorithm could be again something like SCSU.
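The "might not be very efficient for short text" caveat is worth seeing in numbers. Postgres's TOAST compressor is pglz, not zlib, but zlib makes a reasonable stand-in for the principle: generic LZ-style compression only pays off once there is enough repetition, and a short string can even grow because of header overhead (a sketch, assuming zlib as a proxy for a generic compressor):

```python
import zlib

short = "Привет, мир!".encode("utf-8")          # a short Cyrillic string
long = ("Привет, мир! " * 100).encode("utf-8")  # long, repetitive text

# Short novel text grows under compression; long repetitive text shrinks.
print(len(short), len(zlib.compress(short)))
print(len(long), len(zlib.compress(long)))
```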

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#2)
Re: UTF8 national character data type support WIP patch and list of open issues.

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 03.09.2013 05:28, Boguk, Maksym wrote:

Target usage: ability to store UTF8 national characters in some
selected fields inside a single-byte encoded database.

I think we should take a completely different approach to this. Two
alternatives spring to mind:

1. Implement a new encoding. The new encoding would be some variant of
UTF-8 that encodes languages like Russian more efficiently.

+1. I'm not sure that SCSU satisfies the requirement (which I read as
that Russian text should be pretty much 1 byte/character). But surely
we could devise a variant that does. For instance, it could look like
koi8r (or any other single-byte encoding of your choice) with one byte
value, say 255, reserved as a prefix. 255 means that a UTF8 character
follows. The main complication here is that you don't want to allow more
than one way to represent a character --- else you break text hashing,
for instance. So you'd have to take care that you never emit the 255+UTF8
representation for a character that can be represented in the single-byte
encoding. In particular, you'd never encode ASCII that way, and thus this
would satisfy the all-multibyte-chars-must-have-all-high-bits-set rule.

Ideally we could make a variant like this for each supported single-byte
encoding, and thus you could optimize a database for "mostly but not
entirely LATIN1 text", etc.

regards, tom lane


#4Boguk, Maksym
maksymb@fast.au.fujitsu.com
In reply to: Heikki Linnakangas (#2)
Re: UTF8 national character data type support WIP patch and list of open issues.

1) Addition of the new string data types NATIONAL CHARACTER and NATIONAL
CHARACTER VARYING.
These types differ from the char/varchar data types in one important
respect: NATIONAL string types always have UTF8 encoding, independent of
the database encoding.

I don't like the approach of adding a new data type for this. The
encoding used for a text field should be an implementation detail, not
something that's exposed to users at the schema-level. A separate data
type makes an nvarchar field behave slightly differently from text, for
example when it's passed to and from functions. It will also require
drivers and client applications to know about it.

Hi, my task is implementing the ANSI NATIONAL character string types as
part of the PostgreSQL core.
And the requirement "require drivers and client applications to know
about it" is the reason why this cannot be done as an add-on (in my
experience, these new types need fixed OIDs for most drivers).
Implementing them as a UTF8 data type is a first step that allows having
NATIONAL characters with an encoding different from the database encoding
(and maybe, in the future, even support for multiple encodings for the
common string types).

1) A full set of string functions and operators for the NATIONAL types
(we cannot use the generic text functions because they assume that the
strings have the database encoding).
Currently only a basic set is implemented.
2) Some way to define a default collation for the NATIONAL types.
3) Some way to input UTF8 characters into NATIONAL types via SQL (there
is a serious open problem here, described below).

Yeah, all of these issues stem from the fact that the NATIONAL types are
separate from text.

I think we should take a completely different approach to this. Two
alternatives spring to mind:

1. Implement a new encoding. The new encoding would be some variant of
UTF-8 that encodes languages like Russian more efficiently. Then just
use that in the whole database. Something like SCSU
(http://www.unicode.org/reports/tr6/) should do the trick, although I'm
not sure if SCSU can be used as a server-encoding. A lot of code relies
on the fact that a server encoding must have the high bit set in all
bytes that are part of a multi-byte character. That's why SJIS for
example can only be used as a client-encoding. But surely you could come
up with some subset or variant of SCSU which satisfies that requirement.

2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar SET
STORAGE MAIN". That will make Postgres compress that field. That might
not be very efficient for compressing short Cyrillic text encoded in
UTF-8 today, but that could be improved. There has been discussion on
supporting more compression algorithms in the past, and one such
algorithm could be again something like SCSU.

Both of these approaches require a dump/restore of the whole database,
which is not always an option.
Implementing a UTF8 NATIONAL character type as a new datatype would
provide the option of using pg_upgrade to the latest version and getting
the required functionality without prolonged downtime.

PS: Is it possible to reserve a narrow range of type OIDs in the
PostgreSQL core for future use?

Kind Regards,
Maksym


#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Boguk, Maksym (#4)
Re: UTF8 national character data type support WIP patch and list of open issues.

"Boguk, Maksym" <maksymb@fast.au.fujitsu.com> writes:

Hi, my task is implementing ANSI NATIONAL character string types as
part of PostgreSQL core.

No, that's not a given. You have a problem to solve, ie store some UTF8
strings in a database that's mostly just 1-byte data. It is not clear
that NATIONAL CHARACTER is the best solution to that problem. And I don't
think that you're going to convince anybody that this is an improvement in
spec compliance, because there's too much gap between what you're doing
here and what it says in the spec.

Both of these approaches require a dump/restore of the whole database,
which is not always an option.

That's a disadvantage, agreed, but it's not a large enough one to reject
the approach, because what you want to do also has very significant
disadvantages.

I think it is extremely likely that we will end up rejecting a patch based
on NATIONAL CHARACTER altogether. It will require too much duplicative
code, it requires too many application-side changes to make use of the
functionality, and it will break any applications that are relying on the
current behavior of that syntax. But the real problem is that you're
commandeering syntax defined in the SQL spec for what is in the end quite
a narrow usage. I agree that the use-case will be very handy for some
applications ... but if we were ever to try to achieve real spec
compliance for the SQL features around character sets, this doesn't look
like a step on the way to that.

I think you'd be well advised to take a hard look at the
specialized-database-encoding approach. From here it looks like a 99%
solution for about 1% of the effort; and since it would be quite
uninvasive to the system as a whole, it's unlikely that such a patch
would get rejected.

regards, tom lane


#6Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tom Lane (#5)
Re: UTF8 national character data type support WIP patch and list of open issues.

Hello,

I think it would be nice for PostgreSQL to support national character types
largely because it should ease migration from other DBMSs.

[Reasons why we need NCHAR]
--------------------------------------------------
1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL,
etc. all have NCHAR support. PostgreSQL is probably the only database out
of major ones that does not support NCHAR.
Sadly, I've read a report from some Japanese government agency that the
number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or
2011. I wouldn't say that is due to NCHAR support, but it might be one
reason. I want PostgreSQL to be more popular and regain those users.

2. Enhance the "open" image of PostgreSQL by implementing more features
of the SQL standard. NCHAR may be a wrong and unnecessary feature of the
SQL standard now that we have Unicode support, but it is defined in the
standard and widely implemented.

3. I have heard that some potential customers didn't adopt PostgreSQL due to
lack of NCHAR support. However, I don't know the exact reason why they need
NCHAR.

4. I guess some users really want to continue to use ShiftJIS or EUC_JP for
database encoding, and use NCHAR for a limited set of columns to store
international text in Unicode:
- to avoid code conversion between the server and the client for performance
- because ShiftJIS and EUC_JP require less storage (2 bytes for
most Kanji) than UTF-8 (3 bytes)
This use case is described in chapter 6 of "Oracle Database Globalization
Support Guide".
--------------------------------------------------

I think we need to do the following:

[Minimum requirements]
--------------------------------------------------
1. Accept NCHAR/NVARCHAR as data type names and the N'...' literal syntax.
This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms
for CHAR/VARCHAR, and ignores the N prefix. But this is not documented.

2. Declare support for national characters in the manual.
Item 1 is not sufficient because users don't want to depend on
undocumented behavior. This is exactly what the TODO item "national
character support" on the PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so
that:
- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as
CHAR/VARCHAR.
- additional features can be implemented for NCHAR/NVARCHAR in the
future, as described below.
--------------------------------------------------

[Optional requirements]
--------------------------------------------------
1. Implement client driver support, such as:
- NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as
specified in the SQL standard.
- national character methods (e.g. setNString, getNString,
setNCharacterStream) as specified in JDBC 4.0.
I think at first we can treat these national-character-specific features
the same as CHAR/VARCHAR.

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.
I think it is sufficient at first that NCHAR/NVARCHAR columns can only be
used in UTF-8 databases and they store UTF-8 strings. This allows us to
reuse the input/output/send/recv functions and other infrastructure of
CHAR/VARCHAR. This is a reasonable compromise to avoid duplication and
minimize the first implementation of NCHAR support.

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in
Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
fixed-width encoding due to supplementary characters.
--------------------------------------------------
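The caveat in optional requirement 3 is easy to demonstrate: in UTF-16, characters outside the Basic Multilingual Plane take a surrogate pair, so the encoding is not truly fixed-width (a short Python illustration):

```python
bmp = "日"   # U+65E5, inside the Basic Multilingual Plane
smp = "😀"   # U+1F600, outside the BMP: needs a surrogate pair

print(len(bmp.encode("utf-16-be")))  # 2 bytes: one 16-bit code unit
print(len(smp.encode("utf-16-be")))  # 4 bytes: two 16-bit code units
```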

I don't think it is good to implement NCHAR/NVARCHAR types as extensions
like contrib/citext, because NCHAR/NVARCHAR are basic types and need
client-side support. That is, client drivers need to be aware of the fixed
NCHAR/NVARCHAR OID values.

How do you think we should implement NCHAR support?

Regards
MauMau


#7Arulappan, Arul Shaji
arul@fast.au.fujitsu.com
In reply to: Tsunakawa, Takayuki (#6)
Re: UTF8 national character data type support WIP patch and list of open issues.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of MauMau

Hello,

I think it would be nice for PostgreSQL to support national character
types largely because it should ease migration from other DBMSs.

[Reasons why we need NCHAR]
--------------------------------------------------
1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server,
MySQL, etc. all have NCHAR support. PostgreSQL is probably the only
database out of major ones that does not support NCHAR.
Sadly, I've read a report from some Japanese government agency that the
number of MySQL users exceeded that of PostgreSQL here in Japan in 2010
or 2011. I wouldn't say that is due to NCHAR support, but it might be
one reason. I want PostgreSQL to be more popular and regain those users.

2. Enhance the "open" image of PostgreSQL by implementing more features
of the SQL standard. NCHAR may be a wrong and unnecessary feature of the
SQL standard now that we have Unicode support, but it is defined in the
standard and widely implemented.

3. I have heard that some potential customers didn't adopt PostgreSQL
due to lack of NCHAR support. However, I don't know the exact reason why
they need NCHAR.

The use case we have is for customer(s) who are modernizing their
databases on mainframes. These applications are typically written in
COBOL, which does have extensive support for national characters.
Supporting national characters as built-in data types in PostgreSQL is,
not to exaggerate, an important criterion in their decision whether to
use PostgreSQL. (So is embedded COBOL, but that is a separate issue.)

4. I guess some users really want to continue to use ShiftJIS or EUC_JP
for the database encoding, and use NCHAR for a limited set of columns to
store international text in Unicode:
- to avoid code conversion between the server and the client for
performance
- because ShiftJIS and EUC_JP require less storage (2 bytes for most
Kanji) than UTF-8 (3 bytes)
This use case is described in chapter 6 of "Oracle Database
Globalization Support Guide".
--------------------------------------------------

I think we need to do the following:

[Minimum requirements]
--------------------------------------------------
1. Accept NCHAR/NVARCHAR as data type names and the N'...' literal
syntax.
This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as
synonyms for CHAR/VARCHAR, and ignores the N prefix. But this is not
documented.

2. Declare support for national characters in the manual.
Item 1 is not sufficient because users don't want to depend on
undocumented behavior. This is exactly what the TODO item "national
character support" on the PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so
that:
- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as
CHAR/VARCHAR.
- additional features can be implemented for NCHAR/NVARCHAR in the
future, as described below.
--------------------------------------------------

Agreed. This is our minimum requirement too.

Rgds,
Arul Shaji

[Optional requirements]
--------------------------------------------------
1. Implement client driver support, such as:
- an NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as
specified in the SQL standard.
- national character methods (e.g. setNString, getNString,
setNCharacterStream) as specified in JDBC 4.0.
I think at first we can treat these national-character-specific features
the same as CHAR/VARCHAR.

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.
I think it is sufficient at first that NCHAR/NVARCHAR columns can only
be used in UTF-8 databases and that they store UTF-8 strings. This
allows us to reuse the input/output/send/recv functions and other
infrastructure of CHAR/VARCHAR. This is a reasonable compromise to avoid
duplication and minimize the first implementation of NCHAR support.

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
A fixed-width encoding may allow faster string manipulation, as
described in Oracle's manual. But I'm not sure about this, because
UTF-16 is not a real fixed-width encoding due to supplementary
characters.

This would definitely be a welcome addition.

--------------------------------------------------

I don't think it is good to implement NCHAR/NVARCHAR types as extensions
like contrib/citext, because NCHAR/NVARCHAR are basic types and need
client-side support. That is, client drivers need to be aware of the
fixed NCHAR/NVARCHAR OID values.

How do you think we should implement NCHAR support?

Regards
MauMau



#8Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#6)
Re: UTF8 national character data type support WIP patch and list of open issues.

On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote:

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.

...

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in
Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
fixed-width encoding due to supplementary characters.

It seems to me that these two points here are the real core of your
proposal. The rest is just syntactic sugar.

Let me start with the second one: I don't think there's likely to be
any benefit in using UTF-16 as the internal encoding. In fact, I
think it's likely to make things quite a bit more complicated, because
we have a lot of code that assumes that server encodings have certain
properties that UTF-16 doesn't - specifically, that any byte with the
high-bit clear represents the corresponding ASCII character.

As to the first one, if we're going to go to the (substantial) trouble
of building infrastructure to allow a database to store data in
multiple encodings, why limit it to storing UTF-8 in non-UTF-8
databases? What about storing SHIFT-JIS in UTF-8 databases, or
Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other
combination you might care to name?

Whether we go that way or not, I think storing data in one encoding in
a database with a different encoding is going to be pretty tricky and
require far-reaching changes. You haven't mentioned any of those
issues or discussed how you would solve them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: UTF8 national character data type support WIP patch and list of open issues.

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote:

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.
...
3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in
Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
fixed-width encoding due to supplementary characters.

It seems to me that these two points here are the real core of your
proposal. The rest is just syntactic sugar.

Let me start with the second one: I don't think there's likely to be
any benefit in using UTF-16 as the internal encoding. In fact, I
think it's likely to make things quite a bit more complicated, because
we have a lot of code that assumes that server encodings have certain
properties that UTF-16 doesn't - specifically, that any byte with the
high-bit clear represents the corresponding ASCII character.

Another point to keep in mind is that UTF16 is not really any easier
to deal with than UTF8, unless you write code that fails to support
characters outside the basic multilingual plane. Which is a restriction
I don't believe we'd accept. But without that restriction, you're still
forced to deal with variable-width characters; and there's nothing very
nice about the way that's done in UTF16. So on the whole I think it
makes more sense to use UTF8 for this.

I share Robert's misgivings about difficulties in dealing with characters
that are not representable in the database's principal encoding. Still,
you probably won't find out about many of those until you try it.

regards, tom lane


#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#8)
Re: UTF8 national character data type support WIP patch and list of open issues.

On 18.09.2013 16:16, Robert Haas wrote:

On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote:

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.

...

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in
Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
fixed-width encoding due to supplementary characters.

It seems to me that these two points here are the real core of your
proposal. The rest is just syntactic sugar.

Let me start with the second one: I don't think there's likely to be
any benefit in using UTF-16 as the internal encoding. In fact, I
think it's likely to make things quite a bit more complicated, because
we have a lot of code that assumes that server encodings have certain
properties that UTF-16 doesn't - specifically, that any byte with the
high-bit clear represents the corresponding ASCII character.

As to the first one, if we're going to go to the (substantial) trouble
of building infrastructure to allow a database to store data in
multiple encodings, why limit it to storing UTF-8 in non-UTF-8
databases? What about storing SHIFT-JIS in UTF-8 databases, or
Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other
combination you might care to name?

Whether we go that way or not, I think storing data in one encoding in
a database with a different encoding is going to be pretty tricky and
require far-reaching changes. You haven't mentioned any of those
issues or discussed how you would solve them.

I'm not too thrilled about complicating the system for that, either. If
you really need to deal with many different languages, you can do that
today by using UTF-8 everywhere. Sure, it might not be the most
efficient encoding for some characters, but it works.

There is one reason, however, that makes it a lot more compelling: we
already support having databases with different encodings in the same
cluster, but the encoding used in the shared catalogs, for usernames and
database names for example, is not well-defined. If we dealt with
different encodings in the same database, that inconsistency would go away.

- Heikki


#11Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#8)
Re: UTF8 national character data type support WIP patch and list of open issues.

From: "Robert Haas" <robertmhaas@gmail.com>

On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote:

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.

...

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in
Oracle's manual. But I'm not sure about this, because UTF-16 is not a
real fixed-width encoding due to supplementary characters.

It seems to me that these two points here are the real core of your
proposal. The rest is just syntactic sugar.

No, those are "desirable if possible" features. What's important is to
declare in the manual that PostgreSQL officially supports national character
types, as I stated below.

1. Accept NCHAR/NVARCHAR as data type names and the N'...' literal
syntax.
This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as
synonyms for CHAR/VARCHAR, and ignores the N prefix. But this is not
documented.

2. Declare support for national characters in the manual.
Item 1 is not sufficient because users don't want to depend on
undocumented behavior. This is exactly what the TODO item "national
character support" on the PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so
that:
- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as
CHAR/VARCHAR.
- additional features can be implemented for NCHAR/NVARCHAR in the
future, as described below.

And when declaring that, we had better implement NCHAR types as distinct
types with their own OIDs so that we can extend NCHAR behavior in the
future.
As the first stage, I think it's okay to treat NCHAR types exactly the same
as CHAR/VARCHAR types. For example, in ECPG:

switch (type)
{
    case OID_FOR_CHAR:
    case OID_FOR_VARCHAR:
    case OID_FOR_TEXT:
    case OID_FOR_NCHAR:     /* new code */
    case OID_FOR_NVARCHAR:  /* new code */
        /* some processing */
        break;
}
And in JDBC, just call methods for non-national character types.
Currently, those national character methods throw SQLException.

public void setNString(int parameterIndex, String value) throws SQLException
{
    setString(parameterIndex, value);
}

Let me start with the second one: I don't think there's likely to be
any benefit in using UTF-16 as the internal encoding. In fact, I
think it's likely to make things quite a bit more complicated, because
we have a lot of code that assumes that server encodings have certain
properties that UTF-16 doesn't - specifically, that any byte with the
high-bit clear represents the corresponding ASCII character.
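The ASCII-transparency assumption Robert describes can be illustrated with a
short sketch (Python here, purely for illustration; it is not PostgreSQL
code): in a valid server encoding such as UTF-8, any byte below 0x80 is the
ASCII character itself, whereas UTF-16 freely produces such bytes inside
non-ASCII characters, which would confuse byte-oriented code scanning for
delimiters.

```python
# In UTF-8 (a valid server encoding), every byte with the high bit
# clear is the corresponding ASCII character.
utf8 = "日本語abc".encode("utf-8")
for b in utf8:
    if b < 0x80:
        assert chr(b) in "abc"  # low bytes only appear as real ASCII

# In UTF-16, bytes with the high bit clear appear *inside* non-ASCII
# characters, so code scanning bytes for ASCII (quotes, backslashes,
# delimiters) would misfire.
utf16 = "日".encode("utf-16-le")   # b'\xe5\x65' -- contains 0x65 = 'e'
assert 0x65 in utf16 and "e" not in "日"
```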

As to the first one, if we're going to go to the (substantial) trouble
of building infrastructure to allow a database to store data in
multiple encodings, why limit it to storing UTF-8 in non-UTF-8
databases? What about storing SHIFT-JIS in UTF-8 databases, or
Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other
combination you might care to name?

Whether we go that way or not, I think storing data in one encoding in
a database with a different encoding is going to be pretty tricky and
require far-reaching changes. You haven't mentioned any of those
issues or discussed how you would solve them.

Yes, you are probably right -- I'm not sure UTF-16 has any real benefits that
UTF-8 doesn't have. But why did Windows and Java choose UTF-16 for internal
strings rather than UTF-8? Why did Oracle recommend UTF-16 for NCHAR? I
have no clear idea. Anyway, I won't push strongly for UTF-16 and complicate
the encoding handling.

Regards
MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tom Lane (#9)
Re: UTF8 national character data type support WIP patch and list of open issues.

From: "Tom Lane" <tgl@sss.pgh.pa.us>

Another point to keep in mind is that UTF16 is not really any easier
to deal with than UTF8, unless you write code that fails to support
characters outside the basic multilingual plane. Which is a restriction
I don't believe we'd accept. But without that restriction, you're still
forced to deal with variable-width characters; and there's nothing very
nice about the way that's done in UTF16. So on the whole I think it
makes more sense to use UTF8 for this.

I feel so. I guess the reason Windows, Java, and Oracle chose UTF-16 is that
it was still UCS-2, covering only the BMP, when they adopted it. So character
handling was easier and faster thanks to the fixed-width encoding.

Regards
MauMau


#13Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#11)
Re: UTF8 national character data type support WIP patch and list of open issues.

On Wed, Sep 18, 2013 at 6:42 PM, MauMau <maumau307@gmail.com> wrote:

It seems to me that these two points here are the real core of your
proposal. The rest is just syntactic sugar.

No, those are "desirable if possible" features. What's important is to
declare in the manual that PostgreSQL officially supports national character
types, as I stated below.

That may be what's important to you, but it's not what's important to
me. I am not keen to introduce support for nchar and nvarchar as
differently-named types with identical semantics. And I think it's an
even worse idea to introduce them now, making them work one way, and
then later change the behavior in a backward-incompatible fashion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#13)
Re: UTF8 national character data type support WIP patch and list of open issues.

From: "Robert Haas" <robertmhaas@gmail.com>

That may be what's important to you, but it's not what's important to
me.

National character type support may be important to some potential users of
PostgreSQL, and to the popularity of PostgreSQL, even if not to me. That's
why national character support is listed on the PostgreSQL TODO wiki. We
might be losing potential users just because their selection criteria
include national character support.

I am not keen to introduce support for nchar and nvarchar as
differently-named types with identical semantics.

Similar examples already exist:

- varchar and text: the only difference is the existence of explicit length
limit
- numeric and decimal
- int and int4, smallint and int2, bigint and int8
- real/double precision and float

In addition, the SQL standard itself admits:

"The <key word>s NATIONAL CHARACTER are used to specify the character type
with an implementation-
defined character set. Special syntax (N'string') is provided for
representing literals in that character set.
...
"NATIONAL CHARACTER" is equivalent to the corresponding <character string
type> with a specification
of "CHARACTER SET CSN", where "CSN" is an implementation-defined <character
set name>."

"A <national character string literal> is equivalent to a <character string
literal> with the "N" replaced by
"<introducer><character set specification>", where "<character set
specification>" is an implementation-
defined <character set name>."

And I think it's an
even worse idea to introduce them now, making them work one way, and
then later change the behavior in a backward-incompatible fashion.

I understand your feeling. The concern about incompatibility can be
eliminated by thinking the following way. How about this?

- NCHAR can be used with any database encoding.

- At first, NCHAR is exactly the same as CHAR. That is,
"implementation-defined character set" described in the SQL standard is the
database character set.

- In the future, the character set for NCHAR can be selected at database
creation, like Oracle's CREATE DATABASE ... NATIONAL CHARACTER SET
AL16UTF16. The default is the database character set.

Could you tell me what kind of specification we should implement if we
officially support national character types?

Regards
MauMau


#15Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Robert Haas (#8)
Re: UTF8 national character data type support WIP patch and list of open issues.

On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote:

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
contain Unicode data.

...

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in
Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
fixed-width encoding due to supplementary characters.

It seems to me that these two points here are the real core of your
proposal. The rest is just syntactic sugar.

Let me start with the second one: I don't think there's likely to be
any benefit in using UTF-16 as the internal encoding. In fact, I
think it's likely to make things quite a bit more complicated, because
we have a lot of code that assumes that server encodings have certain
properties that UTF-16 doesn't - specifically, that any byte with the
high-bit clear represents the corresponding ASCII character.

Agreed.

As to the first one, if we're going to go to the (substantial) trouble
of building infrastructure to allow a database to store data in
multiple encodings, why limit it to storing UTF-8 in non-UTF-8
databases? What about storing SHIFT-JIS in UTF-8 databases, or
Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other
combination you might care to name?

Whether we go that way or not, I think storing data in one encoding in
a database with a different encoding is going to be pretty tricky and
require far-reaching changes. You haven't mentioned any of those
issues or discussed how you would solve them.

What about limiting to use NCHAR with a database which has same
encoding or "compatible" encoding (on which the encoding conversion is
defined)? This way, NCHAR text can be automatically converted from
NCHAR to the database encoding in the server side thus we can treat
NCHAR exactly same as CHAR afterward. I suppose what encoding is used
for NCHAR should be defined in initdb time or creation of the database
(if we allow this, we need to add a new column to know what encoding
is used for NCHAR).

For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is
UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is
SHIFT-JIS and database encoding is UTF-8 because there is a conversion
between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is
SHIFT-JIS and database encoding is ISO-8859-1 because there's no
conversion between them.
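Tatsuo's rule could be sketched as follows (Python; the names and the
conversion table are invented stand-ins for illustration, with the table
playing the role of the pg_conversion catalog):

```python
# Hypothetical sketch of the proposed rule: an NCHAR column is allowed
# only if the NCHAR encoding equals the database encoding, or a
# conversion between the two is defined (cf. pg_conversion).
CONVERSIONS = {          # stand-in for the pg_conversion catalog
    ("SJIS", "UTF8"),
    ("UTF8", "SJIS"),
}

def nchar_column_allowed(nchar_encoding: str, db_encoding: str) -> bool:
    if nchar_encoding == db_encoding:
        return True
    return (nchar_encoding, db_encoding) in CONVERSIONS

# The three cases from the example above:
assert nchar_column_allowed("UTF8", "UTF8")        # same encoding
assert nchar_column_allowed("SJIS", "UTF8")        # conversion exists
assert not nchar_column_allowed("SJIS", "LATIN1")  # no conversion
```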
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


#16Valentine Gogichashvili
In reply to: Tsunakawa, Takayuki (#14)
Re: UTF8 national character data type support WIP patch and list of open issues.

Hi,

That may be what's important to you, but it's not what's important to
me.

National character types support may be important to some potential users
of PostgreSQL and the popularity of PostgreSQL, not me. That's why
national character support is listed in the PostgreSQL TODO wiki. We might
be losing potential users just because their selection criteria includes
national character support.

The whole NCHAR concept appeared as a hack for systems that did not have
Unicode support from the beginning. It would not be needed if all text had
been stored in Unicode or UTF from the start, and if the idea of a character
had been the same as the idea of a rune rather than a byte.

PostgreSQL has very powerful facilities for storing any kind of encoding.
So maybe it makes sense to add ENCODING as another column property, the
same way COLLATION was added?

It would make it possible to have a database, that talks to the clients in
UTF8 and stores text and varchar data in the encoding that is the most
appropriate for the situation.

It would make it impossible (or complicated) to give the database a
non-UTF8 default encoding (though I wonder who would need that in this
case), as conversions would not be possible from the broader character sets
into the default database encoding.

One could define an additional DATABASE property like LC_ENCODING that
would work for the ENCODING property of a column like LC_COLLATE for
COLLATE property of a column.

Text operations should work automatically, as in memory all strings will be
converted to the database encoding.

This approach would also open the possibility of implementing custom
ENCODINGs for column data storage, like snappy compression or even BSON,
gobs, or protobufs, for much more compact type storage.
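The per-column ENCODING idea could be sketched like this (Python; class and
method names are invented for illustration, not an actual design): each
column stores bytes in its own encoding, and values are converted to a
common in-memory form so text operations work unchanged.

```python
# Illustrative sketch only: a column property that, like COLLATE,
# records the storage encoding; values are normalized in memory.
class Column:
    def __init__(self, name, encoding):
        self.name = name
        self.encoding = encoding     # per-column ENCODING property
        self.storage = []            # raw bytes, in self.encoding

    def insert(self, value):
        # stored on disk in the column's own encoding...
        self.storage.append(value.encode(self.encoding))

    def fetch(self, i):
        # ...converted back for in-memory text operations
        return self.storage[i].decode(self.encoding)

col = Column("comment", "koi8_r")
col.insert("Привет")
assert col.fetch(0) == "Привет"
assert len(col.storage[0]) == 6      # KOI8-R: one byte per Cyrillic char
```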

Regards,

-- Valentine Gogichashvili

#17Martijn van Oosterhout
kleptog@svana.org
In reply to: Tatsuo Ishii (#15)
Re: UTF8 national character data type support WIP patch and list of open issues.

On Fri, Sep 20, 2013 at 08:58:53AM +0900, Tatsuo Ishii wrote:

For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is
UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is
SHIFT-JIS and database encoding is UTF-8 because there is a conversion
between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is
SHIFT-JIS and database encoding is ISO-8859-1 because there's no
conversion between them.

As far as I can tell, the whole reason for introducing NCHAR is to
support SHIFT-JIS; there hasn't been call for any other encodings that
I can remember, anyway.

So rather than this whole NCHAR thing, why not just add a type
"sjistext", and a few type casts and call it a day...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#18Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#14)
Re: UTF8 national character data type support WIP patch and list of open issues.

On Thu, Sep 19, 2013 at 6:42 PM, MauMau <maumau307@gmail.com> wrote:

National character types support may be important to some potential users of
PostgreSQL and the popularity of PostgreSQL, not me. That's why national
character support is listed in the PostgreSQL TODO wiki. We might be losing
potential users just because their selection criteria includes national
character support.

We'd have to go back and search the archives to figure out why that
item was added to the TODO, but I'd be surprised if anyone ever had it
in mind to create additional types that behave just like existing
types but with different names. I don't think that you'll be able to
get consensus around that path on this mailing list.

I am not keen to introduce support for nchar and nvarchar as
differently-named types with identical semantics.

Similar examples already exist:

- varchar and text: the only difference is the existence of explicit length
limit
- numeric and decimal
- int and int4, smallint and int2, bigint and int8
- real/double precison and float

I agree that the fact we have both varchar and text feels like a wart.
The other examples mostly involve different names for the same
underlying type, and so are different from what you are asking for
here.

I understand your feeling. The concern about incompatibility can be
eliminated by thinking the following way. How about this?

- NCHAR can be used with any database encoding.

- At first, NCHAR is exactly the same as CHAR. That is,
"implementation-defined character set" described in the SQL standard is the
database character set.

- In the future, the character set for NCHAR can be selected at database
creation, like Oracle's CREATE DATABASE ... NATIONAL CHARACTER SET
AL16UTF16. The default is the database character set.

Hmm. So under that design, a database could support up to a total of
two character sets, the one that you get when you say 'foo' and the
other one that you get when you say n'foo'.

I guess we could do that, but it seems a bit limited. If we're going
to go to the trouble of supporting multiple character sets, why not
support an arbitrary number instead of just two?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#19Robert Haas
robertmhaas@gmail.com
In reply to: Tatsuo Ishii (#15)
Re: UTF8 national character data type support WIP patch and list of open issues.

On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:

What about limiting to use NCHAR with a database which has same
encoding or "compatible" encoding (on which the encoding conversion is
defined)? This way, NCHAR text can be automatically converted from
NCHAR to the database encoding in the server side thus we can treat
NCHAR exactly same as CHAR afterward. I suppose what encoding is used
for NCHAR should be defined in initdb time or creation of the database
(if we allow this, we need to add a new column to know what encoding
is used for NCHAR).

For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is
UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is
SHIFT-JIS and database encoding is UTF-8 because there is a conversion
between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is
SHIFT-JIS and database encoding is ISO-8859-1 because there's no
conversion between them.

I think the point here is that, at least as I understand it, encoding
conversion and sanitization happens at a very early stage right now,
when we first receive the input from the client. If the user sends a
string of bytes as part of a query or bind placeholder that's not
valid in the database encoding, it's going to error out before any
type-specific code has an opportunity to get control. Look at
textin(), for example. There's no encoding check there. That means
it's already been done at that point. To make this work, someone's
going to have to figure out what to do about *that*. Until we have a
sketch of what the design for that looks like, I don't see how we can
credibly entertain more specific proposals.
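The early-validation behavior Robert describes can be mimicked in a few
lines (Python; a toy model with invented names, not PostgreSQL's actual
code path): incoming bytes are checked against the database encoding before
any per-type input function such as textin() runs, so type-specific code
never gets a chance to accept bytes in a different encoding.

```python
# Toy model of the current input pipeline: encoding validation happens
# once, up front, before any type input function gets control.
def receive_parameter(raw, db_encoding):
    # analogous to the early pg_verifymbstr() check: reject bytes
    # invalid in the database encoding before dispatching to
    # textin()-like per-type code
    try:
        return raw.decode(db_encoding)
    except UnicodeDecodeError:
        raise ValueError("invalid byte sequence for encoding " + db_encoding)

# "日本" encoded in Shift-JIS is rejected by a UTF-8 database long
# before an NCHAR input function could decide to accept it.
sjis_bytes = "日本".encode("shift_jis")
try:
    receive_parameter(sjis_bytes, "utf-8")
    reached_type_input = True
except ValueError:
    reached_type_input = False
assert not reached_type_input
```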

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#18)
Re: UTF8 national character data type support WIP patch and list of open issues.

On 9/20/13 2:22 PM, Robert Haas wrote:

I am not keen to introduce support for nchar and nvarchar as
differently-named types with identical semantics.

Similar examples already exist:

- varchar and text: the only difference is the existence of explicit length
limit
- numeric and decimal
- int and int4, smallint and int2, bigint and int8
- real/double precision and float

I agree that the fact we have both varchar and text feels like a wart.
The other examples mostly involve different names for the same
underlying type, and so are different from what you are asking for
here.

Also note that we already have NCHAR [VARYING]. It's mapped to char or
varchar, respectively, in the parser, just like int, real, etc. are handled.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tatsuo Ishii (#15)
#22Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Martijn van Oosterhout (#17)
#23Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Valentine Gogichashvili (#16)
#24Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#18)
#25Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#19)
#26Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Robert Haas (#19)
In reply to: Tsunakawa, Takayuki (#23)
#28Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tatsuo Ishii (#26)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#24)
#30Peter Eisentraut
peter_e@gmx.net
In reply to: Tsunakawa, Takayuki (#28)
#31Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#29)
#32Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Peter Eisentraut (#30)
#33Peter Eisentraut
peter_e@gmx.net
In reply to: Tsunakawa, Takayuki (#32)
#34Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Peter Eisentraut (#33)
#35Bruce Momjian
bruce@momjian.us
In reply to: Tsunakawa, Takayuki (#32)
#36Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Bruce Momjian (#35)
#37Arulappan, Arul Shaji
arul@fast.au.fujitsu.com
In reply to: Tsunakawa, Takayuki (#36)
#38Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Arulappan, Arul Shaji (#37)
#39Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Laurenz Albe (#38)
#40Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tsunakawa, Takayuki (#39)
#41Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Laurenz Albe (#40)
#42Peter Eisentraut
peter_e@gmx.net
In reply to: Arulappan, Arul Shaji (#37)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#42)
#44Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#43)
#45Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tsunakawa, Takayuki (#44)
#46Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Laurenz Albe (#45)
#47Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Laurenz Albe (#45)
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tsunakawa, Takayuki (#47)
#49Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#48)
#50Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#49)
#51Martijn van Oosterhout
kleptog@svana.org
In reply to: Tatsuo Ishii (#49)
#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#51)
#53Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Martijn van Oosterhout (#51)
#54Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#52)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#54)
#56Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#55)
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#56)
#58Chapman Flack
chap@anastigmatix.net
In reply to: Robert Haas (#18)