Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

Started by Jan Lentferabout 10 years ago10 messagesbugs
Jump to latest
#1Jan Lentfer
Jan.Lentfer@web.de

We found a problem with comments on databases using german umlauts (äöü)
- or more generally, we found encoding problems when doing "COMMENT ON
database foo ..." when foo is UTF-8 encoded and the database you are
connected to (e.g. postgres) is SQL_ASCII.
I analyzed this with the help of Andrew Gierth (and others) on IRC - it
seems in that constellation you can write an non-UTF-8 comment onto a
UTF-8 database.
This leads to the problem, that wen trying to use pg_restore -C with a
dump created with -Fc the restore will fail when trying to do the
"COMMENT ON DATABASE.."
We stumpled across this because also pg_upgrade fails on this (during
the schema part), but doesn't detect this situation with the -c (check)
option beforehand.
This is 9.4.5 on Solaris 11 - but we did have that problem already when
using pg_upgrade from 8.4 to 9.1 a few years back. I am sorry we didn't
analyze further back then, but it was "just a comment". But now it bit
us again....

Here is what I did so far to analyze the problem (with instrucions from
Andrew Gierth). I hope this makes it clear, otherwise please don't
hesitate to request more details.

postgres=# create database comment_test template template0 encoding
'UTF-8';
CREATE DATABASE
#####

---> use pgadmin3 on Windows to set comment on database coment_test with
string "für", while being connected to postgres (SQL_ASCII encoded)
#####

$ pg_dump -Fc comment_test -f comment_test.pgdump
postgres=# alter database comment_test rename to comment_test_orig;
ALTER DATABASE
#####

$ pg_restore -C -d template1 comment_test.pgdump
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1989; 1262 641528
COMMENT comment_test postgres
pg_restore: [archiver (db)] could not execute query: ERROR: invalid
byte sequence for encoding "UTF8": 0xfc
Command was: COMMENT ON DATABASE comment_test IS 'f¦r';
###

postgres@sz-pp-pg02-dev[~] $ pg_restore comment_test.pgdump
--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;

--
-- Name: comment_test; Type: COMMENT; Schema: -; Owner: postgres
--

COMMENT ON DATABASE comment_test IS 'f¦r';

--
-- Name: plpgsql; Type: EXTENSION; Schema: -; Owner:
--

CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog;

--
-- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner:
--

COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language';

--
-- Name: public; Type: ACL; Schema: -; Owner: postgres
--

REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM postgres;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO PUBLIC;

--
-- PostgreSQL database dump complete
--
#######

$ pg_restore comment_test.pgdump | head -16 | tail -1 |xxd
0000000: 434f 4d4d 454e 5420 4f4e 2044 4154 4142 COMMENT ON DATAB
0000010: 4153 4520 636f 6d6d 656e 745f 7465 7374 ASE comment_test
0000020: 2049 5320 2766 fc72 273b 0a IS 'f.r';.
######

postgres=# select
convert_to(shobj_description(oid,'pg_database'),'SQL_ASCII') from
pg_database where datname='comment_test_orig';
convert_to
------------
\x66fc72
(1 row)
######

postgres=# \l
List of databases
Name | Owner | Encoding | Collate |
Ctype | Access privileges
--------------------+-----------+-----------+------------------+------------------+-----------------------
comment_test | postgres | UTF8 | C | C
|
[...]
postgres | postgres | SQL_ASCII | C | C
|
template0 | postgres | SQL_ASCII | C | C
| =c/postgres +
| | | |
| postgres=CTc/postgres
template1 | postgres | SQL_ASCII | C | C
| postgres=CTc/postgres+
| | | |
| =c/postgres

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Jan Lentfer (#1)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

I am not familiar with pgadmin3 but it seems a bug with pgadmin3 or
misuse of it to me. Probably when this:

---> use pgadmin3 on Windows to set comment on database coment_test with
---> string "für", while being connected to postgres (SQL_ASCII encoded)

was executed, the client encoding is not properly set to UTF-8 for
some reasons.

I did almost similar thing except that I used Japanese HIRAGANA
character (I do not write German) and used psql to input COMMENT
command. Also the PostgreSQL version is 9.5.0, which is different from
yours but I don't think there has been no change in this area.

t-ishii@localhost: initdb --no-locale -E SQL_ASCII -D /tmp/testdb
The files belonging to this database system will be owned by user "t-ishii".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default text search configuration will be set to "english".
[snip]
Success. You can now start the database server using:

pg_ctl -D /tmp/testdb -l logfile start

t-ishii@localhost: export PGPORT=5435
t-ishii@localhost: pg_ctl -D /tmp/testdb/ start
t-ishii@localhost: psql -l
Pager usage is off.
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+---------+-----------+---------+-------+-------------------------
postgres | t-ishii | SQL_ASCII | C | C |
template0 | t-ishii | SQL_ASCII | C | C | =c/"t-ishii" +
| | | | | "t-ishii"=CTc/"t-ishii"
template1 | t-ishii | SQL_ASCII | C | C | =c/"t-ishii" +
| | | | | "t-ishii"=CTc/"t-ishii"
(3 rows)

t-ishii@localhost: psql postgres
postgres=# create database comment_test template template0 encoding 'UTF-8';
CREATE DATABASE
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
--------------+---------+-----------+---------+-------+-------------------------
comment_test | t-ishii | UTF8 | C | C |
postgres | t-ishii | SQL_ASCII | C | C |
template0 | t-ishii | SQL_ASCII | C | C | =c/"t-ishii" +
| | | | | "t-ishii"=CTc/"t-ishii"
template1 | t-ishii | SQL_ASCII | C | C | =c/"t-ishii" +
| | | | | "t-ishii"=CTc/"t-ishii"
(4 rows)

postgres=# \encoding UTF8
postgres=# comment on database comment_test is 'あいう';
COMMENT
postgres=# \q
t-ishii@localhost: pg_dump -Fc comment_test -f /tmp/comment_test.pgdump
t-ishii@localhost: psql postgres
Pager usage is off.
psql (9.5.0)
Type "help" for help.

postgres=# alter database comment_test rename to comment_test_orig;
ALTER DATABASE
postgres-# \q
t-ishii@localhost: pg_restore -C -d template1 /tmp/comment_test.pgdump
t-ishii@localhost: psql postgres
Pager usage is off.
psql (9.5.0)
Type "help" for help.

postgres=#
postgres=# \l+
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
-------------------+---------+-----------+---------+-------+-------------------------+---------+------------+--------------------------------------------
comment_test | t-ishii | UTF8 | C | C | | 6992 kB | pg_default | あいう
comment_test_orig | t-ishii | UTF8 | C | C | | 6992 kB | pg_default | あいう
postgres | t-ishii | SQL_ASCII | C | C | | 6992 kB | pg_default | default administrative connection database
template0 | t-ishii | SQL_ASCII | C | C | =c/"t-ishii" +| 6873 kB | pg_default | unmodifiable empty database
| | | | | "t-ishii"=CTc/"t-ishii" | | |
template1 | t-ishii | SQL_ASCII | C | C | =c/"t-ishii" +| 6992 kB | pg_default | default template for new databases
| | | | | "t-ishii"=CTc/"t-ishii" | | |
(5 rows)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

We found a problem with comments on databases using german umlauts
(äöü) - or more generally, we found encoding problems when doing
"COMMENT ON database foo ..." when foo is UTF-8 encoded and the
database you are connected to (e.g. postgres) is SQL_ASCII.
I analyzed this with the help of Andrew Gierth (and others) on IRC -
it seems in that constellation you can write an non-UTF-8 comment onto
a UTF-8 database.
This leads to the problem, that wen trying to use pg_restore -C with a
dump created with -Fc the restore will fail when trying to do the
"COMMENT ON DATABASE.."
We stumpled across this because also pg_upgrade fails on this (during
the schema part), but doesn't detect this situation with the -c
(check) option beforehand.
This is 9.4.5 on Solaris 11 - but we did have that problem already
when using pg_upgrade from 8.4 to 9.1 a few years back. I am sorry we
didn't analyze further back then, but it was "just a comment". But now
it bit us again....

Here is what I did so far to analyze the problem (with instrucions
from Andrew Gierth). I hope this makes it clear, otherwise please
don't hesitate to request more details.

postgres=# create database comment_test template template0 encoding
'UTF-8';
CREATE DATABASE
#####

---> use pgadmin3 on Windows to set comment on database coment_test with
---> string "für", while being connected to postgres (SQL_ASCII encoded)
#####

$ pg_dump -Fc comment_test -f comment_test.pgdump
postgres=# alter database comment_test rename to comment_test_orig;
ALTER DATABASE
#####

$ pg_restore -C -d template1 comment_test.pgdump
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1989; 1262 641528
COMMENT comment_test postgres
pg_restore: [archiver (db)] could not execute query: ERROR: invalid
byte sequence for encoding "UTF8": 0xfc
Command was: COMMENT ON DATABASE comment_test IS 'f¦r';
###

postgres@sz-pp-pg02-dev[~] $ pg_restore comment_test.pgdump
--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;

--
-- Name: comment_test; Type: COMMENT; Schema: -; Owner: postgres
--

COMMENT ON DATABASE comment_test IS 'f¦r';

--
-- Name: plpgsql; Type: EXTENSION; Schema: -; Owner:
--

CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog;

--
-- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner:
--

COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language';

--
-- Name: public; Type: ACL; Schema: -; Owner: postgres
--

REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM postgres;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO PUBLIC;

--
-- PostgreSQL database dump complete
--
#######

$ pg_restore comment_test.pgdump | head -16 | tail -1 |xxd
0000000: 434f 4d4d 454e 5420 4f4e 2044 4154 4142 COMMENT ON DATAB
0000010: 4153 4520 636f 6d6d 656e 745f 7465 7374 ASE comment_test
0000020: 2049 5320 2766 fc72 273b 0a IS 'f.r';.
######

postgres=# select
convert_to(shobj_description(oid,'pg_database'),'SQL_ASCII') from
pg_database where datname='comment_test_orig';
convert_to
------------
\x66fc72
(1 row)
######

postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
--------------------+-----------+-----------+------------------+------------------+-----------------------
comment_test | postgres | UTF8 | C | C |
[...]
postgres | postgres | SQL_ASCII | C | C |
template0 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | SQL_ASCII | C | C | postgres=CTc/postgres+
| | | | | =c/postgres

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Jan Lentfer
Jan.Lentfer@web.de
In reply to: Tatsuo Ishii (#2)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

Am 2016-01-27 01:45, schrieb Tatsuo Ishii:

I am not familiar with pgadmin3 but it seems a bug with pgadmin3 or
misuse of it to me. Probably when this:

---> use pgadmin3 on Windows to set comment on database coment_test
with
---> string "für", while being connected to postgres (SQL_ASCII
encoded)

was executed, the client encoding is not properly set to UTF-8 for
some reasons.

I think you are actually right. It seems this only happens if you
initially connect to a SQL_ASCII databse with pgadmin3 and then set a
comment containing special characters on a UTF-8 database. I could now
also reproduce this behaviour on a Linux host with Postgre 9.4.5.
And I can also confirm that this does not happen using psql.

I will forward this issue to the pgadmin mailing list.

Thanks and best regards

Jan

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Bruce Momjian
bruce@momjian.us
In reply to: Jan Lentfer (#3)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

On Wed, Jan 27, 2016 at 08:44:27AM +0100, Jan Lentfer wrote:

Am 2016-01-27 01:45, schrieb Tatsuo Ishii:

I am not familiar with pgadmin3 but it seems a bug with pgadmin3 or
misuse of it to me. Probably when this:

---> use pgadmin3 on Windows to set comment on database
coment_test with
---> string "f�r", while being connected to postgres (SQL_ASCII
encoded)

was executed, the client encoding is not properly set to UTF-8 for
some reasons.

I think you are actually right. It seems this only happens if you
initially connect to a SQL_ASCII databse with pgadmin3 and then set
a comment containing special characters on a UTF-8 database. I could
now also reproduce this behaviour on a Linux host with Postgre
9.4.5.
And I can also confirm that this does not happen using psql.

I will forward this issue to the pgadmin mailing list.

I see no one else commented on this. We have trouble keeping the global
system catalogs consistent when databases in the same cluster use
different encodings. I am not sure how we could improve this.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#4)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

On 2/16/16 10:38 AM, Bruce Momjian wrote:

I see no one else commented on this. We have trouble keeping the global
system catalogs consistent when databases in the same cluster use
different encodings. I am not sure how we could improve this.

Could we force the global catalogs to always be accessed via UTF8, at
least for modification? I suspect that would mean changing encodings on
the fly in the appropriate command functions (such as what's listed in
src/include/commands/user.h).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6Bruce Momjian
bruce@momjian.us
In reply to: Jim Nasby (#5)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

On Tue, Feb 16, 2016 at 01:36:24PM -0600, Jim Nasby wrote:

On 2/16/16 10:38 AM, Bruce Momjian wrote:

I see no one else commented on this. We have trouble keeping the global
system catalogs consistent when databases in the same cluster use
different encodings. I am not sure how we could improve this.

Could we force the global catalogs to always be accessed via UTF8,
at least for modification? I suspect that would mean changing
encodings on the fly in the appropriate command functions (such as
what's listed in src/include/commands/user.h).

I don't remember us favoring UTF8 in this way in the past.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#7John R Pierce
pierce@hogranch.com
In reply to: Bruce Momjian (#6)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

On 2/16/2016 12:06 PM, Bruce Momjian wrote:

On Tue, Feb 16, 2016 at 01:36:24PM -0600, Jim Nasby wrote:

On 2/16/16 10:38 AM, Bruce Momjian wrote:

I see no one else commented on this. We have trouble keeping the global
system catalogs consistent when databases in the same cluster use
different encodings. I am not sure how we could improve this.

Could we force the global catalogs to always be accessed via UTF8,
at least for modification? I suspect that would mean changing
encodings on the fly in the appropriate command functions (such as
what's listed in src/include/commands/user.h).

I don't remember us favoring UTF8 in this way in the past.

what encoding SHOULD the global catalogs be in when the databases are in
different encodings? I'm not sure there's any answer better than utf8 ?

--
john r pierce, recycling bits in santa cruz

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: John R Pierce (#7)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

On 2/16/16 2:10 PM, John R Pierce wrote:

Could we force the global catalogs to always be accessed via UTF8,
at least for modification? I suspect that would mean changing
encodings on the fly in the appropriate command functions (such as
what's listed in src/include/commands/user.h).

I don't remember us favoring UTF8 in this way in the past.

what encoding SHOULD the global catalogs be in when the databases are in
different encodings? I'm not sure there's any answer better than utf8 ?

You could probably make an argument for SQL_ASCII.

The part about all of this that concerns me is what happens when people
query global catalogs directly... I'm not sure if we have the ability to
force specific conversions when that happens.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#6)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

Bruce Momjian <bruce@momjian.us> writes:

On Tue, Feb 16, 2016 at 01:36:24PM -0600, Jim Nasby wrote:

Could we force the global catalogs to always be accessed via UTF8,
at least for modification? I suspect that would mean changing
encodings on the fly in the appropriate command functions (such as
what's listed in src/include/commands/user.h).

I don't remember us favoring UTF8 in this way in the past.

Yeah. I'm pretty sure the Far Eastern contingent has specifically lobbied
against giving UTF8 such a preference. Also, if a name in the shared
catalog is UTF8, what do you do when it cannot be converted to the local
database encoding? I don't think pretending the entry isn't there will do.

Perhaps a reasonable thing for now is to document that it's a bad idea
to put non-ASCII characters in names or comments of shared objects
(databases, roles, tablespaces) unless all databases of the cluster share
the same encoding. I don't know if it would be useful/practical to try
to mechanically enforce such a rule, but we could at least warn people
about the issue.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#4)
Re: Encoding problems with "COMMENT ON DATABASE .." causing pg_restore (and pg_upgrade) to fail

I see no one else commented on this. We have trouble keeping the global
system catalogs consistent when databases in the same cluster use
different encodings. I am not sure how we could improve this.

Long time ago I proposed an idea to solve this kind of problems:

/messages/by-id/20131112.155752.666523035722474275.t-ishii@sraoss.co.jp

The implementation details are not necessarily the best but still I
believe what we need is "round trip encoding conversion safe"
universal encoding for internal use.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs