Multibyte still broken

Started by Michael Robinsonover 25 years ago15 messages
#1Michael Robinson
robinson@netrinsics.com

These are exerpts from a message from Tatsuo Ishii dated January 26, on
the subject of fragile code in the multibyte routines:

---- begin ----
Defensive programming saves the system but does not user. Once
corrupted data is stored in the system, it's totally useless for the
user anyway. What about validating data *before* inserting it into a
table?
---- end ----

---- begin ----

Here it is. With this patch, copy out should be happy even with the
wrong data. I'm not sure if it could be displayed correctly, though.

Thank you very much. However, I think even this is too optimistic:

! if (*s & 0x80)

Shouldn't it be something like:

if ((*s & 0x80) && (*(s+1) & 0x80))

Even though "\242\242\242\0" is an invalid EUC sequence, it still shouldn't be
allowed to break the software.

Thanks for the suggestion. More robust code is always good.
---- end ----

More robust code may always be good, but "good" apparently doesn't always go
into the tree. Imagine my surprise, while upgrading a production server
from 6.5.3 to 7.0, when the data dumped from the old database failed to load
into the new database (well, crashed the backend, to be specific).

Apparently the "validate your own damn data" sentiment of the first excerpt
above has prevailed, because, on inspection, the MB code is just as fragile
as it was five months ago.

I was forced to perform emergency repairs to my database dump file to fool a
non-multibyte 7.0 into accepting it. Since EUC_CN is compatible with
Latin-1, and since the benefits of multibyte are small compared to the
risks, I intend to stick with unibyte Postgres henceforth.

I would, though, recommend a warning in the "INSTALL" file along the lines of:

"WARNING: Use of improperly-encoded text with multi-byte support enabled
WILL lead to data corruption and/or loss. Do not enable multi-byte support
unless you intend to fully validate your own damn data."

-Michael Robinson

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Robinson (#1)
Re: Multibyte still broken

Michael Robinson <robinson@netrinsics.com> writes:

More robust code may always be good, but "good" apparently doesn't always go
into the tree. Imagine my surprise, while upgrading a production server
from 6.5.3 to 7.0, when the data dumped from the old database failed to load
into the new database (well, crashed the backend, to be specific).

Sounds like someone failed to follow through on applying that fix :-(.
Patches have been known to slip through the cracks before, especially
when not fully worked out and formally submitted.

I would, though, recommend a warning in the "INSTALL" file along the lines of:

"WARNING: Use of improperly-encoded text with multi-byte support enabled
WILL lead to data corruption and/or loss. Do not enable multi-byte support
unless you intend to fully validate your own damn data."

Sorry you had trouble, but the above seems uncalled-for. What would be
more productive is a submitted patch to apply against 7.0.

regards, tom lane

#3Mitch Vincent
mitch@huntsvilleal.com
In reply to: Michael Robinson (#1)
Great, big errors ... Again.

Ok, it appears the database backend is being stupid again, I'm getting all
kinds of errors.

From my PHP code, I'm getting this kind of error now:

heap_beginscan:
!RelationIsValid(relation) in classes/db_pgsql.inc on line 63
Database error: Invalid SQL: select
user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as
start,date_part('epoch',updated) as age from users_online where
user_id_hash='9de5085eac89de8be0169b6dc6801524'
PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) )
Session halted

I got "Sorry, too many clients already" -- a little while ago..

I'm running at debug level 3, but have nothing to show for it in the log
file except this :

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died
abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am going to terminate
your database system connection and exit.
Please reconnect to the database system and repeat your query.
/usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process
44639
/usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared memory
and semaphores
shmem_exit(0) [#0]

So.. Now that I've got it to break, any ideas on what I can do to figure out
why it's happening? This is still a 6.5.3 server.. (I weas going to upgrade
today believe it or not.)

- Mitch

"The only real failure is quitting."

#4The Hermit Hacker
scrappy@hub.org
In reply to: Mitch Vincent (#3)
Re: Great, big errors ... Again.

On Wed, 10 May 2000, Mitch Vincent wrote:

Ok, it appears the database backend is being stupid again, I'm getting all
kinds of errors.

From my PHP code, I'm getting this kind of error now:

heap_beginscan:
!RelationIsValid(relation) in classes/db_pgsql.inc on line 63
Database error: Invalid SQL: select
user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as
start,date_part('epoch',updated) as age from users_online where
user_id_hash='9de5085eac89de8be0169b6dc6801524'
PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) )
Session halted

I got "Sorry, too many clients already" -- a little while ago..

I'm running at debug level 3, but have nothing to show for it in the log
file except this :

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died
abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am going to terminate
your database system connection and exit.
Please reconnect to the database system and repeat your query.
/usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process
44639
/usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared memory
and semaphores
shmem_exit(0) [#0]

So.. Now that I've got it to break, any ideas on what I can do to figure out
why it's happening? This is still a 6.5.3 server.. (I weas going to upgrade
today believe it or not.)

My suggestions at this time, with v7.0 relesed (and assuming you aren't
using ident?) is to get the upgrade over and done with, so that we can
debug that. Its hard to spend the resources on debugging v6.5.3 when
a) you are planning on upgrading anyway and b) it might already be fixed
in v7.0 :(

At least if you can generate the errors with v7.0, its fresh in ppls minds
...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#5Mitch Vincent
mitch@huntsvilleal.com
In reply to: The Hermit Hacker (#4)
Re: Great, big errors ... Again.

Already going...

- Mitch

"The only real failure is quitting."

----- Original Message -----
From: The Hermit Hacker <scrappy@hub.org>
To: Mitch Vincent <mitch@huntsvilleal.com>
Cc: <pgsql-hackers@hub.org>
Sent: Wednesday, May 10, 2000 1:01 PM
Subject: Re: [HACKERS] Great, big errors ... Again.

On Wed, 10 May 2000, Mitch Vincent wrote:

Ok, it appears the database backend is being stupid again, I'm getting

all

kinds of errors.

From my PHP code, I'm getting this kind of error now:

heap_beginscan:
!RelationIsValid(relation) in classes/db_pgsql.inc on line 63
Database error: Invalid SQL: select
user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as
start,date_part('epoch',updated) as age from users_online where
user_id_hash='9de5085eac89de8be0169b6dc6801524'
PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) )
Session halted

I got "Sorry, too many clients already" -- a little while ago..

I'm running at debug level 3, but have nothing to show for it in the log
file except this :

NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died
abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am going to

terminate

your database system connection and exit.
Please reconnect to the database system and repeat your query.
/usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process
44639
/usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared

memory

and semaphores
shmem_exit(0) [#0]

So.. Now that I've got it to break, any ideas on what I can do to figure

out

why it's happening? This is still a 6.5.3 server.. (I weas going to

upgrade

today believe it or not.)

My suggestions at this time, with v7.0 relesed (and assuming you aren't
using ident?) is to get the upgrade over and done with, so that we can
debug that. Its hard to spend the resources on debugging v6.5.3 when
a) you are planning on upgrading anyway and b) it might already be fixed
in v7.0 :(

At least if you can generate the errors with v7.0, its fresh in ppls minds
...

Marc G. Fournier ICQ#7615664 IRC Nick:

Scrappy

Systems Administrator @ hub.org
primary: scrappy@hub.org secondary:

scrappy@{freebsd|postgresql}.org

Show quoted text
#6Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Michael Robinson (#1)
Re: Multibyte still broken

More robust code may always be good, but "good" apparently doesn't always go
into the tree. Imagine my surprise, while upgrading a production server
from 6.5.3 to 7.0, when the data dumped from the old database failed to load
into the new database (well, crashed the backend, to be specific).

Apparently the "validate your own damn data" sentiment of the first excerpt
above has prevailed, because, on inspection, the MB code is just as fragile
as it was five months ago.

I was forced to perform emergency repairs to my database dump file to fool a
non-multibyte 7.0 into accepting it. Since EUC_CN is compatible with
Latin-1, and since the benefits of multibyte are small compared to the
risks, I intend to stick with unibyte Postgres henceforth.

I would, though, recommend a warning in the "INSTALL" file along the lines of:

"WARNING: Use of improperly-encoded text with multi-byte support enabled
WILL lead to data corruption and/or loss. Do not enable multi-byte support
unless you intend to fully validate your own damn data."

Sorry for the problem. I forgot about issue:-<

What I'm thinking now to fix the problem you found is that doing data
validataion in the text/var/char input functions, rather than tweaking
the mb functions. If corrupted MB string was found, then call
elog(ERROR) to abort the transation. Will appear in 7.0.1 unless
someone objects.
--
Tatsuo Ishii

#7Michael Robinson
robinson@netrinsics.com
In reply to: Tom Lane (#2)
Re: Multibyte still broken

Tom Lane <tgl@sss.pgh.pa.us> writes:

Sorry you had trouble,

Trouble I had foreseen, which I had brought to the mailing list, and for
which I had provided an explanation and fix, by the way.

but the above seems uncalled-for. What would be
more productive is a submitted patch to apply against 7.0.

Actually, I think there are three separate issues here.

1. There was a potentially serious problem, and it didn't get on Bruce's
master TODO list, which is the authoritative reference for what in Postgres
needs to be fixed. I take responsibility for not seeing to it that this
was done.

2. By its nature, the multi-byte code is poorly overseen, and poorly
exercised. This is understandable, as the overwhelming majority of
Postgres installations don't want or need it. However, the result is that
it's effectively a "contrib"-quality component in the Postgres core.

3. The multi-byte support, as it currently exists, is founded on a particular
philosophy, one which I argue is not the most pragmatic. In the East Asia
that I know and love, multibyte support, as a rule, is an ugly hack on top
of unibyte tools and infrastructure. The exception is end-to-end Unicode
systems, which can be relied on to produce predictable data. However, I've
never encountered a native Simplified Chinese GB application in which it
was the least bit difficult to produce "illegal" code sequences. In a
real-world environment, with email attachments, cut-and-paste, and whatnot,
it's practically inevitable.

Thus, I cannot accept a situation where my database aborts on, or
otherwise rejects data that is produced and accepted by all other tools
in the work environment, and will stick with a unibyte build as long as
this is the case.

-Michael Robinson

#8Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Michael Robinson (#7)
Re: Multibyte still broken

What I'm thinking now to fix the problem you found is that doing data
validataion in the text/var/char input functions, rather than tweaking
the mb functions.

Could you explain why? I'd sure like it better if the mb code checked for
mb things.

It's simple:

Do not allow corrupted data imported into database.

To acomplish this, of course we have to call existing MB functions, though.
--
Tatsuo Ishii

#9Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Michael Robinson (#7)
Re: Multibyte still broken

Thus, I cannot accept a situation where my database aborts on, or
otherwise rejects data that is produced and accepted by all other tools
in the work environment, and will stick with a unibyte build as long as
this is the case.

We are planning on working on NATIONAL CHARACTER and CHARACTER SET
features for the next release, which would move MB into a supported
feature which can coexist with ascii in the standard installation.
Would you be interested in contributing to that, either with code or
simply with implementation ideas and testing? I know that Tatsuo is
planning on participating, and I'm looking forward to working on it
too, though as you point out those of us who can get by with ascii
text don't know how to exercise MB capabilities so I'll need
suggestions and help.

Regards.

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#10Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Michael Robinson (#7)
Re: Multibyte still broken

3. The multi-byte support, as it currently exists, is founded on a particular
philosophy, one which I argue is not the most pragmatic. In the East Asia
that I know and love, multibyte support, as a rule, is an ugly hack on top
of unibyte tools and infrastructure. The exception is end-to-end Unicode
systems, which can be relied on to produce predictable data. However, I've
never encountered a native Simplified Chinese GB application in which it
was the least bit difficult to produce "illegal" code sequences. In a
real-world environment, with email attachments, cut-and-paste, and whatnot,
it's practically inevitable.

I am supprised to hear that you have so poor quality tools that
produce illegal code sequences of Simplified Chinese. In Japan, as far
as I know, we never have such a low quality tools which generate
illegal Japanese charaters just because they are not accepted in the
market, even in the case of email attachments, or cut-and-past or
whatever. I would like to hear from one living in others countries
such as Taiwan or Korea about their situations.

Anyway, IMHO letting users know that they have corrupted data is much
better than making them believe that their data are ok.
--
Tatsuo Ishii

#11Michael Robinson
robinson@netrinsics.com
In reply to: Thomas Lockhart (#9)
Re: Multibyte still broken

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

We are planning on working on NATIONAL CHARACTER and CHARACTER SET
features for the next release, which would move MB into a supported
feature which can coexist with ascii in the standard installation.
Would you be interested in contributing to that, either with code or
simply with implementation ideas and testing?

It will be my pleasure to provide as much support as my circumstances permit.
Unfortunately, though, my circumstances lately don't permit a whole lot :-(.

-Michael

#12Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Michael Robinson (#7)
Re: Multibyte still broken

Tom Lane <tgl@sss.pgh.pa.us> writes:

Sorry you had trouble,

Trouble I had foreseen, which I had brought to the mailing list, and for
which I had provided an explanation and fix, by the way.

but the above seems uncalled-for. What would be
more productive is a submitted patch to apply against 7.0.

Actually, I think there are three separate issues here.

1. There was a potentially serious problem, and it didn't get on Bruce's
master TODO list, which is the authoritative reference for what in Postgres
needs to be fixed. I take responsibility for not seeing to it that this
was done.

Please tell me what to add.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#13Michael Robinson
robinson@netrinsics.com
In reply to: Tatsuo Ishii (#10)
Re: Multibyte still broken

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I am supprised to hear that you have so poor quality tools that
produce illegal code sequences of Simplified Chinese. In Japan, as far
as I know, we never have such a low quality tools which generate
illegal Japanese charaters just because they are not accepted in the
market, even in the case of email attachments, or cut-and-past or
whatever.

The problem is not that the tools produce "illegal characters". The problem
is that, as an EUC code, GB permits the coexistance of standard ascii
characters with double-byte hanzi characters. Furthermore, most Chinese
software is an operating-system "hack" on top of English-language software
based on a Latin-1 character set (the Chinese software market is underserved
compared to Japan, so we have to cope as best we can).

The result is that it is possible to, for example, insert a carriage return
or ASCII comma into the middle of a hanzi, which breaks the alignment for all
the hanzi on the rest of the line. It's also possible, in non-native Chinese
applications, to select one byte of a hanzi character in a cut or copy
operation.

So the problem is that the tools do not uniformly respect the integrity of
a double-byte hanzi character, but rather treat it as two individual Latin-1
characters.

The important point, though, is that all tools, whether native Chinese or
"hacked" English, accept the resulting invalid code sequences consistently,
robustly, and without complaint.

-Michael

#14Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Michael Robinson (#13)
Re: Multibyte still broken

I am going to need a confirmation from someone else before adding it to
the TODO list.

Bruce Momjian <pgman@candle.pha.pa.us> writes:

1. There was a potentially serious problem, and it didn't get on Bruce's
master TODO list, which is the authoritative reference for what in Postgres
needs to be fixed. I take responsibility for not seeing to it that this
was done.

Please tell me what to add.

TODO: Multibyte encoding/decoding routines need to handle invalid character
sequences safely (and, if possible, gracefully).

-Michael Robinson

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#15Michael Robinson
robinson@netrinsics.com
In reply to: Bruce Momjian (#12)
Re: Multibyte still broken

Bruce Momjian <pgman@candle.pha.pa.us> writes:

1. There was a potentially serious problem, and it didn't get on Bruce's
master TODO list, which is the authoritative reference for what in Postgres
needs to be fixed. I take responsibility for not seeing to it that this
was done.

Please tell me what to add.

TODO: Multibyte encoding/decoding routines need to handle invalid character
sequences safely (and, if possible, gracefully).

-Michael Robinson