Re: PostgreSQL and Unicode

Started by Tatsuo Ishii, over 25 years ago, 9 messages
#1Tatsuo Ishii
t-ishii@sra.co.jp

hackers:

I got the following mail:

I was just looking over the PostgreSQL 7.0 docs and noticed that there
doesn't seem to be any new features for Unicode support. I wanted to verify
if this is true?

Right now we have a database that must support many different languages.
This works ok when we use UTF8 but the problem is that we do not know how
many characters the text will be. I was hoping that PG7.0 would support true
Unicode (2 byte) instead of just UTF8. Do you know if there is any plan to
support plain Unicode?

I think supporting "true Unicode (2 byte)" (probably that means UCS-2)
is not that easy since it includes '\0'. We need to fix at least:

the parser
libpq
psql
all client programs ...
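
To illustrate the '\0' concern above, here is a minimal sketch (not PostgreSQL code) of what happens when NUL-terminated string handling meets UCS-2 data. The buffer names and the helper c_string_length are made up for this example, and big-endian byte order is assumed.

```c
#include <stddef.h>
#include <string.h>

/* "AB" in big-endian UCS-2 (assumed byte order): each ASCII letter
 * is preceded by a zero byte. */
const unsigned char ucs2_ab[] = { 0x00, 0x41, 0x00, 0x42 };

/* The same text in UTF-8 is just the ASCII bytes, with no embedded zero. */
const char utf8_ab[] = "AB";

/* Length as seen by any code that treats the buffer as a C string.
 * Hypothetical helper, used only for this illustration. */
size_t c_string_length(const void *buf)
{
    return strlen((const char *) buf);
}
```

Here c_string_length(ucs2_ab) stops at the very first byte, so any code path built on C strings would see an empty string. That is why the parser, libpq, psql and the other clients would all need changes.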

Another idea might be doing a conversion between UTF-8 and UCS-2
somewhere between frontend and backend. However we still need to fix:

libpq
psql
all client programs ...

in this case. Any idea?

By the way, does anybody know what's wrong with UTF-8? In my
understanding UTF-8 and UCS-2 are logically identical.
--
Tatsuo Ishii

#2Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#1)
RE: PostgreSQL and Unicode

My understanding of the problem with UTF8 is this. Functionally, it is
equivalent to UCS-2, that is, you can encode any Unicode character in UTF-8
that you could encode in UCS-2.
The problem we've run into is only related to Postgres. For example, we had
a field that was fixed at 20 characters. If we put in ASCII then we could
put in all 20 characters. If we put in UTF8 encoded Japanese then (depending
on which characters were used) we got about 3 UTF8 bytes for each
Japanese character. Aside from going from 20 characters to 7 (*problem #1*)
we also now have unpredictable behavior. Some characters, like Japanese,
had a 3:1 ratio when encoded. UTF8 can go as high as a 6:1 encoding ratio for
some languages (I don't know which offhand); this is *problem #2*. Finally,
as a side effect of this, the string was just truncated, so we sometimes got
only a partial UTF8 character in the database. This made the decoding
either fail or produce weird results (*problem #3*).
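
The gap between bytes and characters described above can be made concrete with a small sketch. The function name utf8_char_count is hypothetical and the input is assumed to be valid UTF-8. This is not how the backend counts characters, only an illustration of the 3:1 ratio.

```c
#include <stddef.h>
#include <string.h>

/* Count characters in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
    {
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a three-letter Japanese word such as "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E", strlen reports 9 bytes while utf8_char_count reports 3 characters. A limit that counts bytes therefore fills up three times faster than the column definition suggests.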

Yes, I have noticed this problem too. But don't we have the same problem
with UCS-2, with a 2:1 ratio, then? I think we should fix this in the
following way:
char(10) should mean 10 letters, not 10 bytes, no matter what
encoding we use
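
As a rough sketch of what "10 letters, not 10 bytes" could mean for truncation, the function below finds how many bytes the first max_chars characters occupy, so a cut never lands inside a multibyte sequence. The name utf8_clip_bytes is invented for this example, valid UTF-8 input is assumed, and this is not the fix that was eventually implemented.

```c
#include <stddef.h>

/* Return the number of bytes taken by the first max_chars characters of
 * the NUL-terminated UTF-8 string s (or by the whole string if it is
 * shorter). Assumes valid UTF-8. */
size_t utf8_clip_bytes(const char *s, size_t max_chars)
{
    size_t bytes = 0;
    size_t chars = 0;

    while (s[bytes] != '\0')
    {
        /* A non-continuation byte starts a new character. */
        if (((unsigned char) s[bytes] & 0xC0) != 0x80)
        {
            if (chars == max_chars)
                break;      /* the next character would exceed the limit */
            chars++;
        }
        bytes++;
    }
    return bytes;
}
```

Truncating at the returned offset keeps whole characters only, which avoids the partial-character problem (*problem #3*) reported above.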

I will tackle this problem for 7.1.

What do you think, Rainer? Are you still unhappy with the solution
above?
--
Tatsuo Ishii

#3Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#2)
RE: PostgreSQL and Unicode

This sounds good. I agree that char(x) should mean x letters, not x bytes.

If this could be done in 7.1 that would be great! That means about 2 weeks,
right?! ;-)

No no :-) you must be talking about 7.0.1. I think that fix would
introduce some data format incompatibility, which is not allowed in
minor version upgrades.

P.S. Can anyone point me to the right person to ask regarding a problem
we've been having with postmaster processes not going away. It seems to be
related to JDBC although I've heard of PHP people having similar problems.

Can you tell us more about the "postmaster processes not going away"
problem?
--
Tatsuo Ishii

#4Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#3)
RE: PostgreSQL and Unicode

Regarding the "postmaster processes not going away" problem...

We're developing a Java application that connects to a PostgreSQL db. During
our development process we debug the code and sometimes kill the program in
the middle of the run. Sometimes this means that an open Connection to the
database is not properly closed. Now, I realize that this is an unfriendly
thing to do to PG but I would think it would eventually recover. What
happens instead is that the postmaster/postgres process that was handling
that connection never terminates. I have seen processes that are more than 2
weeks old before we noticed and restarted postmaster manually.

The problem is that eventually PG runs out of connections and stops allowing
new ones.

So, is there a way to tell PG to timeout unused connections after some
specified time? I've looked through all the docs and could not find anything
like this. I realize that this is a difficult issue because if there is an
unresolved transaction what do you do with it. I guess all you could do is
roll it back.

Any other suggestions? If not, can I request this as a future feature?
Although our problems are happening during debugging, they could happen
during deployment given a hardware problem or, *gasp*, a bug in our code.

What about adding the KEEPALIVE option to the socket? It would take a
while to detect an orphaned socket, though.
--
Tatsuo Ishii

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#4)
Re: RE: PostgreSQL and Unicode

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

Regarding the "postmaster processes not going away" problem...

What about adding KEEPALIVE option to the socket?

Of course, since whatever OS he's using on the client side is too broken
to notice that the socket is orphaned and close it, it might be so
broken as to respond to the keepalive pings :-(. Still, it'd be an easy
thing to try...

Even though the stated case sounds more like an OS bug than anything
else, setting KEEPALIVE on our TCP connections is probably still a good
idea. If the client machine were to crash completely then it wouldn't
be reasonable to expect it to close the connection, and we'd want to
have some method of ensuring that the connected backend shuts down
eventually. KEEPALIVE seems sufficiently low-overhead (and easy to
implement) to be the right answer for this scenario.

regards, tom lane
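
For readers who want to try the KEEPALIVE idea outside the backend, here is a minimal standalone sketch using the standard BSD socket calls. The function names enable_keepalive and keepalive_enabled are made up for this example. It only switches the option on and reads it back; how long the OS waits before probing an idle connection is a platform setting not covered here.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Turn on SO_KEEPALIVE for an existing socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_keepalive(int sock)
{
    int on = 1;

    return setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Report whether SO_KEEPALIVE is set on the socket.
 * Returns 1 if enabled, 0 if disabled, -1 if getsockopt fails. */
int keepalive_enabled(int sock)
{
    int       value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

With the option set, the kernel probes an idle peer and reports an error on the socket if no reply comes back, which gives the server a chance to notice a crashed client and shut the backend down.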

#6Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#5)
Re: RE: PostgreSQL and Unicode

What about adding KEEPALIVE option to the socket?

Of course, since whatever OS he's using on the client side is too broken
to notice that the socket is orphaned and close it, it might be so
broken as to respond to the keepalive pings :-(. Still, it'd be an easy
thing to try...

Even though the stated case sounds more like an OS bug than anything
else, setting KEEPALIVE on our TCP connections is probably still a good
idea. If the client machine were to crash completely then it wouldn't
be reasonable to expect it to close the connection, and we'd want to
have some method of ensuring that the connected backend shuts down
eventually. KEEPALIVE seems sufficiently low-overhead (and easy to
implement) to be the right answer for this scenario.

OK. Here are patches against 7.0. BTW, does this break on some platforms
such as Windows NT or QNX4?

*** postgresql-7.0/src/backend/libpq/pqcomm.c.orig	Tue May 16 18:06:42 2000
--- postgresql-7.0/src/backend/libpq/pqcomm.c	Wed May 17 08:23:09 2000
***************
*** 375,381 ****
  		if (setsockopt(port->sock, pe->p_proto, TCP_NODELAY,
  					   &on, sizeof(on)) < 0)
  		{
! 			perror("postmaster: StreamConnection: setsockopt");
  			return STATUS_ERROR;
  		}
  	}
--- 375,387 ----
  		if (setsockopt(port->sock, pe->p_proto, TCP_NODELAY,
  					   &on, sizeof(on)) < 0)
  		{
! 			perror("postmaster: StreamConnection: setsockopt(TCP_NODELAY)");
! 			return STATUS_ERROR;
! 		}
! 		if (setsockopt(port->sock, SOL_SOCKET, SO_KEEPALIVE,
! 					   &on, sizeof(on)) < 0)
! 		{
! 			perror("postmaster: StreamConnection: setsockopt(SO_KEEPALIVE)");
  			return STATUS_ERROR;
  		}
  	}
#7Kardos, Dr. Andreas
kardos@repas-aeg.de
In reply to: Tatsuo Ishii (#4)
Re: RE: PostgreSQL and Unicode

This should be safe. Both NT and QNX4 support SO_KEEPALIVE.

-----Original Message-----
From: Tatsuo Ishii <t-ishii@sra.co.jp>
To: <tgl@sss.pgh.pa.us>
Cc: <t-ishii@sra.co.jp>; <rmager@vgkk.co.jp>; <ishii@postgresql.org>;
<hackers@postgresql.org>
Sent: Wednesday, May 17, 2000 03:07
Subject: Re: [HACKERS] RE: PostgreSQL and Unicode

What about adding KEEPALIVE option to the socket?

OK. Here are patches against 7.0. BTW, does this break on some platforms
such as Windows NT or QNX4?

#8Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Kardos, Dr. Andreas (#7)
Re: RE: PostgreSQL and Unicode

This should be safe. Both NT and QNX4 support SO_KEEPALIVE.

Thanks for the info. The fix will appear in 7.0.1.
--
Tatsuo Ishii

#9Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tatsuo Ishii (#6)
Re: RE: PostgreSQL and Unicode

Seems this is applied.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026