Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

Started by Francisco Figueiredo Jr.about 15 years ago11 messagesgeneral
Jump to latest
#1Francisco Figueiredo Jr.
francisco@npgsql.org

Hi all!

I received a bug report about problems with identifiers.

If I do the following select:

select 1 as tést

The column name returned contains strange chars.

If I do:

select 'tést' as tést

the select value returned is ok but the column name is not. :(

Is there any configuration I need to change to specify the encoding of
the identifiers?

I'm using UTF-8 for my database.

Thanks in advance.

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

#2Andreas Kretschmer
akretschmer@spamfence.net
In reply to: Francisco Figueiredo Jr. (#1)
Re: Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:

Hi all!

I received a bug report about problems with identifiers.

If I do the following select:

select 1 as t�st

The column name returned contains strange chars.

If I do:

select 't�st' as t�st

the select value returned is ok but the column name is not. :(

Is there any configuration I need to change to specify the encoding of
the identifiers?

test=# select 't�st' as "t�st";
t�st
------
t�st

looks okay for me ...

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082�, E 13.56889�

#3Francisco Figueiredo Jr.
francisco@npgsql.org
In reply to: Andreas Kretschmer (#2)
Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

This is the query our user sent as one of the problems:

select "zurück" as zurück3_103_ from "Tabelle" tabellenhib0_

And here is the link to our bug report:

http://pgfoundry.org/tracker/index.php?func=detail&amp;aid=1010988&amp;group_id=1000140&amp;atid=590

I didn't test putting double quotes in the column name identifier as
the bug report isn't using it.

What happens if you remove the double quotes in the column name identifier?

Thanks in advance.

On Tue, Mar 15, 2011 at 15:21, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:

Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:

Hi all!

I received a bug report about problems with identifiers.

If I do the following select:

select 1 as tést

The column name returned contains strange chars.

If I do:

select 'tést' as tést

the select value returned is ok but the column name is not. :(

Is there any configuration I need to change to specify the encoding of
the identifiers?

test=# select 'tést' as "tést";
 tést
------
 tést

looks okay for me ...

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

#4Andreas Kretschmer
akretschmer@spamfence.net
In reply to: Francisco Figueiredo Jr. (#3)
Re: Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:

What happens if you remove the double quotes in the column name identifier?

the same:

test=*# select 't�st' as t�st;
t�st
------
t�st
(1 Zeile)

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082�, E 13.56889�

#5Francisco Figueiredo Jr.
francisco@npgsql.org
In reply to: Andreas Kretschmer (#4)
Re: [GENERAL] Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

Hmmmmmmmm,

What would change the encoding of the identifiers?

Because on my dev machine which unfortunately isn't with me right now
I can't get the identifier returned correctly :(

I remember that it returns:

test=*# select 'tést' as tést;
 tst
------
 tést

Is there any config I can change at runtime in order to have it
returned correctly?

Thanks in advance.

On Tue, Mar 15, 2011 at 15:45, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:

Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:

What happens if you remove the double quotes in the column name identifier?

the same:

test=*# select 'tést' as tést;
 tést
------
 tést
(1 Zeile)

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

#6Francisco Figueiredo Jr.
francisco@npgsql.org
In reply to: Francisco Figueiredo Jr. (#5)
Re: [GENERAL] Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

Now, I'm using my dev machine.

With the tests I'm doing, I can see the following:

If I use:

select 'seléct' as "seléct";

column name returns ok as expected.

If I do:

select 'seléct' as seléct;

This is the sequence of bytes I receive from postgresql:

byte1 - 115 UTF-8 for s
byte2 - 101 UTF-8 for e
byte3 - 108 UTF-8 for l
byte4 - 227
byte5 - 169
byte6 - 99 UTF-8 for c
byte7 - 116 UTF-8 for t

The problem lies in the byte4.
According to [1]http://en.wikipedia.org/wiki/UTF-8, the first byte defines how many bytes will compose
the UTF-8 char. the problem is that 227 encodes to a binary value of
1110 0011 and so, the UTF-8 decoder will think there are 3 bytes in
sequence when actually there are only 2! :( And this seems to be the
root of the problem for me.

For the select value the correct byte is returned:

byte1 - 115 UTF-8 for s
byte2 - 101 UTF-8 for e
byte3 - 108 UTF-8 for l
byte4 - 195
byte5 - 169
byte6 - 99 UTF-8 for c
byte7 - 116 UTF-8 for t

Where 195 is 1100 0011 which gives two bytes in sequence and the
decoder can decode this to the U+00E9 which is the char "é"

Do you think this can be related to my machine? I'm using OSX 10.6.6
and I compiled postgresql 9.0.1 from source code.

Thanks in advance.

[1]: http://en.wikipedia.org/wiki/UTF-8

On Tue, Mar 15, 2011 at 15:52, Francisco Figueiredo Jr.
<francisco@npgsql.org> wrote:

Hmmmmmmmm,

What would change the encoding of the identifiers?

Because on my dev machine which unfortunately isn't with me right now
I can't get the identifier returned correctly :(

I remember that it returns:

 test=*# select 'tést' as tést;
  tst
 ------
  tést

Is there any config I can change at runtime in order to have it
returned correctly?

Thanks in advance.

On Tue, Mar 15, 2011 at 15:45, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:

Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:

What happens if you remove the double quotes in the column name identifier?

the same:

test=*# select 'tést' as tést;
 tést
------
 tést
(1 Zeile)

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

#7Francisco Figueiredo Jr.
francisco@npgsql.org
In reply to: Francisco Figueiredo Jr. (#6)
Re: [GENERAL] Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

Any ideas??

Would it be possible that Postgresql would be using another encoding
for the identifiers when they aren't wrapped by double quotes?

On Tue, Mar 15, 2011 at 23:37, Francisco Figueiredo Jr.
<francisco@npgsql.org> wrote:

Now, I'm using my dev machine.

With the tests I'm doing, I can see the following:

If I use:

select 'seléct' as "seléct";

column name returns ok as expected.

If I do:

select 'seléct' as seléct;

This is the sequence of bytes I receive from postgresql:

byte1 - 115 UTF-8 for s
byte2 - 101 UTF-8 for e
byte3 - 108 UTF-8 for l
byte4 - 227
byte5 - 169
byte6 - 99 UTF-8 for c
byte7 - 116 UTF-8 for t

The problem lies in the byte4.
According to [1], the first byte defines how many bytes will compose
the UTF-8 char. the problem is that 227 encodes to a binary value of
1110 0011 and so, the UTF-8 decoder will think there are 3 bytes in
sequence when actually there are only 2! :( And this seems to be the
root of the problem for me.

For the select value the correct byte is returned:

byte1 - 115 UTF-8 for s
byte2 - 101 UTF-8 for e
byte3 - 108 UTF-8 for l
byte4 - 195
byte5 - 169
byte6 - 99 UTF-8 for c
byte7 - 116 UTF-8 for t

Where 195 is 1100 0011 which gives two bytes in sequence and the
decoder can decode this to the U+00E9 which is the char "é"

Do you think this can be related to my machine? I'm using OSX 10.6.6
and I compiled postgresql 9.0.1 from source code.

Thanks in advance.

[1] - http://en.wikipedia.org/wiki/UTF-8

On Tue, Mar 15, 2011 at 15:52, Francisco Figueiredo Jr.
<francisco@npgsql.org> wrote:

Hmmmmmmmm,

What would change the encoding of the identifiers?

Because on my dev machine which unfortunately isn't with me right now
I can't get the identifier returned correctly :(

I remember that it returns:

 test=*# select 'tést' as tést;
  tst
 ------
  tést

Is there any config I can change at runtime in order to have it
returned correctly?

Thanks in advance.

On Tue, Mar 15, 2011 at 15:45, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:

Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:

What happens if you remove the double quotes in the column name identifier?

the same:

test=*# select 'tést' as tést;
 tést
------
 tést
(1 Zeile)

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Francisco Figueiredo Jr. (#7)
Re: Re: [GENERAL] Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

"Francisco Figueiredo Jr." <francisco@npgsql.org> writes:

Would it be possible that Postgresql would be using another encoding
for the identifiers when they aren't wrapped by double quotes?

No. I'm betting this is a client-side bug ... but you haven't told us
what the client-side code is.

regards, tom lane

#9Francisco Figueiredo Jr.
francisco@npgsql.org
In reply to: Tom Lane (#8)
Re: [GENERAL] Re: [GENERAL] Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

Oh, sorry for that.

My client code is Npgsql. I pulled those bytes from a debugging session
directly from the network stream. I wanted to know what bytes npgsql was
receiving.

This is the method which reads the data:

public static String ReadString(Stream network_stream)
{
NpgsqlEventLog.LogMethodEnter(LogLevel.Debug, CLASSNAME, "ReadString");

List<byte> buffer = new List<byte>();
for (int bRead = network_stream.ReadByte(); bRead != 0; bRead =
network_stream.ReadByte())
{
if (bRead == -1)
{
throw new IOException();
}
else
{
buffer.Add((byte) bRead);
}
}

if (NpgsqlEventLog.Level >= LogLevel.Debug)
NpgsqlEventLog.LogMsg(resman, "Log_StringRead",
LogLevel.Debug, ENCODING_UTF8.GetString(buffer.ToArray()));

return ENCODING_UTF8.GetString(buffer.ToArray());
}

My database has encoding set to UTF-8 although my lc_collate is pt.BR.UTF-8
this lc setting my have cause some trouble?

I also have problems with psql client where the char doesn't appear at all.
Andreas could see the char though...

I hope it helps.

Thanks in advance.
--
Sent from my Android phone

Francisco Figueiredo Jr.
Npgsql lead developer
fxjr.blogspot.com
twitter.com/franciscojunior
Em 18/03/2011 01:29, "Tom Lane" <tgl@sss.pgh.pa.us> escreveu:

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Francisco Figueiredo Jr. (#9)
Re: Re: [GENERAL] Re: [GENERAL] Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

"Francisco Figueiredo Jr." <francisco@npgsql.org> writes:

My database has encoding set to UTF-8 although my lc_collate is pt.BR.UTF-8
this lc setting my have cause some trouble?

Hmmm ... actually, it strikes me that this may be a downcasing problem.
PG will try to feed an unquoted identifier through tolower(), and that
basically can't work on multibyte characters. Most implementations of
tolower() are smart enough to not change high-bit-set bytes in UTF8
locales, but maybe your platform's is not ...

regards, tom lane

#11Francisco Figueiredo Jr.
francisco@npgsql.org
In reply to: Tom Lane (#10)
Re: [GENERAL] Re: [GENERAL] Re: [GENERAL] Re: [GENERAL] Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...

Hmmmmm,

I'm using osx 10.6.6 and I compiled PG myself from source. Is there any
configure option or any library I may use to get the correct behavior? Is
there any runtime setting I can make to change this tolower() behavior,
maybe skip the call?

Thanks in advance.

--
Sent from my Android phone

Francisco Figueiredo Jr.
Npgsql lead developer
fxjr.blogspot.com
twitter.com/franciscojunior
Em 18/03/2011 22:01, "Tom Lane" <tgl@sss.pgh.pa.us> escreveu:

"Francisco Figueiredo Jr." <francisco@npgsql.org> writes:

My database has encoding set to UTF-8 although my lc_collate is

pt.BR.UTF-8

Show quoted text

this lc setting my have cause some trouble?

Hmmm ... actually, it strikes me that this may be a downcasing problem.
PG will try to feed an unquoted identifier through tolower(), and that
basically can't work on multibyte characters. Most implementations of
tolower() are smart enough to not change high-bit-set bytes in UTF8
locales, but maybe your platform's is not ...

regards, tom lane