Different encoding for string values and identifier strings? Or (select 'tést' as tést) returns different values for string and identifier...
Hi all!
I received a bug report about problems with identifiers.
If I do the following select:
select 1 as tést
The column name returned contains strange chars.
If I do:
select 'tést' as tést
the select value returned is ok but the column name is not. :(
Is there any configuration I need to change to specify the encoding of
the identifiers?
I'm using UTF-8 for my database.
Thanks in advance.
--
Regards,
Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior
Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:
Hi all!
I received a bug report about problems with identifiers.
If I do the following select:
select 1 as t�st
The column name returned contains strange chars.
If I do:
select 't�st' as t�st
the select value returned is ok but the column name is not. :(
Is there any configuration I need to change to specify the encoding of
the identifiers?
test=# select 't�st' as "t�st";
t�st
------
t�st
looks okay for me ...
Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082�, E 13.56889�
This is the query our user sent as one of the problems:
select "zurück" as zurück3_103_ from "Tabelle" tabellenhib0_
And here is the link to our bug report:
http://pgfoundry.org/tracker/index.php?func=detail&aid=1010988&group_id=1000140&atid=590
I didn't test putting double quotes in the column name identifier as
the bug report isn't using it.
What happens if you remove the double quotes in the column name identifier?
Thanks in advance.
On Tue, Mar 15, 2011 at 15:21, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:
Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:
Hi all!
I received a bug report about problems with identifiers.
If I do the following select:
select 1 as tést
The column name returned contains strange chars.
If I do:
select 'tést' as tést
the select value returned is ok but the column name is not. :(
Is there any configuration I need to change to specify the encoding of
the identifiers?test=# select 'tést' as "tést";
tést
------
téstlooks okay for me ...
Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
Regards,
Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior
Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:
What happens if you remove the double quotes in the column name identifier?
the same:
test=*# select 't�st' as t�st;
t�st
------
t�st
(1 Zeile)
Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082�, E 13.56889�
Hmmmmmmmm,
What would change the encoding of the identifiers?
Because on my dev machine which unfortunately isn't with me right now
I can't get the identifier returned correctly :(
I remember that it returns:
test=*# select 'tést' as tést;
tst
------
tést
Is there any config I can change at runtime in order to have it
returned correctly?
Thanks in advance.
On Tue, Mar 15, 2011 at 15:45, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:
Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:
What happens if you remove the double quotes in the column name identifier?
the same:
test=*# select 'tést' as tést;
tést
------
tést
(1 Zeile)Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
Regards,
Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior
Now, I'm using my dev machine.
With the tests I'm doing, I can see the following:
If I use:
select 'seléct' as "seléct";
column name returns ok as expected.
If I do:
select 'seléct' as seléct;
This is the sequence of bytes I receive from postgresql:
byte1 - 115 UTF-8 for s
byte2 - 101 UTF-8 for e
byte3 - 108 UTF-8 for l
byte4 - 227
byte5 - 169
byte6 - 99 UTF-8 for c
byte7 - 116 UTF-8 for t
The problem lies in the byte4.
According to [1]http://en.wikipedia.org/wiki/UTF-8, the first byte defines how many bytes will compose
the UTF-8 char. the problem is that 227 encodes to a binary value of
1110 0011 and so, the UTF-8 decoder will think there are 3 bytes in
sequence when actually there are only 2! :( And this seems to be the
root of the problem for me.
For the select value the correct byte is returned:
byte1 - 115 UTF-8 for s
byte2 - 101 UTF-8 for e
byte3 - 108 UTF-8 for l
byte4 - 195
byte5 - 169
byte6 - 99 UTF-8 for c
byte7 - 116 UTF-8 for t
Where 195 is 1100 0011 which gives two bytes in sequence and the
decoder can decode this to the U+00E9 which is the char "é"
Do you think this can be related to my machine? I'm using OSX 10.6.6
and I compiled postgresql 9.0.1 from source code.
Thanks in advance.
[1]: http://en.wikipedia.org/wiki/UTF-8
On Tue, Mar 15, 2011 at 15:52, Francisco Figueiredo Jr.
<francisco@npgsql.org> wrote:
Hmmmmmmmm,
What would change the encoding of the identifiers?
Because on my dev machine which unfortunately isn't with me right now
I can't get the identifier returned correctly :(I remember that it returns:
test=*# select 'tést' as tést;
tst
------
téstIs there any config I can change at runtime in order to have it
returned correctly?Thanks in advance.
On Tue, Mar 15, 2011 at 15:45, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:
What happens if you remove the double quotes in the column name identifier?
the same:
test=*# select 'tést' as tést;
tést
------
tést
(1 Zeile)Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general--
Regards,Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior
--
Regards,
Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior
Any ideas??
Would it be possible that Postgresql would be using another encoding
for the identifiers when they aren't wrapped by double quotes?
On Tue, Mar 15, 2011 at 23:37, Francisco Figueiredo Jr.
<francisco@npgsql.org> wrote:
Now, I'm using my dev machine.
With the tests I'm doing, I can see the following:
If I use:
select 'seléct' as "seléct";
column name returns ok as expected.
If I do:
select 'seléct' as seléct;
This is the sequence of bytes I receive from postgresql:
byte1 - 115 UTF-8 for s
byte2 - 101 UTF-8 for e
byte3 - 108 UTF-8 for l
byte4 - 227
byte5 - 169
byte6 - 99 UTF-8 for c
byte7 - 116 UTF-8 for tThe problem lies in the byte4.
According to [1], the first byte defines how many bytes will compose
the UTF-8 char. the problem is that 227 encodes to a binary value of
1110 0011 and so, the UTF-8 decoder will think there are 3 bytes in
sequence when actually there are only 2! :( And this seems to be the
root of the problem for me.For the select value the correct byte is returned:
byte1 - 115 UTF-8 for s
byte2 - 101 UTF-8 for e
byte3 - 108 UTF-8 for l
byte4 - 195
byte5 - 169
byte6 - 99 UTF-8 for c
byte7 - 116 UTF-8 for tWhere 195 is 1100 0011 which gives two bytes in sequence and the
decoder can decode this to the U+00E9 which is the char "é"Do you think this can be related to my machine? I'm using OSX 10.6.6
and I compiled postgresql 9.0.1 from source code.Thanks in advance.
[1] - http://en.wikipedia.org/wiki/UTF-8
On Tue, Mar 15, 2011 at 15:52, Francisco Figueiredo Jr.
<francisco@npgsql.org> wrote:Hmmmmmmmm,
What would change the encoding of the identifiers?
Because on my dev machine which unfortunately isn't with me right now
I can't get the identifier returned correctly :(I remember that it returns:
test=*# select 'tést' as tést;
tst
------
téstIs there any config I can change at runtime in order to have it
returned correctly?Thanks in advance.
On Tue, Mar 15, 2011 at 15:45, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:Francisco Figueiredo Jr. <francisco@npgsql.org> wrote:
What happens if you remove the double quotes in the column name identifier?
the same:
test=*# select 'tést' as tést;
tést
------
tést
(1 Zeile)Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general--
Regards,Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior--
Regards,Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior
--
Regards,
Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior
"Francisco Figueiredo Jr." <francisco@npgsql.org> writes:
Would it be possible that Postgresql would be using another encoding
for the identifiers when they aren't wrapped by double quotes?
No. I'm betting this is a client-side bug ... but you haven't told us
what the client-side code is.
regards, tom lane
Oh, sorry for that.
My client code is Npgsql. I pulled those bytes from a debugging session
directly from the network stream. I wanted to know what bytes npgsql was
receiving.
This is the method which reads the data:
public static String ReadString(Stream network_stream)
{
NpgsqlEventLog.LogMethodEnter(LogLevel.Debug, CLASSNAME, "ReadString");
List<byte> buffer = new List<byte>();
for (int bRead = network_stream.ReadByte(); bRead != 0; bRead =
network_stream.ReadByte())
{
if (bRead == -1)
{
throw new IOException();
}
else
{
buffer.Add((byte) bRead);
}
}
if (NpgsqlEventLog.Level >= LogLevel.Debug)
NpgsqlEventLog.LogMsg(resman, "Log_StringRead",
LogLevel.Debug, ENCODING_UTF8.GetString(buffer.ToArray()));
return ENCODING_UTF8.GetString(buffer.ToArray());
}
My database has encoding set to UTF-8 although my lc_collate is pt.BR.UTF-8
this lc setting my have cause some trouble?
I also have problems with psql client where the char doesn't appear at all.
Andreas could see the char though...
I hope it helps.
Thanks in advance.
--
Sent from my Android phone
Francisco Figueiredo Jr.
Npgsql lead developer
fxjr.blogspot.com
twitter.com/franciscojunior
Em 18/03/2011 01:29, "Tom Lane" <tgl@sss.pgh.pa.us> escreveu:
Import Notes
Resolved by subject fallback
"Francisco Figueiredo Jr." <francisco@npgsql.org> writes:
My database has encoding set to UTF-8 although my lc_collate is pt.BR.UTF-8
this lc setting my have cause some trouble?
Hmmm ... actually, it strikes me that this may be a downcasing problem.
PG will try to feed an unquoted identifier through tolower(), and that
basically can't work on multibyte characters. Most implementations of
tolower() are smart enough to not change high-bit-set bytes in UTF8
locales, but maybe your platform's is not ...
regards, tom lane
Hmmmmm,
I'm using osx 10.6.6 and I compiled PG myself from source. Is there any
configure option or any library I may use to get the correct behavior? Is
there any runtime setting I can make to change this tolower() behavior,
maybe skip the call?
Thanks in advance.
--
Sent from my Android phone
Francisco Figueiredo Jr.
Npgsql lead developer
fxjr.blogspot.com
twitter.com/franciscojunior
Em 18/03/2011 22:01, "Tom Lane" <tgl@sss.pgh.pa.us> escreveu:
"Francisco Figueiredo Jr." <francisco@npgsql.org> writes:
My database has encoding set to UTF-8 although my lc_collate is
pt.BR.UTF-8
Show quoted text
this lc setting my have cause some trouble?
Hmmm ... actually, it strikes me that this may be a downcasing problem.
PG will try to feed an unquoted identifier through tolower(), and that
basically can't work on multibyte characters. Most implementations of
tolower() are smart enough to not change high-bit-set bytes in UTF8
locales, but maybe your platform's is not ...regards, tom lane