Perl DBI converts UTF-8 again to UTF-8 before sending it to the server

Started by Matthias Apitzover 6 years ago7 messagesgeneral
Jump to latest
#1Matthias Apitz
guru@unixarea.de

Hello,

We're facing the problem that UTF-8 data to be INSERT'ed into a CHAR
column is converted again to UTF-8, assuming it's ISO. I have here a
small Perl program which can be used for testing:

#!/usr/local/bin/perl

use utf8;

my $PGDB = 'dbi:Pg:dbname=newsisis;host=127.0.0.1';
my $PGDB_USER = 'sisis';
my $SQL_INSERT = 'INSERT INTO dbctest (tstchar25, tstint) VALUES (?, ?)';

use DBI;

my $dbh = DBI->connect($PGDB, $PGDB_USER)
|| die "Couldn't connect to $PGDB as user $PGDB_USER: $DBI::errstr\n";

print "DBI is version $DBI::VERSION, DBD::Pg is version $DBD::Pg::VERSION\n";

$dbh->do("SET client_encoding TO UTF8");

$dbh->{pg_enable_utf8} = 1;

my $sth = $dbh->prepare( $SQL_INSERT )
|| die "Can't prepare insert statement $SQL_INSERT: $DBI::errstr";

my $text = "\xc3\xa4";
print "text: ".$text."\n";

$sth->execute($text, 1) or die $sth->errstr, "\n";

Running this, gives the following output:

$ ./utf8.pl
DBI is version 1.642, DBD::Pg is version 3.8.0
text: ä

$ ./utf8.pl | od -tx1
0000000 44 42 49 20 69 73 20 76 65 72 73 69 6f 6e 20 31
0000020 2e 36 34 32 2c 20 44 42 44 3a 3a 50 67 20 69 73
0000040 20 76 65 72 73 69 6f 6e 20 33 2e 38 2e 30 0a 74
0000060 65 78 74 3a 20 c3 a4 0a
^^^^^
(this shows that the var '$text' contains \xc3a4, an UTF-8 'ä'
(a-Umlaut).

If we now look into the table in hex we see:

$ printf "select tstchar25::bytea from dbctest ;\n" | psql -Usisis -dnewsisis
tstchar25
----------------------------------------------------------
\xc383c2a42020202020202020202020202020202020202020202020
(1 Zeile)

i.e. the 'ä' is converted again, like this cmd would do:

$ printf 'ä' | iconv -f iso-8859-1 -t utf-8 | od -tx1
0000000 c3 83 c2 a4

and ofc it's looking broken:

$ printf "select tstchar25 from dbctest ;\n" | psql -Usisis -dnewsisis
tstchar25
---------------------------
ä
(1 Zeile)

I watched the trafic between the client ./utf8.pl and the server with
strace and it's sent broken already to the server:

...
write(1, "text: \303\244\n", 9) = 9
sendto(3, "P\0\0\0G\0INSERT INTO dbctest (tstchar25, tstint) VALUES ($1, $2)\0\0\2\0\0\0\0\0\0\0\0B\0\0\0\33\0\0\0\0\0\2\0\0\0\4\303\203\302\244\0\0\0\0011\0\1\0\0D\0\0\0\6P\0E\0\0\0\t\0\0\0\0\0S\0\0\0\4", 122, MSG_NOSIGNAL, NULL, 0) = 122
...

see the sequence '\303\203\302\244' in octal values.

What is the problem with DBI? Thanks

matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

3. Oktober! Wir gratulieren! Der Berliner Fernsehturm wird 50
aus: https://www.jungewelt.de/2019/10-02/index.php

#2Christoph Moench-Tegeder
cmt@burggraben.net
In reply to: Matthias Apitz (#1)
Re: Perl DBI converts UTF-8 again to UTF-8 before sending it to the server

## Matthias Apitz (guru@unixarea.de):

my $text = "\xc3\xa4";
print "text: ".$text."\n";

Your output is lying to you:
you need a binmode(STDOUT, ':encoding(utf8)'), which will make this print
"ä", and a utf8::decode($text), after which you get "ä". And when you
pass that $text through DBD::Pg, it will still be an "ä". And when you
change $text to "ä", it still works. Most other combinations will
either fail with "\xc3\xa4" or "ä".

Welcome to modern perl's handling of utf8. Cue "Everything is Fine" meme.

Regards,
Christoph

--
Spare Space

#3Matthias Apitz
guru@unixarea.de
In reply to: Christoph Moench-Tegeder (#2)
Re: Perl DBI converts UTF-8 again to UTF-8 before sending it to the server

El día viernes, octubre 04, 2019 a las 04:29:32p. m. +0200, Christoph Moench-Tegeder escribió:

## Matthias Apitz (guru@unixarea.de):

my $text = "\xc3\xa4";
print "text: ".$text."\n";

Your output is lying to you:
you need a binmode(STDOUT, ':encoding(utf8)'), which will make this print
"ä", and a utf8::decode($text), after which you get "ä". And when you
pass that $text through DBD::Pg, it will still be an "ä". And when you
change $text to "ä", it still works. Most other combinations will
either fail with "\xc3\xa4" or "ä".

Thanks, but:

...
my $text = "ä \xc3\xa4";
print "text before STDOUT set to UTF-8: ".$text."\n";
binmode(STDOUT, ':encoding(utf8)');
print "text after STDOUT set to UTF-8: ".$text."\n";
utf8::decode($text);
print "text after utf8::decode(\$text): ".$text."\n";
$sth->execute($text, 1) or die $sth->errstr, "\n";

gives:

./utf8.pl
text before STDOUT set to UTF-8: ä
text after STDOUT set to UTF-8: ä ä
text after utf8::decode($text): ä ä
printf "select tstchar25::bytea from dbctest ;\n" | psql -Usisis -dnewsisis
tstchar25
------------------------------------------------------------
\xc3a420c383c2a4202020202020202020202020202020202020202020
(1 Zeile)

Welcome to modern perl's handling of utf8. Cue "Everything is Fine" meme.

Seems so :-(

matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

3. Oktober! Wir gratulieren! Der Berliner Fernsehturm wird 50
aus: https://www.jungewelt.de/2019/10-02/index.php

#4Christoph Moench-Tegeder
cmt@burggraben.net
In reply to: Matthias Apitz (#3)
Re: Perl DBI converts UTF-8 again to UTF-8 before sending it to the server

## Matthias Apitz (guru@unixarea.de):

my $text = "ä \xc3\xa4";

That will only work if you remove "use utf8". And then other stuff may
break.

Regards,
Christoph

--
Spare Space

#5Matthias Apitz
guru@unixarea.de
In reply to: Matthias Apitz (#3)
Re: Perl DBI converts UTF-8 again to UTF-8 before sending it to the server

Christoph,

May I come back to the UTF-8 problem, but now for the reading aspect:

I connect to the PG server with:

$dbh = DBI->connect($PGDB, $PGDB_USER, $PGDB_PASS,
{ pg_utf8_flag => 1,
pg_enable_utf8 => 1,
AutoCommit => 0,
RaiseError => 0,
PrintError => 0,
}
);

and do a SELECT for a column which contains UTF-8 data (I double checked
this with SQL and ::bytea):

$sth=$dbh->prepare(
"select d02name from d02ben where d02bnr = '00001048313'")
or die "parse error\n".$DBI::errstr."\n";

$sth->execute
or die "exec error\n".$DBI::errstr."\n";

but when I now fetch the first row with:

@row = $sth->fetchrow_array;
$HexStr = unpack("H*", $row[0]);
print "HexStr: " . $HexStr . "\n";
print "$row[0]\n";

The resulting column contains ISO data:

HexStr: 50e46461676f67697363686520486f6368736368756c65205765696e67617274656e2020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020
P<E4>dagogische Hochschule Weingarten

Following the man page of DBD::Pg the attribute pg_enable_utf8 => 1
should ensure that strings are returned from DBI with the UTF-8 flag
switched on. The server sends the string in UTF-8 as I can see with
strace:

...
recvfrom(3, "T\0\0\0 \0\1d02name\0\0\1\313\237\0\3\0\0\4\22\377\377\0\0\0|\0\0D\0\0\0\203\0\1\0\0\0yP\303\244dagogische Hochschule Weingarten C\0\0\0\rSELECT 1\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 185
write(1, "HexStr: 50e46461676f67697363686520486f6368736368756c65205765696e67617274656e2020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020\n", 249) = 249
write(1, "P\344dagogische Hochschule Weingarten

But why it gets translated to ISO?

Thanks for your help again.

matthias
--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

#6Christoph Moench-Tegeder
cmt@burggraben.net
In reply to: Matthias Apitz (#5)
Re: Perl DBI converts UTF-8 again to UTF-8 before sending it to the server

## Matthias Apitz (guru@unixarea.de):

but when I now fetch the first row with:

@row = $sth->fetchrow_array;
$HexStr = unpack("H*", $row[0]);
print "HexStr: " . $HexStr . "\n";
print "$row[0]\n";

The resulting column contains ISO data:

As expected: https://perldoc.perl.org/perluniintro.html
Specifically, if all code points in the string are 0xFF or less, Perl
uses the native eight-bit character set.

P<E4>dagogische Hochschule Weingarten

And then it doesn't know that your terminal expects UTF-8 (perl
just dumps the binary string here), because you didn't tell it:
"binmode(STDOUT, ':encoding(utf8)')" would fix that.
See: https://perldoc.perl.org/perlunifaq.html specifically "What if I
don't decode?", "What if I don't encode?" and "Is there a way to
automatically decode or encode?".

The whole PostgreSQL-DBI-UTF8-thingy is working: use "Tijl Müller"
as test data (that's the dutch "ij"-digraph in there, a character
decidedly not in "latin-9/15" and therefore not "0xFF or less").
That will break your "unpack('H*')" - it tries to unpack that wide
character into a hex byte and "Character in 'H' format wrapped in
unpack". Use "print(join(' ', unpack('U*', $row[0])))" to see that
the ij has codepoint 307 (decimal).

Regards,
Christoph

--
Spare Space

#7Daniel Verite
daniel@manitou-mail.org
In reply to: Christoph Moench-Tegeder (#6)
Re: Perl DBI converts UTF-8 again to UTF-8 before sending it to the server

Christoph Moench-Tegeder wrote:

And then it doesn't know that your terminal expects UTF-8 (perl
just dumps the binary string here), because you didn't tell it:
"binmode(STDOUT, ':encoding(utf8)')" would fix that.

Or use perl -C, so that it gets that from the environment.

From https://perldoc.perl.org/perlrun.html :

-C on its own (not followed by any number or option list), or the
empty string "" for the PERL_UNICODE environment variable, has the
same effect as -CSDL. In other words, the standard I/O handles and
the default open() layer are UTF-8-fied but only if the locale
environment variables indicate a UTF-8 locale.

Now for what the OP is doing, I'd suggest to use Dump() from the
Devel::Peek module instead of print.

To see the difference between a literal "ä" and "\xc3\xa4" from the
point of view of Perl:

use Devel::Peek;
use utf8;

$str = "\xc3\xa4";
Dump($str);

$str = "ä";
Dump($str);

Result:

SV = PV(0x55af63beeda0) at 0x55af63c185d0
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x55af63c3c230 "\303\244"\0
CUR = 2
LEN = 10
COW_REFCNT = 1
SV = PV(0x55af63beeda0) at 0x55af63c185d0
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
PV = 0x55af63c58dc0 "\303\244"\0 [UTF8 "\x{e4}"]
CUR = 2
LEN = 10
COW_REFCNT = 1

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite