Postgres, apps, special characters and UTF-8 encoding
Hi. I've got a recurring problem with character encoding for a
Postgres-based web PHP app, and am hoping someone can clue me in or at
least point me in the right direction. I'll confess upfront my
understanding of encoding issues is extremely limited. Here goes.
The app uses a Postgres database, UTF-8 encoded. Through their browsers,
users can add and edit records often including text. Most of the time this
works fine. Though sometimes this will fail with Postgres complaining, for
example, "Could query with ... , The error text was: ERROR: invalid byte
sequence for encoding "UTF8": 0xe9 0x20 0x67"
So this generally happens when people copy and paste things out of their
word documents and such.
As I understand it, those are likely encoded in something non-UTF-8, like
WIN-1251 or something. And that one way or another, the encoding needs to
be translated before it can be placed into the database. I'm not clear how
this is supposed to happen though. Automatically by the browser? Done in
the app? Some other way? And if in the app, how is one supposed to know
what the incoming encoding is?
Thanks in advance for any help or pointers.
Ken
--
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ <http://agency-software.org/>*
*https://agency-software.org/demo/client
<https://agency-software.org/demo/client>*
ken.tanzer@agency-software.org
(253) 245-3801
Subscribe to the mailing list
<agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.
On Tue, Mar 7, 2017 at 4:20 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
As I understand it, those are likely encoded in something non-UTF-8, like
WIN-1251 or something. And that one way or another, the encoding needs to
be translated before it can be placed into the database. I'm not clear how
this is supposed to happen though. Automatically by the browser? Done in
the app? Some other way? And if in the app, how is one supposed to know
what the incoming encoding is?
Haven't run into this problem personally, probably accidentally lucky, but
ISTM that you need to tell the browser what character set you are working
in. It is in the best position to mediate between the user and the server.
Ideally, on the server, you can examine HTTP headers to learn about the
incoming data charset/encoding (I may not be using these terms precisely
but you should get the idea).
Googling "html input field encoding" seems to provide a decent start.
Note that technically the data encoding issues can occur without HTML, its
really an HTTP layer thing, but the medium of use you care about is
HTTP/Browsers.
David J.
On 03/07/2017 03:20 PM, Ken Tanzer wrote:
Hi. I've got a recurring problem with character encoding for a
Postgres-based web PHP app, and am hoping someone can clue me in or at
least point me in the right direction. I'll confess upfront my
understanding of encoding issues is extremely limited. Here goes.The app uses a Postgres database, UTF-8 encoded. Through their
browsers, users can add and edit records often including text. Most of
the time this works fine. Though sometimes this will fail with Postgres
complaining, for example, "Could query with ... , The error text was:
ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20 0x67"So this generally happens when people copy and paste things out of their
word documents and such.As I understand it, those are likely encoded in something non-UTF-8,
like WIN-1251 or something. And that one way or another, the encoding
needs to be translated before it can be placed into the database. I'm
not clear how this is supposed to happen though. Automatically by the
browser? Done in the app? Some other way? And if in the app, how is
one supposed to know what the incoming encoding is?
I don't use PHP, but found this:
http://www.php.net/manual/en/function.mb-detect-encoding.php
and this:
http://php.net/manual/en/function.mb-convert-encoding.php
Thanks in advance for any help or pointers.
Ken
--
AGENCY Software
A Free Software data system
By and for non-profits
/http://agency-software.org//
/https://agency-software.org/demo/client/
ken.tanzer@agency-software.org <mailto:ken.tanzer@agency-software.org>
(253) 245-3801Subscribe to the mailing list
<mailto:agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.
--
Adrian Klaver
adrian.klaver@aklaver.com
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hi Ken,
On Tue, 2017-03-07 at 15:20 -0800, Ken Tanzer wrote:
Hi. I've got a recurring problem with character encoding for a
Postgres-based web PHP app, and am hoping someone can clue me in or
at least point me in the right direction. I'll confess upfront my
understanding of encoding issues is extremely limited. Here goes.The app uses a Postgres database, UTF-8 encoded. Through their
browsers, users can add and edit records often including text. Most
of the time this works fine. Though sometimes this will fail with
Postgres complaining, for example, "Could query with ... , The error
text was: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20
0x67"So this generally happens when people copy and paste things out of
their word documents and such.As I understand it, those are likely encoded in something non-UTF-8,
like WIN-1251 or something. And that one way or another, the
encoding needs to be translated before it can be placed into the
database. I'm not clear how this is supposed to happen though.
Automatically by the browser? Done in the app? Some other way? And
if in the app, how is one supposed to know what the incoming encoding
is?Thanks in advance for any help or pointers.
Ken
1) Make sure the text editor you use to create your pages, etc. uses
UTF-8 as its encoding. That way the file's BOM is set correctly.
2) Make sure your headers contain the following:-
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
or
<meta charset="UTF-8"/> which is HTML5, however the long version is
still recognised by HTML5.
I understand that some versions of IE have issues with correctly
determining the character set, so, unfortunately, you still have to
verify that user input is UTF-8 compatible.
HTH,
Rob
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 2017-03-08 00:20, Ken Tanzer wrote:
Hi. I've got a recurring problem with character encoding for a
Postgres-based web PHP app, and am hoping someone can clue me in or at
least point me in the right direction. I'll confess upfront my
understanding of encoding issues is extremely limited. Here goes.
And that one way or another, the encoding
needs to be translated before it can be placed into the database.
Ken
--
You don't really have to translate the encoding, because all parts of
the system
are capable of dealing with all encodings.
What you have to make sure that that they are indeed all working in the
same encoding.
You have to set the encoding of the HTML document, the database, and the
database connection
to the same encoding, like utf8. People tend to forget the "set names"
on the database connection,
which can make the database think you are sending latin1, but you are
really sending utf-8, and presto problemo.
Then the only problem left is that PHP doesn't do utf-8 very well
internally
so if you receive data from an UTF-8 page and want to substring etc
then you have to use the multibyte variants
of those functions. You could convert everything back to latin1 first,
but then
you might as well just do everything in latin1 in the first place.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Ken Tanzer wrote:
Hi. I've got a recurring problem with character encoding for a Postgres-based web PHP app, and am
hoping someone can clue me in or at least point me in the right direction. I'll confess upfront my
understanding of encoding issues is extremely limited. Here goes.The app uses a Postgres database, UTF-8 encoded. Through their browsers, users can add and edit
records often including text. Most of the time this works fine. Though sometimes this will fail with
Postgres complaining, for example, "Could query with ... , The error text was: ERROR: invalid byte
sequence for encoding "UTF8": 0xe9 0x20 0x67"So this generally happens when people copy and paste things out of their word documents and such.
As I understand it, those are likely encoded in something non-UTF-8, like WIN-1251 or something. And
that one way or another, the encoding needs to be translated before it can be placed into the
database. I'm not clear how this is supposed to happen though. Automatically by the browser? Done
in the app? Some other way? And if in the app, how is one supposed to know what the incoming
encoding is?Thanks in advance for any help or pointers.
The byte sequence 0xe9 0x20 0x67 means "é g" in ISO-8859-1 and WINDOWS-1252,
so I think that your setup is as follows:
- The PHP application gets data encoded in ISO-8859-1 or WINDOWS-1252
and tries to store it in a database.
- The PHP application has a database connection with client_encoding
set to UTF8.
Then the database thinks it gets UTF-8 and will choke if it gets something
different.
The solution:
- Make sure that your web application gets data in only one encoding.
- Set client_encoding to that encoding.
Yours,
Laurenz Albe
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general