Facing issue in using special characters
Hi all,
We are facing an issue with special characters. We are trying to insert records into a remote Postgres server, and our application is unable to do so because of errors.
It seems the issue is caused by special characters used in one of the fields of a row.
Regards
Tarkeshwar
On Thursday, March 14, 2019, M Tarkeshwar Rao <m.tarkeshwar.rao@ericsson.com>
wrote:
> We are facing an issue with special characters. We are trying to insert
> records into a remote Postgres server, and our application is unable to
> do so because of errors. It seems the issue is caused by special
> characters used in one of the fields of a row.
Emailing -general ONLY is both sufficient and polite. Providing more
detail, and ideally an example, is necessary.
David J.
This is not an issue for "hackers" nor "performance"; in fact, even for
"general" it isn't really an issue.
"Special characters" is actually nonsense.
When people complain about "special characters" they haven't thought
things through.
If you are unwilling to think things through and go step by step to make
sure you know what you are doing, then you will not get it and really
nobody can help you.
In my professional experience, people who complain about "special
characters" need to be cut loose or be given a chance (if they are
established employees who carry some weight). If a contractor complains
about "special characters" they need to be fired.
Understand charsets -- character set, code point, and encoding. Then
understand how encoding and string literals and "escape sequences" in
string literals might work.
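For instance, in PostgreSQL (a minimal sketch; the backslash and U&
escapes assume a UTF-8 database):

    -- Plain literals have no backslash escapes; only '' is special:
    SELECT 'O''Brien';

    -- E'' literals interpret backslash escape sequences:
    SELECT E'line 1\nline 2 \u20ac';    -- newline, then a euro sign

    -- U&'' literals take Unicode code-point escapes:
    SELECT U&'caf\00E9 \20AC';          -- café, then a euro sign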
Know that UNICODE today is the one standard, and there is no more need
for code-table switching. There is nothing special about a Hebrew alef
or a Greek lower-case alpha or a Latin A. Nor a hyphen, an en-dash, or
an em-dash. All these characters are in UNICODE. Yes, some Japanese
users complain that their variants of Chinese characters are unified
with the simplified-Chinese forms. But that's a font issue, not a
character code issue.
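You can see this for yourself (a sketch, assuming a UTF-8 database,
where chr() takes a Unicode code point):

    SELECT chr(1488) AS alef,     -- U+05D0 HEBREW LETTER ALEF
           chr(945)  AS alpha,    -- U+03B1 GREEK SMALL LETTER ALPHA
           chr(65)   AS latin_a,  -- U+0041 LATIN CAPITAL LETTER A
           chr(8211) AS en_dash,  -- U+2013 EN DASH
           chr(8212) AS em_dash;  -- U+2014 EM DASH

All five come back as ordinary characters from the same code space.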
7-bit ASCII is the first page of UNICODE, even in the UTF-8 encoding.
ISO Latin 1 (or Windows-1252, that special variant of ISO Latin 1) has
the same code points as UNICODE pages 0 and 1, but it is not compatible
with the UTF-8 encoding because of the way UTF-8 uses the 8th bit.
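You can watch this happen with convert_to(), which returns the bytes of
a string in a named encoding (assuming a UTF-8 database):

    -- é is code point U+00E9: one byte in Latin 1, two bytes in UTF-8,
    -- because UTF-8 uses the 8th bit to mark multi-byte sequences:
    SELECT convert_to('é', 'LATIN1');   -- \xe9
    SELECT convert_to('é', 'UTF8');     -- \xc3a9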
But none of this is likely your problem.
Your problem is about string literals in SQL, for example, or about the
configuration of your database (I always use initdb with --locale C and
--encoding UTF-8). Use UTF-8 in the database. Then all your issues are
about string literals in SQL and in Java and JSON and XML or whatever
you are using.
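If re-running initdb is not an option, roughly the same setup can be
had per database (a sketch; "appdb" is a made-up name):

    -- TEMPLATE template0 is required when the new encoding/locale
    -- differs from what template1 was created with:
    CREATE DATABASE appdb
        ENCODING 'UTF8'
        LC_COLLATE 'C'
        LC_CTYPE 'C'
        TEMPLATE template0;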
You have to do the right thing. If you produce any representation,
whether that is XML or JSON or SQL or URL query parameters, or a CSV
file, or anything at all, you need to escape your string values properly.
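For the SQL case, PostgreSQL can do the escaping for you (a sketch,
assuming a UTF-8 database; "t" and "v" are made-up names), and
parameterized statements in your client avoid hand-built SQL entirely:

    -- Let the server quote values instead of pasting them in by hand:
    SELECT quote_literal(E'O''Brien paid 100\u20ac');
    -- => 'O''Brien paid 100€'

    SELECT format('INSERT INTO t(v) VALUES (%L)', E'it''s a caf\u00e9');
    -- => INSERT INTO t(v) VALUES ('it''s a café')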
This question with no detail didn't deserve such a thorough answer, but
it's my soap box. I do not accept people complaining about "special
characters". My own people get that same sermon from me when they make
that mistake.
-Gunther
On 3/15/2019 1:19, M Tarkeshwar Rao wrote:
> Hi all,
>
> We are facing an issue with special characters. We are trying to insert
> records into a remote Postgres server, and our application is unable to
> do so because of errors. It seems the issue is caused by special
> characters used in one of the fields of a row.
>
> Regards
> Tarkeshwar
On 3/15/19 11:59 AM, Gunther wrote:
> This is not an issue for "hackers" nor "performance"; in fact, even for
> "general" it isn't really an issue.
As long as it's already been posted, may as well make it something
helpful to find in the archive.
> Understand charsets -- character set, code point, and encoding. Then
> understand how encoding and string literals and "escape sequences" in
> string literals might work.
Good advice for sure.
> Know that UNICODE today is the one standard, and there is no more need [...]
I wasn't sure from the question whether the original poster was in
a position to choose the encoding of the database. Lots of things are
easier if it can be set to UTF-8 these days, but perhaps it's a legacy
situation.
Maybe a good start would be to run

    SHOW server_encoding;
    SHOW client_encoding;

and then hit the internet and look up what that encoding (or those
encodings, if different) can and can't represent, and go from there.
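If they differ, and the client side really does send something other
than what client_encoding claims, telling the server the truth is often
the whole fix (a sketch; LATIN1 is just an example):

    -- The server converts between client and server encodings
    -- automatically, but only if client_encoding is set truthfully:
    SET client_encoding TO 'LATIN1';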
It's worth knowing that, when the server encoding isn't UTF-8,
PostgreSQL will have the obvious limitations entailed by that,
but also some non-obvious ones that may be surprising, e.g. [1].
-Chap
[1]: /messages/by-id/CA+TgmobUp8Q-wcjaKvV=sbDcziJoUUvBCB8m+_xhgOV4DjiA1A@mail.gmail.com
Many of us have faced character encoding issues because we are not in control of our input sources and made the common assumption that UTF-8 covers everything.
In my lab, as an example, some of our social media posts have included ZawGyi Burmese character sets rather than Unicode Burmese. (Because Myanmar developed its technology in an environment closed to the world, it made up its own non-standard character set, which is still very common on mobile phones.) We had fully tested the app with Unicode Burmese, but honestly didn't know ZawGyi was even a thing that we would see in our dataset. We've also had problems with non-Unicode word separators in Arabic.
What we've found helpful is to view the troubling text in a hex editor and determine which non-standard characters may be causing the problem.
Some data conversion may be necessary before insertion. But the first step is knowing WHICH characters are causing the issue.
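The same inspection can be done without leaving the database (a sketch;
"posts" and "body" are hypothetical names):

    -- Show rows containing multibyte (non-ASCII) characters, with the
    -- text rendered as hex so the exact bytes are visible:
    SELECT id,
           encode(convert_to(body, 'UTF8'), 'hex') AS utf8_hex
    FROM posts
    WHERE octet_length(body) <> char_length(body);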
On 2019-03-17 15:01:40 +0000, Warner, Gary, Jr wrote:
> Many of us have faced character encoding issues because we are not in
> control of our input sources and made the common assumption that UTF-8
> covers everything.
UTF-8 covers "everything" in the sense that there is a round-trip from
each character in every commonly-used charset/encoding to Unicode and
back.
The actual code may of course be different. For example, the € sign is
0xA4 in iso-8859-15, but U+20AC in Unicode. So you need an
encoding/decoding step.
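In PostgreSQL terms, iso-8859-15 is called LATIN9, and that step looks
like this (assuming a UTF-8 database):

    SELECT convert_to(U&'\20AC', 'LATIN9');         -- => \xa4
    SELECT convert_to(U&'\20AC', 'UTF8');           -- => \xe282ac
    SELECT convert_from('\xa4'::bytea, 'LATIN9');   -- => €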
And "commonly-used" means just that. Unicode covers a lot of character
sets, but it can't cover every character set ever invented (I invented
my own character sets when I was sixteen. Nobody except me ever used
them and they have long succumbed to bit rot).
> In my lab, as an example, some of our social media posts have included
> ZawGyi Burmese character sets rather than Unicode Burmese. (Because
> Myanmar developed its technology in an environment closed to the world,
> it made up its own non-standard character set, which is still very
> common on mobile phones.)
I'd be surprised if there was a character set which is "very common in
Mobile phones", even in a relatively poor country like Myanmar. Does
ZawGyi actually include characters which aren't in Unicode, or are they
just encoded differently?
hp
--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>