unexpected character used as group separator by to_char

Started by Vincent Veyronabout 5 years ago13 messagesgeneral
Jump to latest
#1Vincent Veyron
vv.lists@wanadoo.fr

Hi,

I'm having a problem with to_char() inserting unexpected characters as group separator.

For the record below, I wish to display 'credit' properly formated in an html form, using to_char(credit, 'FM999G990D00')

SELECT *, to_char(credit, 'FM999G990D00') as yo from tblprospect_balance where id_item = 33;

id_prospect | id_item | date_item | libelle | debit | credit | lettrage | id_facture | yo
-------------+---------+------------+---------+-------+---------+----------+------------+----------
385 | 33 | 2021-03-09 | yo man | 0.00 | 2345.10 | | 8 | 2 345,10

The numbers are properly displayed in the html form. However, updating the record requires the user to manually delete the space in '2 345,10', otherwise the last digit is lost. Typing numbers including a group separator using the space bar works fine.

I exported the record to a text file and inspected it with emacs in hexadecimal mode :

3338 3509 3333 0932 3032 312d 3033 2d30 385.33.2021-03-0
00000010: 3909 796f 206d 616e 0930 2e30 3009 3233 9.yo man.0.00.23
00000020: 3435 2e31 3009 0938 0932 e280 af33 3435 45.10..8.2...345
00000030: 2c31 300a ,10.

As you can see, the space in field 'libelle' (yo man) is different from the one in field 'yo' (2...345,10)

The difference is also apparent in a link built using the record's fields as parameters :

/base/prospect?id_prospect=385&balance=0&update_item=0&id_item=33&id_facture=8&date_item=2021-03-09&libelle=yo%20man&debit=0,00&credit=2%E2%80%AF345%2C10&lettrage=
^^^^^^^^^^^^^ ^^^^^^^^^^^^
What can I do to get a standard space as group separator for my numbers?

#### system information ########
I use postgresql 11.9 on Debian buster

My settings are :

show lc_numeric;
lc_numeric
-------------
fr_FR.UTF-8

locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

\d tblprospect_balance
Table "public.tblprospect_balance"
Column | Type | Collation | Nullable | Default
-------------+---------------+-----------+----------+------------------------------------------------------
id_prospect | integer | | not null |
id_item | integer | | not null | nextval('tblprospect_balance_id_item_seq'::regclass)
date_item | date | | not null | CURRENT_DATE
libelle | text | | |
debit | numeric(10,2) | | not null | 0
credit | numeric(10,2) | | not null | 0
lettrage | text | | |
id_facture | integer | | not null | 0

--
Bien à vous, Vincent Veyron

https://marica.fr/
Logiciel de gestion des sinistres assurances, des dossiers contentieux et des contrats pour le service juridique

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Vincent Veyron (#1)
Re: unexpected character used as group separator by to_char

On Tue, Mar 9, 2021 at 1:28 PM Vincent Veyron <vv.lists@wanadoo.fr> wrote:

For the record below, I wish to display 'credit' properly formated in an
html form, using to_char(credit, 'FM999G990D00')

^^^^^^^^^^^^^ ^^^^^^^^^^^^
What can I do to get a standard space as group separator for my numbers?

By using "G" you are giving up control and letting your locale settings
decide what gets output. You can continue to use to_char but take control
back by being explicit, or pass actual numbers into and out of the database
and let your front-end deal with presentation concerns. I suggest the
latter.

David J.

#3Vincent Veyron
vv.lists@wanadoo.fr
In reply to: David G. Johnston (#2)
Re: unexpected character used as group separator by to_char

On Tue, 9 Mar 2021 13:33:19 -0700
"David G. Johnston" <david.g.johnston@gmail.com> wrote:

By using "G" you are giving up control and letting your locale settings
decide what gets output. You can continue to use to_char but take control
back by being explicit, or pass actual numbers into and out of the database
and let your front-end deal with presentation concerns. I suggest the
latter.

Well, that was a fast answer :-)

Indeed, replacing "G" with \s works. However, the application is destined to be multi-lingual, and number formats vary widely between countries :

1,000.25 (US)
1'000,25 (CH)
1 000,25 (FR)
1.000,25 (GER)
etc...

So, I intended to follow this advice by Tom Lane :

https://www.postgresql-archive.org/GENERAL-setting-LC-NUMERIC-td1857521.html

Having a database with the proper lc_numeric setting for each country, and using to_char/to_number to manipulate numbers is much more appealing than writing my own parser in my front end. But this weird space is getting in my way.

--
Bien à vous, Vincent Veyron

https://marica.fr/
Logiciel de gestion des sinistres assurances, des dossiers contentieux et des contrats pour le service juridique

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vincent Veyron (#3)
Re: unexpected character used as group separator by to_char

Vincent Veyron <vv.lists@wanadoo.fr> writes:

Having a database with the proper lc_numeric setting for each country, and using to_char/to_number to manipulate numbers is much more appealing than writing my own parser in my front end. But this weird space is getting in my way.

The point here is that 'G' and related format codes act as specified by
your LC_MONETARY locale. If you don't like the results, you need to use
a different locale.

(I suppose you could also use regexp_replace to convert random forms
of whitespace to plain ASCII space.)

regards, tom lane

#5Rob Sargent
robjsargent@gmail.com
In reply to: Tom Lane (#4)
Re: unexpected character used as group separator by to_char

On 3/9/21 2:22 PM, Tom Lane wrote:

Vincent Veyron <vv.lists@wanadoo.fr> writes:

Having a database with the proper lc_numeric setting for each country, and using to_char/to_number to manipulate numbers is much more appealing than writing my own parser in my front end. But this weird space is getting in my way.

The point here is that 'G' and related format codes act as specified by
your LC_MONETARY locale. If you don't like the results, you need to use
a different locale.

(I suppose you could also use regexp_replace to convert random forms
of whitespace to plain ASCII space.)

regards, tom lane

I'm in David's camp on this one: there are plenty of currency format
conversion libraries from which to choose.

rjs

#6Vincent Veyron
vv.lists@wanadoo.fr
In reply to: Tom Lane (#4)
Re: unexpected character used as group separator by to_char

On Tue, 09 Mar 2021 16:22:07 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

The point here is that 'G' and related format codes act as specified by
your LC_MONETARY locale. If you don't like the results, you need to use
a different locale.

This is a numeric(10,2) type field though. I tried casting it to money type, with lc_monetary set to "fr_FR.UTF-8", same weird space

(I suppose you could also use regexp_replace to convert random forms
of whitespace to plain ASCII space.)

No dice. 'G' formatting looks like a whitespace, but is different (it appears to be slightly narrower when displayed in html, too) :

select regexp_replace(to_char(1234.56, 'FM999 990D00'), E'[\\s]', 'x');
regexp_replace
----------------
1x234,56
(1 row)

select regexp_replace(to_char(1234.56, 'FM999G990D00'), E'[\\s]', 'x');
regexp_replace
----------------
1 234,56
(1 row)

#7Rob Sargent
robjsargent@gmail.com
In reply to: Vincent Veyron (#6)
Re: unexpected character used as group separator by to_char

On 3/9/21 4:03 PM, Vincent Veyron wrote:

No dice. 'G' formatting looks like a whitespace, but is different (it appears to be slightly narrower when displayed in html, too) :

An "n-space", no?

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vincent Veyron (#6)
Re: unexpected character used as group separator by to_char

Vincent Veyron <vv.lists@wanadoo.fr> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

(I suppose you could also use regexp_replace to convert random forms
of whitespace to plain ASCII space.)

No dice. 'G' formatting looks like a whitespace, but is different (it appears to be slightly narrower when displayed in html, too) :

Huh ... do you have lc_monetary and lc_ctype set the same? The
latter is (I believe) what determines the behavior of regex \s.

In any case, you could force the issue with a pattern like '[\s\uNNNN]'
for whatever the code point of that character is.

regards, tom lane

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Vincent Veyron (#6)
Re: unexpected character used as group separator by to_char

On 2021-Mar-10, Vincent Veyron wrote:

On Tue, 09 Mar 2021 16:22:07 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

(I suppose you could also use regexp_replace to convert random forms
of whitespace to plain ASCII space.)

No dice. 'G' formatting looks like a whitespace, but is different (it
appears to be slightly narrower when displayed in html, too) :

That space (0xe280af) is U+202F, which appears to be used for French and
Mongolian languages (exclusively?). It is quite possible that in the
future some other language will end up using some different whitespace
character, possibly breaking any code you write today -- the use of
U+202F appears to be quite recent.

Maybe it'd be better to have the query return two fields: one destined
for printing, the other you can use to feed the database back.

--
�lvaro Herrera Valdivia, Chile
"Porque Kim no hac�a nada, pero, eso s�,
con extraordinario �xito" ("Kim", Kipling)

#10Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Vincent Veyron (#1)
Re: unexpected character used as group separator by to_char

On Tue, 2021-03-09 at 21:28 +0100, Vincent Veyron wrote:

I'm having a problem with to_char() inserting unexpected characters as group separator.

For the record below, I wish to display 'credit' properly formated in an html form, using to_char(credit, 'FM999G990D00')

SELECT *, to_char(credit, 'FM999G990D00') as yo from tblprospect_balance where id_item = 33;

id_prospect | id_item | date_item | libelle | debit | credit | lettrage | id_facture | yo
-------------+---------+------------+---------+-------+---------+----------+------------+----------
385 | 33 | 2021-03-09 | yo man | 0.00 | 2345.10 | | 8 | 2 345,10

The numbers are properly displayed in the html form. However, updating the record requires
the user to manually delete the space in '2 345,10', otherwise the last digit is lost.
Typing numbers including a group separator using the space bar works fine.

I exported the record to a text file and inspected it with emacs in hexadecimal mode :

3338 3509 3333 0932 3032 312d 3033 2d30 385.33.2021-03-0
00000010: 3909 796f 206d 616e 0930 2e30 3009 3233 9.yo man.0.00.23
00000020: 3435 2e31 3009 0938 0932 e280 af33 3435 45.10..8.2...345
00000030: 2c31 300a ,10.

As you can see, the space in field 'libelle' (yo man) is different from the one in field 'yo' (2...345,10)

What can I do to get a standard space as group separator for my numbers?

#### system information ########
I use postgresql 11.9 on Debian buster

My settings are :

show lc_numeric;
lc_numeric
-------------
fr_FR.UTF-8

That "space" is UNICODE U+202F ("Narrow No-Break Space"), and that's what your
C library thinks to be the correct group separator for the French language.

You can either replace the character:

SELECT replace(to_char(2345.10, 'FM999G990D00'), E'\u202F', ' ');

or you can choose a different locale that uses a different group separator
(not sure if that exists).

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#11Gavan Schneider
list.pg.gavan@pendari.org
In reply to: Alvaro Herrera (#9)
Re: unexpected character used as group separator by to_char

On 10 Mar 2021, at 16:24, Alvaro Herrera wrote:

That space (0xe280af) is U+202F, which appears to be used for French
and
Mongolian languages (exclusively?). It is quite possible that in the
future some other language will end up using some different whitespace
character, possibly breaking any code you write today -- the use of
U+202F appears to be quite recent.

Drifting off topic a little. That a proper code point for things that
will benefit from the whitespace but should still stay together.
Also it’s not that new, added in 1999 —
https://codepoints.net/U+202F
The other use case is between the number and its ISO symbol e.g.,
20 °C — and the non-breaking form is important here
As for who uses a thin space in money… it is a normal in European
notation — https://www.languageediting.com/format-numbers-eu-vs-us/
And the thin space is part of the international standard for breaking up
large numbers (from 1948), specifically no dots or commas should be used
in this role. The dot or comma is only to be used for the decimal point!
All of which just highlights problems in localisation and begs the
question whether there is an ISO locale setting — it should work
everywhere ;)

Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a
well-known solution to every human problem — neat, plausible, and
wrong.
— H. L. Mencken, 1920

#12Vincent Veyron
vv.lists@wanadoo.fr
In reply to: Alvaro Herrera (#9)
Re: unexpected character used as group separator by to_char

On Wed, 10 Mar 2021 02:24:57 -0300
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

That space (0xe280af) is U+202F, which appears to be used for French and
Mongolian languages (exclusively?)

On Tue, 09 Mar 2021 18:57:05 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

In any case, you could force the issue with a pattern like '[\s\uNNNN]'
for whatever the code point of that character is.

On Wed, 10 Mar 2021 09:41:19 +0100
Laurenz Albe <laurenz.albe@cybertec.at> wrote:

SELECT replace(to_char(2345.10, 'FM999G990D00'), E'\u202F', ' ');

Thank you all for the helpful answers, I can work with that.

Strange that to_char() and to_number() would use a different separator, though :

select to_number(to_char(1234.56, 'FM999G990D00'), 'FM999G990D00');
to_number
-----------
1234.5

select to_number(replace(to_char(1234.56, 'FM999G990D00'),E'\u202F', ' '), 'FM999G990D00');
to_number
-----------
1234.56

--
Bien à vous, Vincent Veyron

https://compta.libremen.com
Logiciel libre de comptabilité générale en partie double

#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Gavan Schneider (#11)
Re: unexpected character used as group separator by to_char

On 2021-Mar-10, Gavan Schneider wrote:

On 10 Mar 2021, at 16:24, Alvaro Herrera wrote:

That space (0xe280af) is U+202F, which appears to be used for French and
Mongolian languages (exclusively?). It is quite possible that in the
future some other language will end up using some different whitespace
character, possibly breaking any code you write today -- the use of
U+202F appears to be quite recent.

Drifting off topic a little. That a proper code point for things that will
benefit from the whitespace but should still stay together.
Also it’s not that new, added in 1999 — https://codepoints.net/U+202F

I probably got misled on this whole thing by these change proposals.
https://www.unicode.org/L2/L2019/19116-clarify-nnbsp.pdf
https://www.unicode.org/L2/L2020/20008-core-text.pdf
Apparently prior to this, they (?) had been using/recommending
THIN SPACE U+2009 as separator, which is not non-breaking.

Anyway, it reinforces my point that it's not impossible that some other
locale definition could use U+2009 when printing numbers, or even some
other kind of spacing entity in non-Latin languages etc. So I think
that for truly robust handling you should separate the thing you use for
display from the thing you use to talk to the database.

And the thin space is part of the international standard for breaking up
large numbers (from 1948), specifically no dots or commas should be used in
this role. The dot or comma is only to be used for the decimal point!

Interesting U+2014 EM DASH I didn't know this.

--
Álvaro Herrera Valdivia, Chile
"This is a foot just waiting to be shot" (Andrew Dunstan)