Fix number skipping in to_number

Started by Oliver Fordalmost 9 years ago18 messageshackers

ojford@gmail.com

almost 9 years ago

Prevents an issue where numbers can be skipped in the to_number()
function when the format mask contains a "G" or a "," but the input
string doesn't contain a separator. This resolves the TODO item "Fix
to_number() handling for values not matching the format string".

== Change ==

Currently, if the format mask in to_number() has a "G" or a ",", it
will assume that the input string contains a separator character in
the same place. If however a number is there instead, that number will
be silently skipped and not appear in the output. So we get:

select to_number('34,50','999,99');
to_number
-----------
340
(1 row)

This patch checks the input string when it encounters a "G" or "," in
the format mask. If the separator character is found, the code moves
over it as normal. If it's not found, then the code no longer
increments the relevant pointer so as not to skip the character. After
the patch, we get the correct result:

select to_number('34,50','999,99');
to_number
-----------
3450
(1 row)

This is in line with Oracle's result.

== Rationale ==

This patch is a small change, which leaves PostgreSQL behavior
different from Oracle behavior in related cases. Oracle's
implementation seems to read from right-to-left, and raises an
"ORA-01722: invalid number" error if there are digits in the input
string which don't have corresponding characters in the format mask. I
have chosen not to throw such errors, because there are use cases for
only returning part of a number string. For example, the following is
an error on Oracle:

select to_number('123,000', '999G') from dual;
Error report -
SQL Error: ORA-01722: invalid number

But if you wanted to only take the characters before the comma, and
discard the thousands part, you can do so on PostgreSQL with:

select to_number('123,000', '999G');
to_number
-----------
123
(1 row)

This is the current behavior. Which is why I think it makes more sense
to do what PostgreSQL currently does and read from left-to-right. The
only change, as mentioned above, is that the current behavior can skip
a digit:

select to_number('123456', '999G999');
to_number
-----------
12356
(1 row)

After the patch, this returns all the digits:

select to_number('123456', '999G999');
to_number
-----------
123456
(1 row)

== Testing ==

Tested on Windows with MinGW using the latest checkout from master.
Added regression tests to check for this new behavior. All existing
tests pass.

Thomas Munro

thomas.munro@gmail.com

almost 9 years ago

In reply to: Oliver Ford (#1)

Re: Fix number skipping in to_number

On Thu, Aug 10, 2017 at 10:21 PM, Oliver Ford <ojford@gmail.com> wrote:

Prevents an issue where numbers can be skipped in the to_number()
function when the format mask contains a "G" or a "," but the input
string doesn't contain a separator. This resolves the TODO item "Fix
to_number() handling for values not matching the format string".

From the TODO I found the previous discussion here:

/messages/by-id/be46a4f30909212157o71dc82bep7e074f9fa7eb1d14@mail.gmail.com

There were some votes against changing it and some votes for. I don't
see what's good about the current behaviour. It's hard (though not
impossible) to imagine that someone is depending on that weirdness,
and it's harder to believe that anyone wants that behaviour knowingly.
For people migrating from Oracle or writing application portable to
both, the current behaviour sucks. It's not a SQL standard feature,
it's a be-like-Oracle feature, so it should be like Oracle. Aside
from that, by the principle of least astonishment alone I think it's
pretty clear that "a separator goes here" shouldn't mean "eat a
digit".

So +1 from me.

== Change ==

Currently, if the format mask in to_number() has a "G" or a ",", it
will assume that the input string contains a separator character in
the same place. If however a number is there instead, that number will
be silently skipped and not appear in the output. So we get:

select to_number('34,50','999,99');
to_number
-----------
340
(1 row)

This patch checks the input string when it encounters a "G" or "," in
the format mask. If the separator character is found, the code moves
over it as normal. If it's not found, then the code no longer
increments the relevant pointer so as not to skip the character. After
the patch, we get the correct result:

select to_number('34,50','999,99');
to_number
-----------
3450
(1 row)

This is in line with Oracle's result.

In other words the separators are completely ignored. Presumably the
only reason you'd have them at all is so that you could use the same
format string in to_number() and to_char() calls to do round-trip
conversions. That seems reasonable to me.

== Rationale ==

This patch is a small change, which leaves PostgreSQL behavior
different from Oracle behavior in related cases. Oracle's
implementation seems to read from right-to-left, and raises an
"ORA-01722: invalid number" error if there are digits in the input
string which don't have corresponding characters in the format mask. I
have chosen not to throw such errors, because there are use cases for
only returning part of a number string. For example, the following is
an error on Oracle:

select to_number('123,000', '999G') from dual;
Error report -
SQL Error: ORA-01722: invalid number

But if you wanted to only take the characters before the comma, and
discard the thousands part, you can do so on PostgreSQL with:

select to_number('123,000', '999G');
to_number
-----------
123
(1 row)

I can see that it's hard to change this one, because it's more likely
to break existing applications that were written by people who didn't
think of the format string as limiting the maximum size of the number.
I think someone could argue for changing that too, but I agree with
your choice for this patch.

== Testing ==

Tested on Windows with MinGW using the latest checkout from master.
Added regression tests to check for this new behavior. All existing
tests pass.

The tests you added pass for me but the int8 test now fails with the
following (this is from regression.diff after running
'PG_REGRESS_DIFF_OPTS="-dU10" make check'). It looks like some new
whitespace is appearing on the right in the output of to_char(). I
didn't try to understand why.

@@ -453,34 +453,34 @@
 ------------------+------------------
  4567890123456789 | 4567890123456789
 (1 row)

 -- TO_CHAR()
 --
 SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
to_char(q2, '9,999,999,999,999,999')
        FROM INT8_TBL;
  to_char_1 |        to_char         |        to_char
 -----------+------------------------+------------------------
-           |                    123 |                    456
+           |                    123 |               456
            |                    123 |  4,567,890,123,456,789
-           |  4,567,890,123,456,789 |                    123
+           |  4,567,890,123,456,789 |               123
            |  4,567,890,123,456,789 |  4,567,890,123,456,789
            |  4,567,890,123,456,789 | -4,567,890,123,456,789
 (5 rows)

 SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
to_char(q2, '9,999,999,999,999,999.999,999')
        FROM INT8_TBL;
  to_char_2 |            to_char             |            to_char
 -----------+--------------------------------+--------------------------------
-           |                    123.000,000 |                    456.000,000
+           |                    123.000,000 |               456.000,000
            |                    123.000,000 |  4,567,890,123,456,789.000,000
-           |  4,567,890,123,456,789.000,000 |                    123.000,000
+           |  4,567,890,123,456,789.000,000 |               123.000,000
            |  4,567,890,123,456,789.000,000 |  4,567,890,123,456,789.000,000
            |  4,567,890,123,456,789.000,000 | -4,567,890,123,456,789.000,000
 (5 rows)

SELECT '' AS to_char_3, to_char( (q1 * -1), '9999999999999999PR'),
to_char( (q2 * -1), '9999999999999999.999PR')
FROM INT8_TBL;
to_char_3 | to_char | to_char
-----------+--------------------+------------------------
| <123> | <456.000>
| <123> | <4567890123456789.000>

======================================================================

One superficial comment after first glimpse at the patch:

+ if(!strncmp(Np->inout_p, Np->L_thousands_sep, separator_len))

I believe the usual coding around here would be if (strncmp(...) == 0)

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Oliver Ford

ojford@gmail.com

almost 9 years ago

In reply to: Thomas Munro (#2)

Re: Fix number skipping in to_number

@@ -453,34 +453,34 @@
------------------+------------------
4567890123456789 | 4567890123456789
(1 row)

-- TO_CHAR()
--
SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
to_char(q2, '9,999,999,999,999,999')
FROM INT8_TBL;
to_char_1 |        to_char         |        to_char
-----------+------------------------+------------------------
-           |                    123 |                    456
+           |                    123 |               456
|                    123 |  4,567,890,123,456,789
-           |  4,567,890,123,456,789 |                    123
+           |  4,567,890,123,456,789 |               123
|  4,567,890,123,456,789 |  4,567,890,123,456,789
|  4,567,890,123,456,789 | -4,567,890,123,456,789
(5 rows)

SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
to_char(q2, '9,999,999,999,999,999.999,999')
FROM INT8_TBL;
to_char_2 |            to_char             |            to_char
-----------+--------------------------------+--------------------------------
-           |                    123.000,000 |                    456.000,000
+           |                    123.000,000 |               456.000,000
|                    123.000,000 |  4,567,890,123,456,789.000,000
-           |  4,567,890,123,456,789.000,000 |                    123.000,000
+           |  4,567,890,123,456,789.000,000 |               123.000,000
|  4,567,890,123,456,789.000,000 |  4,567,890,123,456,789.000,000
|  4,567,890,123,456,789.000,000 | -4,567,890,123,456,789.000,000
(5 rows)

That's strange, I can't replicate that issue on my Windows build. I've
tried with 'PG_REGRESS_DIFF_OPTS="-dU10" make check' and all the tests
(including int8) pass fine. The spacing in the results is perfectly
normal. I wonder if something else on your build is causing this? I've
also tried several "make check" options for different locales
mentioned in the docs and they pass fine.

One superficial comment after first glimpse at the patch:

+ if(!strncmp(Np->inout_p, Np->L_thousands_sep, separator_len))

I believe the usual coding around here would be if (strncmp(...) == 0)

Yes you're right, that is the coding standard. I've changed it to that
in the attached v2.

Thomas Munro

thomas.munro@gmail.com

almost 9 years ago

In reply to: Oliver Ford (#3)

Re: Fix number skipping in to_number

On Thu, Aug 17, 2017 at 9:48 PM, Oliver Ford <ojford@gmail.com> wrote:

The tests you added pass for me but the int8 test now fails with the
following (this is from regression.diff after running
'PG_REGRESS_DIFF_OPTS="-dU10" make check'). It looks like some new
whitespace is appearing on the right in the output of to_char(). I
didn't try to understand why.
@@ -453,34 +453,34 @@
------------------+------------------
4567890123456789 | 4567890123456789
(1 row)
-- TO_CHAR()
--
SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
to_char(q2, '9,999,999,999,999,999')
FROM INT8_TBL;
to_char_1 |        to_char         |        to_char
-----------+------------------------+------------------------
-           |                    123 |                    456
+           |                    123 |               456
|                    123 |  4,567,890,123,456,789
-           |  4,567,890,123,456,789 |                    123
+           |  4,567,890,123,456,789 |               123
|  4,567,890,123,456,789 |  4,567,890,123,456,789
|  4,567,890,123,456,789 | -4,567,890,123,456,789
(5 rows)
SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
to_char(q2, '9,999,999,999,999,999.999,999')
FROM INT8_TBL;
to_char_2 |            to_char             |            to_char
-----------+--------------------------------+--------------------------------
-           |                    123.000,000 |                    456.000,000
+           |                    123.000,000 |               456.000,000
|                    123.000,000 |  4,567,890,123,456,789.000,000
-           |  4,567,890,123,456,789.000,000 |                    123.000,000
+           |  4,567,890,123,456,789.000,000 |               123.000,000
|  4,567,890,123,456,789.000,000 |  4,567,890,123,456,789.000,000
|  4,567,890,123,456,789.000,000 | -4,567,890,123,456,789.000,000
(5 rows)
SELECT '' AS to_char_3, to_char( (q1 * -1), '9999999999999999PR'),
to_char( (q2 * -1), '9999999999999999.999PR')
FROM INT8_TBL;
to_char_3 | to_char | to_char
-----------+--------------------+------------------------
| <123> | <456.000>
| <123> | <4567890123456789.000>
That's strange, I can't replicate that issue on my Windows build. I've
tried with 'PG_REGRESS_DIFF_OPTS="-dU10" make check' and all the tests
(including int8) pass fine. The spacing in the results is perfectly
normal. I wonder if something else on your build is causing this? I've
also tried several "make check" options for different locales
mentioned in the docs and they pass fine.

Hmm. Just in case my macOS laptop (CC=Apple's clang,
LANG=en_NZ.UTF-8) was unduly affected by cosmic rays I tried on a
couple of nearby servers running FreeBSD 11.1 (CC=clang, LANG=<unset>)
and CentOS 7.3 (CC=gcc, LANG=en_US.utf-8) and got the same result:
int8 has lost some whitespace in to_char output.

It looks a bit like it has lost a leading space for every ',' that was
present in the format string but which didn't finish up getting output
(because the number was too small). It looks a bit like that doesn't
happen for 'G', so let's compare the NUM_COMMA and NUM_G cases. Ahh..
I'm not sure, but I think you might have a close-brace in the wrong
place here:

@@ -4879,6 +4883,10 @@ NUM_processor(FormatNode *node, NUMDesc *Num,
char *inout,
                                                                continue;
                                                }
                                        }    <--- this guy here might
need to move after "noadd = true"?
+                                               if
(strncmp(Np->inout_p, Np->L_thousands_sep, separator_len) == 0)
+                                                       Np->inout_p +=
separator_len - 1;
+                                               else
+                                                       noadd = true;
                                        break;

case NUM_G:

With that change the test passes for me. But why do we get different results?

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Oliver Ford

ojford@gmail.com

almost 9 years ago

In reply to: Thomas Munro (#4)

Re: Fix number skipping in to_number

On Thu, Aug 17, 2017 at 11:55 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Hmm. Just in case my macOS laptop (CC=Apple's clang,
LANG=en_NZ.UTF-8) was unduly affected by cosmic rays I tried on a
couple of nearby servers running FreeBSD 11.1 (CC=clang, LANG=<unset>)
and CentOS 7.3 (CC=gcc, LANG=en_US.utf-8) and got the same result:
int8 has lost some whitespace in to_char output.

It looks a bit like it has lost a leading space for every ',' that was
present in the format string but which didn't finish up getting output
(because the number was too small). It looks a bit like that doesn't
happen for 'G', so let's compare the NUM_COMMA and NUM_G cases. Ahh..
I'm not sure, but I think you might have a close-brace in the wrong
place here:
@@ -4879,6 +4883,10 @@ NUM_processor(FormatNode *node, NUMDesc *Num,
char *inout,
continue;
}
}    <--- this guy here might
need to move after "noadd = true"?
+                                               if
(strncmp(Np->inout_p, Np->L_thousands_sep, separator_len) == 0)
+                                                       Np->inout_p +=
separator_len - 1;
+                                               else
+                                                       noadd = true;
break;
case NUM_G:

With that change the test passes for me. But why do we get different results?

--
Thomas Munro
http://www.enterprisedb.com

Ok I've made that change in the attached v3. I'm not sure as I'm on
en_US.UTF-8 locale too. Maybe something Windows specific?

Nathan Wagner

nw+pg@hydaspes.if.org

almost 9 years ago

In reply to: Oliver Ford (#5)

Re: Fix number skipping in to_number

On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:

Ok I've made that change in the attached v3. I'm not sure as I'm on
en_US.UTF-8 locale too. Maybe something Windows specific?

This patch applies against master (8485a25a), compiles, and
passes a make check.

I tested both on my mac laptop, and my linux server.

If we want this patch, I'd say it's ready for committer. We may want
(and I can't believe I'm saying this) more discussion as to exactly what
the strategy for to_number() (and friends) is. Do we want to duplicate
Oracle's functionality, or do we want a similar function to do similar
things, without necessarily having a goal of identical behavior to
oracle?

For myself, I pretty much never use the to_date, to_number, or
to_timestamp functions except when porting oracle code. I do use the
to_char functions on occasion. If strftime were available, I probably
wouldn't use them.

I would commit this patch and update the TODO with a goal of making
to_number as Oracle compatible as is reasonable.

--
nw

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Oliver Ford

ojford@gmail.com

almost 9 years ago

In reply to: Nathan Wagner (#6)

Re: Fix number skipping in to_number

On Monday, 25 September 2017, Nathan Wagner <nw+pg@hydaspes.if.org> wrote:

On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:

Ok I've made that change in the attached v3. I'm not sure as I'm on
en_US.UTF-8 locale too. Maybe something Windows specific?

This patch applies against master (8485a25a), compiles, and
passes a make check.

I tested both on my mac laptop, and my linux server.

If we want this patch, I'd say it's ready for committer. We may want
(and I can't believe I'm saying this) more discussion as to exactly what
the strategy for to_number() (and friends) is. Do we want to duplicate
Oracle's functionality, or do we want a similar function to do similar
things, without necessarily having a goal of identical behavior to
oracle?

For myself, I pretty much never use the to_date, to_number, or
to_timestamp functions except when porting oracle code. I do use the
to_char functions on occasion. If strftime were available, I probably
wouldn't use them.

I would commit this patch and update the TODO with a goal of making
to_number as Oracle compatible as is reasonable.

Thanks for your review. The issue is that Oracle throws errors on many more
input cases than Postgres does, so making it exactly like Oracle could
break a lot of existing users. E.g. to_number ('123,000', '999') returns
'123' on Postgres, but throws an error on Oracle. So making it exactly
Oracle-like could break existing users who might rely on the current
behavior.

My view is that we shouldn't deliberately introduce errors in order to be
exactly like Oracle if we don't currently and there's a sane use case for
current behavior. Do you have any examples of results that are different
between Oracle and Postgres, and you think the Oracle result makes more
sense?

Nathan Wagner

nw+pg@hydaspes.if.org

almost 9 years ago

In reply to: Oliver Ford (#7)

Re: Fix number skipping in to_number

On Mon, Sep 25, 2017 at 07:52:19PM +0100, Oliver Ford wrote:

Thanks for your review. The issue is that Oracle throws errors on many
more input cases than Postgres does, so making it exactly like Oracle
could break a lot of existing users. E.g. to_number ('123,000', '999')
returns '123' on Postgres, but throws an error on Oracle. So making it
exactly Oracle-like could break existing users who might rely on the
current behavior.

I wouldn't use to_number for anything other than oracle compatibility,
and then just so I didn't have to wade through all the ported oracle
code. I would use a regex or some such to clean up the number, and then
cast the result. For an input string of '123,000' I might do a
translate('123,000', ',', '')::integer or perhaps use regexp_replace().
Either way I would want a more positive decision as to what was valid or
not, based on the input.

My view is that we shouldn't deliberately introduce errors in order to be
exactly like Oracle if we don't currently and there's a sane use case for
current behavior. Do you have any examples of results that are different
between Oracle and Postgres, and you think the Oracle result makes more
sense?

Not really, other than I think an error report might make more sense.
'123,000' doesn't really match the format of '999'. If anything it
seems like we're guessing rather than validating input. It is
surprising (to me at least) that

to_char(to_number('123,000', '999'), '999')

doesn't give us the original input (in the sense that identical formats
don't preserve the original string). So I'm not sure the current
behavior is a sane use case, but perhaps more people are using
to_number() to get *some* numeric result, rather than for wanting it to
be like oracle. I would generally prefer to throw an exception instead
of getting a number I wasn't expecting, but I can see cases where that
might not be the case.

--
nw

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Daniel Gustafsson

daniel@yesql.se

over 8 years ago

In reply to: Nathan Wagner (#6)

Re: Fix number skipping in to_number

On 25 Sep 2017, at 02:52, Nathan Wagner <nw+pg@hydaspes.if.org> wrote:

On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:

Ok I've made that change in the attached v3. I'm not sure as I'm on
en_US.UTF-8 locale too. Maybe something Windows specific?

This patch applies against master (8485a25a), compiles, and
passes a make check.

I tested both on my mac laptop, and my linux server.

If we want this patch, I'd say it's ready for committer. We may want
(and I can't believe I'm saying this) more discussion as to exactly what
the strategy for to_number() (and friends) is.

Based on this review, I am moving this patch to the next commitfest and will
update it to Ready for committer. There might, as you say, be more discussion
needed, but due to the lack of extensive such so far I’ll move it for now.

cheers ./daniel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Oliver Ford (#5)

Re: Fix number skipping in to_number

Oliver Ford <ojford@gmail.com> writes:

[ 0001-apply-number-v3.patch ]

I looked at this patch briefly and have a couple of comments:

* It seems entirely wrong to be matching to L_thousands_sep in the
NUM_COMMA case; that format code is by definition not locale aware,
so it should be matching to plain ',' independently of locale.

* Don't we need to fix the NUM_L (currency symbol) case in the
same manner? (The NUM_D and NUM_S cases are handled in
NUM_numpart_from_char and seem ok at a quick glance.)

* I'm not in love with the noadd flag. Other places in this
switch that want to skip the final increment do it with
"continue", and I think this should do likewise.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Oliver Ford

ojford@gmail.com

over 8 years ago

In reply to: Tom Lane (#10)

Re: [HACKERS] Fix number skipping in to_number

On Sun, Nov 12, 2017 at 7:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oliver Ford <ojford@gmail.com> writes:

[ 0001-apply-number-v3.patch ]

I looked at this patch briefly and have a couple of comments:

* It seems entirely wrong to be matching to L_thousands_sep in the
NUM_COMMA case; that format code is by definition not locale aware,
so it should be matching to plain ',' independently of locale.

Ok I've changed it to just search for a comma.

* Don't we need to fix the NUM_L (currency symbol) case in the
same manner? (The NUM_D and NUM_S cases are handled in
NUM_numpart_from_char and seem ok at a quick glance.)

Yes you get the same skipping if you do:

select to_number('12','L99');
to_number
-----------
2

However, this case is not as easy to fix as you can't do a simple
string comparison like with the group separator. The currency symbol
for the locale can be " " but if we do a comparison, it won't match if
the symbol specified is "$" or "£" (so will end up missing characters
at the end of the supplied string). Could we apply the attached patch
and then put fixing it for currency on the TODO list?

* I'm not in love with the noadd flag. Other places in this
switch that want to skip the final increment do it with
"continue", and I think this should do likewise.

I've changed to continue in the latest patch. All existing tests pass.

#12

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Oliver Ford (#11)

Re: [HACKERS] Fix number skipping in to_number

Oliver Ford <ojford@gmail.com> writes:

On Sun, Nov 12, 2017 at 7:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

* Don't we need to fix the NUM_L (currency symbol) case in the
same manner? (The NUM_D and NUM_S cases are handled in
NUM_numpart_from_char and seem ok at a quick glance.)

Yes you get the same skipping if you do:

select to_number('12','L99');
to_number
-----------
2

However, this case is not as easy to fix as you can't do a simple
string comparison like with the group separator. The currency symbol
for the locale can be " " but if we do a comparison, it won't match if
the symbol specified is "$" or "£" (so will end up missing characters
at the end of the supplied string). Could we apply the attached patch
and then put fixing it for currency on the TODO list?

I don't follow your concern? If "$" is not the correct currency
symbol for the locale, we shouldn't accept it as a match to an L format.
Your patch is tightening what we will accept as a match to a G format,
so I don't see why you're concerned about backward compatibility in
one case but not the other.

regards, tom lane

#13

Oliver Ford

ojford@gmail.com

over 8 years ago

In reply to: Tom Lane (#12)

Re: Fix number skipping in to_number

On Monday, 13 November 2017, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oliver Ford <ojford@gmail.com <javascript:;>> writes:

On Sun, Nov 12, 2017 at 7:00 PM, Tom Lane <tgl@sss.pgh.pa.us

<javascript:;>> wrote:

* Don't we need to fix the NUM_L (currency symbol) case in the
same manner? (The NUM_D and NUM_S cases are handled in
NUM_numpart_from_char and seem ok at a quick glance.)

Yes you get the same skipping if you do:

select to_number('12','L99');
to_number
-----------
2

However, this case is not as easy to fix as you can't do a simple
string comparison like with the group separator. The currency symbol
for the locale can be " " but if we do a comparison, it won't match if
the symbol specified is "$" or "£" (so will end up missing characters
at the end of the supplied string). Could we apply the attached patch
and then put fixing it for currency on the TODO list?

I don't follow your concern? If "$" is not the correct currency
symbol for the locale, we shouldn't accept it as a match to an L format.
Your patch is tightening what we will accept as a match to a G format,
so I don't see why you're concerned about backward compatibility in
one case but not the other.

regards, tom lane

It's a guess as to the likely use case. I would imagine that people are
likely to use a currency symbol different from the locale, but unlikely to
use a different group separator. Others might have a different opinion
though.

#14

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Oliver Ford (#13)

Re: Fix number skipping in to_number

Oliver Ford <ojford@gmail.com> writes:

On Monday, 13 November 2017, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't follow your concern? If "$" is not the correct currency
symbol for the locale, we shouldn't accept it as a match to an L format.
Your patch is tightening what we will accept as a match to a G format,
so I don't see why you're concerned about backward compatibility in
one case but not the other.

It's a guess as to the likely use case. I would imagine that people are
likely to use a currency symbol different from the locale, but unlikely to
use a different group separator. Others might have a different opinion
though.

Well, if they use a currency symbol different from the locale's, they're
in trouble anyway because the number of bytes might be different. In most
encodings, symbols other than "$" are probably not 1-byte characters.

At the very least I think we need to constrain it enough that it not
swallow a fractional character.

regards, tom lane

#15

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Tom Lane (#14)

Re: Fix number skipping in to_number

I wrote:

Oliver Ford <ojford@gmail.com> writes:

On Monday, 13 November 2017, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't follow your concern? If "$" is not the correct currency
symbol for the locale, we shouldn't accept it as a match to an L format.
Your patch is tightening what we will accept as a match to a G format,
so I don't see why you're concerned about backward compatibility in
one case but not the other.

It's a guess as to the likely use case. I would imagine that people are
likely to use a currency symbol different from the locale, but unlikely to
use a different group separator. Others might have a different opinion
though.

Well, if they use a currency symbol different from the locale's, they're
in trouble anyway because the number of bytes might be different. In most
encodings, symbols other than "$" are probably not 1-byte characters.
At the very least I think we need to constrain it enough that it not
swallow a fractional character.

After more testing I understood your concern about L_currency_symbol:
in C locale that's " ", not "$" as I naively imagined. Americans,
at least, would be pretty unhappy if "$1234.56" suddenly stopped matching
"L9999.99". So it seems like we can't institute a strict matching rule.
However, it is certainly not good that this happens:

regression=# select to_number('1234.56', 'L9999.99');
to_number
-----------
234.56
(1 row)

To me that seems just as bad as having ',' or 'G' eat a digit.

After some reflection I propose that the rule that we want is:

* ',' and 'G' consume input only if it exactly matches the expected
separator.

* Other non-data template patterns consume a number of input characters
equal to the number of characters they'd produce in output, *except* that
these patterns will not consume data characters (digits, signs, decimal
point, comma).

I think that while we are at it we should take some measures to ensure
that "character" in this definition means "character", not "byte".
It is not good that a euro currency symbol might consume an
encoding-dependent number of input characters.

That leads me to the attached patch. There is more that could be done
here --- in particular, I'd like to see the character-not-byte-count
rule extended to literal text. But that seems like fit material for
a different patch.

Also, I noticed that in your form of the patch, the strncmp() could read
past the end of the string, possibly resulting in a crash. So I made it
use the AMOUNT_TEST infrastructure from NUM_numpart_from_char to avoid that.

One other note: I realized that it was only pure luck that your regression
test cases worked in locales where 'G' is decimal point --- they still
gave the same answer, but through a totally different interpretation of
the input. That did not seem like a good idea, so I adjusted the
regression test to force C locale for the to_number() tests. I wish we
could use some other locale here, but then it likely wouldn't be portable
to Windows.

regards, tom lane

#16

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Tom Lane (#15)

Re: Fix number skipping in to_number

I wrote:

That leads me to the attached patch. There is more that could be done
here --- in particular, I'd like to see the character-not-byte-count
rule extended to literal text. But that seems like fit material for
a different patch.

Hearing no further comments, I pushed that patch.

regards, tom lane

#17

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Tom Lane (#15)

Re: Fix number skipping in to_number

I wrote:

That leads me to the attached patch. There is more that could be done
here --- in particular, I'd like to see the character-not-byte-count
rule extended to literal text. But that seems like fit material for
a different patch.

Attached is a patch that makes formatting.c more multibyte-aware;
it now handles multibyte characters as single NODE_TYPE_CHAR format
nodes, rather than one node per byte. This doesn't really have much
impact on the output (to_char) side, but on the input side, it
greatly simplifies treating such characters as single characters
rather than multiple ones. An example is that (in UTF8 encoding)
previously we got

u8=# select to_number('$12.34', '€99D99');
to_number
-----------
0.34
(1 row)

because the literal euro sign is 3 bytes long and was thought to be
3 literal characters. Now we get the expected result

u8=# select to_number('$12.34', '€99D99');
to_number
-----------
12.34
(1 row)

Aside from skipping 1 input character (not byte) per literal format
character, I fixed the SKIP_THth macro, allowing to_date/to_timestamp to
also follow the rule of skipping whole characters not bytes for non-data
format patterns. There might be some other places that need similar
adjustments, but I couldn't find any.

Not sure about whether/how to add regression tests for this; it's really
impossible to add specific tests in an ASCII-only test file. Maybe we
could put a test or two into collate.linux.utf8.sql, but it seems a bit
off topic for that, and I think that test file hardly gets run anyway.

Note this needs to be applied over the patch I posted at
/messages/by-id/3626.1510949486@sss.pgh.pa.us
I intend to commit that fairly soon, but it's not in right now.

regards, tom lane

#18

Ioseph Kim

pgsql-kr@postgresql.kr

over 7 years ago

In reply to: Oliver Ford (#1)

Re: [HACKERS] Fix number skipping in to_number

Hi.

I'm checking release note for version 11.

in that.

|"L| and |TH| now only consume characters that are not digits,
positive/negative signs, decimal points, or commas."

postgres@postgres=# select to_number('1234', '+9999');
to_number
-----------
234

Is this right?

Regards, ioseph.

||||

2017년 08월 10일 19:21에 Oliver Ford 이(가) 쓴 글:

Show quoted text

Prevents an issue where numbers can be skipped in the to_number()
function when the format mask contains a "G" or a "," but the input
string doesn't contain a separator. This resolves the TODO item "Fix
to_number() handling for values not matching the format string".

== Change ==

Currently, if the format mask in to_number() has a "G" or a ",", it
will assume that the input string contains a separator character in
the same place. If however a number is there instead, that number will
be silently skipped and not appear in the output. So we get:

select to_number('34,50','999,99');
to_number
-----------
340
(1 row)

This patch checks the input string when it encounters a "G" or "," in
the format mask. If the separator character is found, the code moves
over it as normal. If it's not found, then the code no longer
increments the relevant pointer so as not to skip the character. After
the patch, we get the correct result:

select to_number('34,50','999,99');
to_number
-----------
3450
(1 row)

This is in line with Oracle's result.

== Rationale ==

This patch is a small change, which leaves PostgreSQL behavior
different from Oracle behavior in related cases. Oracle's
implementation seems to read from right-to-left, and raises an
"ORA-01722: invalid number" error if there are digits in the input
string which don't have corresponding characters in the format mask. I
have chosen not to throw such errors, because there are use cases for
only returning part of a number string. For example, the following is
an error on Oracle:

select to_number('123,000', '999G') from dual;
Error report -
SQL Error: ORA-01722: invalid number

But if you wanted to only take the characters before the comma, and
discard the thousands part, you can do so on PostgreSQL with:

select to_number('123,000', '999G');
to_number
-----------
123
(1 row)

This is the current behavior. Which is why I think it makes more sense
to do what PostgreSQL currently does and read from left-to-right. The
only change, as mentioned above, is that the current behavior can skip
a digit:

select to_number('123456', '999G999');
to_number
-----------
12356
(1 row)

After the patch, this returns all the digits:

select to_number('123456', '999G999');
to_number
-----------
123456
(1 row)

== Testing ==

Tested on Windows with MinGW using the latest checkout from master.
Added regression tests to check for this new behavior. All existing
tests pass.

Fix number skipping in to_number

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: