Infinities in type numeric
We had a discussion recently about how it'd be a good idea to support
infinity values in type numeric [1]. Here's a draft patch enabling
that, using the idea suggested in that thread of commandeering some
unused bits in the representation of numeric NaNs. AFAICT we've been
careful to ensure those bits are always zero, so that this will work
without creating any pg_upgrade problems.
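The spare-bits idea can be sketched like this (the particular flag values here are illustrative, not necessarily what the patch uses):

```python
# Sketch of reusing spare bits in numeric's 16-bit sign/header field.
# These flag values are hypothetical; the patch may choose others.
NUMERIC_SPECIAL = 0xC000   # both high bits set marks a "special" value
NUMERIC_NAN     = 0xC000   # existing NaN encoding
NUMERIC_PINF    = 0xD000   # +Infinity: NaN code plus a formerly-zero bit
NUMERIC_NINF    = 0xF000   # -Infinity: another spare-bit combination

def is_special(header):
    return (header & 0xC000) == NUMERIC_SPECIAL

def is_nan(header):
    return header == NUMERIC_NAN

# Old binaries always wrote the spare bits as zero, so on-disk NaNs
# (0xC000) stay NaNs and pg_upgrade needs no data rewrite.
```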
This is just WIP, partly because I haven't touched the SGML docs
yet, but also because there are some loose ends to be resolved:
* I believe I made all the functions that correspond to POSIX-standard
functions do what POSIX says for infinite inputs. However, this does
not always match what our existing float8 functions do [2]. I'm
assuming that we'll change those functions to match POSIX; but if we
don't, this might need another look.
* I had to invent some semantics for non-standardized functions,
particularly numeric_mod, numeric_gcd, numeric_lcm. This area
could use review to be sure that I chose desirable behaviors.
* I'm only about 50% sure that I understand what the sort abbreviation
code is doing. A quick look from Peter or some other expert would be
helpful.
* It seems to me that the existing behavior of numeric_stddev_internal
is not quite right for the case of a single input value that is a NaN,
when in "sample" mode. Per the comment "Sample stddev and variance are
undefined when N <= 1", ISTM that we ought to return NULL in this case,
but actually you get a NaN because the check for "NaNcount > 0" is made
before considering that. I think that's the wrong way round --- in some
sense NULL is "less defined" than NaN, so that's what we ought to use.
Moreover, the float8 stddev code agrees: in HEAD you get
regression=# SELECT stddev_samp('nan'::float8);
 stddev_samp
-------------

(1 row)

regression=# SELECT stddev_samp('nan'::numeric);
 stddev_samp
-------------
         NaN
(1 row)
So I think we ought to make the numeric code match the former, and have
done that here. However, the float8 code has its own issues for the
population case [3], and depending on what we do about that, this might
need further changes to agree. (There's also the question of whether to
back-patch any such bug fixes.)
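The check order argued for above can be sketched as a toy model (hypothetical helper; the "N <= 1 means undefined" test wins over NaN-ness):

```python
import math

def stddev_samp(values):
    """Toy model of sample stddev with the check order argued for above:
    the N <= 1 "undefined" case returns None (SQL NULL) even when the
    single input is NaN; NaN only propagates when N > 1."""
    n = len(values)
    if n <= 1:
        return None                      # undefined, not NaN
    if any(math.isnan(v) for v in values):
        return float('nan')
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
```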
* The jsonpath code is inconsistent about how it handles NaN vs Inf [4].
I'm assuming here that we'll fix that by rejecting NaNs in that code,
but if we conclude that we do need to allow non-finite double values
there, probably we need to allow Infs too.
* It seems like there might be a use-case for isfinite() and maybe
isnan() SQL functions. On the other hand, we don't have those for
float4/float8 either. These could be a follow-on addition, anyway.
I'll stick this in the queue for review.
regards, tom lane
[1]: /messages/by-id/27490.1590414212@sss.pgh.pa.us
[2]: /messages/by-id/582552.1591917752@sss.pgh.pa.us
[3]: /messages/by-id/353062.1591898766@sss.pgh.pa.us
[4]: /messages/by-id/203949.1591879542@sss.pgh.pa.us
Attachments:
numeric-infinities-1.patch (text/x-diff, +2059 -334)
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> * I'm only about 50% sure that I understand what the sort
Tom> abbreviation code is doing. A quick look from Peter or some other
Tom> expert would be helpful.
That code was originally mine, so I'll look at it.
--
Andrew (irc:RhodiumToad)
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> @@ -359,10 +390,14 @@ typedef struct NumericSumAccum
Tom> #define NumericAbbrevGetDatum(X) ((Datum) (X))
Tom> #define DatumGetNumericAbbrev(X) ((int64) (X))
Tom> #define NUMERIC_ABBREV_NAN NumericAbbrevGetDatum(PG_INT64_MIN)
Tom> +#define NUMERIC_ABBREV_PINF NumericAbbrevGetDatum(PG_INT64_MIN)
Tom> +#define NUMERIC_ABBREV_NINF NumericAbbrevGetDatum(PG_INT64_MAX)
Tom> #else
Tom> #define NumericAbbrevGetDatum(X) ((Datum) (X))
Tom> #define DatumGetNumericAbbrev(X) ((int32) (X))
Tom> #define NUMERIC_ABBREV_NAN NumericAbbrevGetDatum(PG_INT32_MIN)
Tom> +#define NUMERIC_ABBREV_PINF NumericAbbrevGetDatum(PG_INT32_MIN)
Tom> +#define NUMERIC_ABBREV_NINF NumericAbbrevGetDatum(PG_INT32_MAX)
Tom> #endif
I'd have been more inclined to go with -PG_INT64_MAX / -PG_INT32_MAX for
the NUMERIC_ABBREV_PINF value. It seems more likely to be beneficial to
bucket +Inf and NaN separately (and put +Inf in with the "too large to
abbreviate" values) than to bucket them together so as to distinguish
between +Inf and "too large" values. But this is an edge case in any
event, so it probably wouldn't make a great deal of difference unless
you're sorting on data with a large proportion of both +Inf and NaN
values.
(It would be possible to add another bucket so that "too large", +Inf,
and NaN were three separate buckets, but honestly any more complexity
seems not worth it for handling an edge case.)
The actual changes to the abbrev stuff look fine to me.
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> +#define NUMERIC_ABBREV_PINF NumericAbbrevGetDatum(PG_INT64_MIN)
Tom> +#define NUMERIC_ABBREV_PINF NumericAbbrevGetDatum(PG_INT32_MIN)
I'd have been more inclined to go with -PG_INT64_MAX / -PG_INT32_MAX for
the NUMERIC_ABBREV_PINF value. It seems more likely to be beneficial to
bucket +Inf and NaN separately (and put +Inf in with the "too large to
abbreviate" values) than to bucket them together so as to distinguish
between +Inf and "too large" values. But this is an edge case in any
event, so it probably wouldn't make a great deal of difference unless
you're sorting on data with a large proportion of both +Inf and NaN
values.
I had been worried about things possibly sorting in the wrong order
if I did that. However, now that I look more closely I see that
* We convert the absolute value of the numeric to a 31-bit or 63-bit positive
* value, and then negate it if the original number was positive.
so that a finite value should never map to INT[64]_MIN, making it
safe to do as you suggest. I agree that distinguishing +Inf from NaN
is probably more useful than distinguishing it from the very largest
class of finite values, so will do it as you suggest. Thanks!
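The resulting sort behavior can be modeled with a toy key function (illustrative only; the real code packs the leading digits into the non-extreme range rather than using the value directly):

```python
import math

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def abbrev_key(x):
    """Toy model of the reverse-order abbreviation: larger numerics get
    smaller keys. NaN (which numeric sorts last) takes INT64_MIN; +Inf
    takes -INT64_MAX, sharing the "too large" bucket as discussed;
    -Inf takes INT64_MAX."""
    if math.isnan(x):
        return INT64_MIN
    if x == math.inf:
        return -INT64_MAX
    if x == -math.inf:
        return INT64_MAX
    return -int(x)   # stand-in for the real digit-packing

vals = [1.0, math.nan, -math.inf, math.inf, -5.0]
# Descending key order reproduces ascending numeric order, NaN last:
ordered = sorted(vals, key=abbrev_key, reverse=True)
```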
regards, tom lane
On Thu, Jun 11, 2020 at 9:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
We had a discussion recently about how it'd be a good idea to support
infinity values in type numeric [1].
Only a minority of that discussion was actually on that topic, and I'm
not sure there was a clear consensus in favor of it.
FWIW, I don't particularly like the idea. Back when I was an
application developer, I remember having to insert special cases into
any code that dealt with double precision to deal with +/-Inf and NaN.
I was happy that I didn't need them for numeric, too. So this change
would have made me sad.
It's possible I'm the only one, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jun 11, 2020 at 9:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
We had a discussion recently about how it'd be a good idea to support
infinity values in type numeric [1].
FWIW, I don't particularly like the idea. Back when I was an
application developer, I remember having to insert special cases into
any code that dealt with double precision to deal with +/-Inf and NaN.
I was happy that I didn't need them for numeric, too. So this change
would have made me sad.
Well, you're already stuck with special-casing numeric NaN, so I'm
not sure that Inf makes your life noticeably worse on that score.
This does tie into something I have a question about in the patch's
comments though. As the patch stands, numeric(numeric, integer)
(that is, the typmod-enforcement function) just lets infinities
through regardless of the typmod, on the grounds that it is/was also
letting NaNs through regardless of typmod. But you could certainly
make the argument that Inf should only be allowed in an unconstrained
numeric column, because by definition it overflows any finite precision
restriction. If we did that, you'd never see Inf in a
standard-conforming column, since SQL doesn't allow unconstrained
numeric columns IIRC. That'd at least ameliorate your concern.
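That rule can be sketched as (hypothetical helper mirroring the argument, not the patch's actual C code):

```python
import math

def apply_typmod(value, precision=None):
    """Toy typmod enforcement: with no precision constraint anything is
    accepted; with numeric(p, s), Inf overflows any finite precision and
    is rejected.  (NaN here keeps its historical free pass.)"""
    if precision is None or math.isnan(value):
        return value
    if math.isinf(value):
        raise OverflowError("numeric field overflow")
    return value
```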
If we were designing this today, I think I'd vote to disallow NaN
in a constrained numeric column, too. But I suppose it's far too
late to change that aspect.
regards, tom lane
On Fri, Jun 12, 2020 at 1:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
This does tie into something I have a question about in the patch's
comments though. As the patch stands, numeric(numeric, integer)
(that is, the typmod-enforcement function) just lets infinities
through regardless of the typmod, on the grounds that it is/was also
letting NaNs through regardless of typmod. But you could certainly
make the argument that Inf should only be allowed in an unconstrained
numeric column, because by definition it overflows any finite precision
restriction. If we did that, you'd never see Inf in a
standard-conforming column, since SQL doesn't allow unconstrained
numeric columns IIRC. That'd at least ameliorate your concern.
Yes, I agree. It also seems like a more principled choice - I am not
sure why if I ask for a number no larger than 10^3 we ought to permit
infinity.
BTW, has there been any thought to supporting a negative scale for the
numeric data type? If you can cut off digits after the decimal, why
not before?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
[...]
Tom> so that a finite value should never map to INT[64]_MIN, making it
Tom> safe to do as you suggest. I agree that distinguishing +Inf from
Tom> NaN is probably more useful than distinguishing it from the very
Tom> largest class of finite values, so will do it as you suggest.
Tom> Thanks!
It would make sense to make sure there's a test case in which at least
one value of all three of: a finite value much greater than 10^332, a
+Inf, and a NaN were all present in the same sort, if there isn't one
already.
--
Andrew (irc:RhodiumToad)
Robert Haas <robertmhaas@gmail.com> writes:
BTW, has there been any thought to supporting a negative scale for the
numeric data type? If you can cut off digits after the decimal, why
not before?
Hm, would there be any real use-case?
An implementation issue is that even in the "long" numeric format,
we cram dscale into a 14-bit unsigned field. You could redefine
the field as signed and pray that nobody has dscales above 8K
stored on disk, but I'm dubious that there's a good argument for
taking that risk.
There might be algorithmic issues as well, haven't really looked.
Any such problems would probably be soluble, if need be by forcing
the scale to be at least 0 for calculation and then rounding
afterwards.
regards, tom lane
On Fri, Jun 12, 2020 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
BTW, has there been any thought to supporting a negative scale for the
numeric data type? If you can cut off digits after the decimal, why
not before?

Hm, would there be any real use-case?
Compatibility... apparently people do use it.
An implementation issue is that even in the "long" numeric format,
we cram dscale into a 14-bit unsigned field. You could redefine
the field as signed and pray that nobody has dscales above 8K
stored on disk, but I'm dubious that there's a good argument for
taking that risk.
That doesn't sound too appealing I guess, but couldn't you enforce it
as a typmod without changing the on-disk representation of the
values?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jun 12, 2020 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
BTW, has there been any thought to supporting a negative scale for the
numeric data type? If you can cut off digits after the decimal, why
not before?
Hm, would there be any real use-case?
Compatibility... apparently people do use it.
Uh, compatibility with what? Not the SQL spec, for sure.
An implementation issue is that even in the "long" numeric format,
we cram dscale into a 14-bit unsigned field.
That doesn't sound too appealing I guess, but couldn't you enforce it
as a typmod without changing the on-disk representation of the
values?
On second thought, I'm confusing two distinct though related concepts.
dscale is *display* scale, and it's fine that it's unsigned, because
there is no reason to suppress printing digits to the left of the decimal
point. ("Whaddya mean, 10 is really 100?") We could allow the "scale"
part of typmod to be negative and thereby cause an input of, say,
123.45 to be rounded to say 100 --- but it should display as 100 not 1,
so its display scale is still 0.
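Python's Decimal illustrates the distinction between rounding scale and display scale:

```python
from decimal import Decimal, ROUND_HALF_UP

# Round to scale -2 (hundreds)...
v = Decimal('123.45').quantize(Decimal('1E+2'), rounding=ROUND_HALF_UP)
assert v == 100

# ...but render with display scale 0: "100", not "1E+2" or "1".
print(format(v, '.0f'))
```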
Hence, there's no pg_upgrade issue. You'd still need to rethink how
precision and scale get packed into an int32 typmod, but those only
exist in catalog data, so pg_upgrade's schema dump/restore would be
enough to update them.
Having said that, we've long resisted redefining the encoding of
typmod for other data types (despite the clear insanity of some
of the encodings), for fear that client code might be looking at
those catalog columns. I'm not sure how big a deal that really is.
regards, tom lane
Here's a v2 patch:
* Rebased over today's nearby commits
* Documentation changes added
* Sort abbrev support improved per Andrew's suggestions
* Infinities now considered to fail any typmod precision limit,
per discussion with Robert.
regards, tom lane
Attachments:
numeric-infinities-2.patch (text/x-diff, +2148 -339)
On Fri, 12 Jun 2020 at 02:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
* I had to invent some semantics for non-standardized functions,
particularly numeric_mod, numeric_gcd, numeric_lcm. This area
could use review to be sure that I chose desirable behaviors.
I think the semantics you've chosen for numeric_mod() are reasonable,
and I think they're consistent with POSIX fmod().
However, it looks like you've chosen gcd('Inf', x) = x, whereas I'd
say that the result should be 'NaN'.
One way to look at it is that the GCD result should exactly divide
both inputs with no remainder, but the remainder when you divide 'Inf'
by x is undefined, so you can't say that x exactly divides 'Inf'.
Another way to look at it is that gcd('Inf', x) is limit(n -> 'Inf',
gcd(n, x)), but that limit isn't well-defined. For example, suppose
x=10, then gcd('Inf', 10) = limit(n -> 'Inf', gcd(n, 10)), but gcd(n,
10) is either 1,2,5 or 10 depending on n, and it does not converge to
any particular value in the limit n -> 'Inf'.
A third way to look at it would be to apply one round of Euclid's
algorithm to it: gcd('Inf', x) = gcd(x, mod('Inf', x)) = gcd(x, 'NaN')
= 'NaN'.
Now you could argue that x=0 is a special case, and gcd('Inf', 0) =
'Inf' on the grounds that gcd(a, 0) = a for all finite 'a'. However, I
don't think that's particularly useful, and it fails the first test
that the result exactly divides both inputs because mod('Inf', 'Inf')
is undefined ('NaN').
Similar arguments apply to lcm(), so I'd say that both gcd() and lcm()
should return 'NaN' if either input is 'Inf' or 'NaN'.
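The proposed behavior can be sketched as (hypothetical helper, restricted to integral inputs for brevity):

```python
import math

def numeric_gcd(a, b):
    """Toy model of the proposed gcd semantics: any NaN or Inf input
    yields NaN; otherwise ordinary Euclid via math.gcd."""
    if not (math.isfinite(a) and math.isfinite(b)):
        return float('nan')
    return float(math.gcd(int(a), int(b)))
```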
Regards,
Dean
On 6/12/20 7:00 PM, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jun 11, 2020 at 9:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
We had a discussion recently about how it'd be a good idea to support
infinity values in type numeric [1].

FWIW, I don't particularly like the idea. Back when I was an
application developer, I remember having to insert special cases into
any code that dealt with double precision to deal with +/-Inf and NaN.
I was happy that I didn't need them for numeric, too. So this change
would have made me sad.

Well, you're already stuck with special-casing numeric NaN, so I'm
not sure that Inf makes your life noticeably worse on that score.

This does tie into something I have a question about in the patch's
comments though. As the patch stands, numeric(numeric, integer)
(that is, the typmod-enforcement function) just lets infinities
through regardless of the typmod, on the grounds that it is/was also
letting NaNs through regardless of typmod. But you could certainly
make the argument that Inf should only be allowed in an unconstrained
numeric column, because by definition it overflows any finite precision
restriction. If we did that, you'd never see Inf in a
standard-conforming column, since SQL doesn't allow unconstrained
numeric columns IIRC.
It does. The precision and scale are both optional.
If the precision is missing, it's implementation defined; if the scale
is missing, it's 0.
--
Vik Fearing
Vik Fearing <vik@postgresfriends.org> writes:
On 6/12/20 7:00 PM, Tom Lane wrote:
If we did that, you'd never see Inf in a
standard-conforming column, since SQL doesn't allow unconstrained
numeric columns IIRC.
It does. The precision and scale are both optional.
If the precision is missing, it's implementation defined; if the scale
is missing, it's 0.
Ah, right, the way in which we deviate from the spec is that an
unconstrained numeric column doesn't coerce every entry to scale 0.
Still, that *is* a spec deviation, so adding "... and it allows Inf"
doesn't seem like it's making things worse for spec-compliant apps.
regards, tom lane
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
On Fri, 12 Jun 2020 at 02:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
* I had to invent some semantics for non-standardized functions,
particularly numeric_mod, numeric_gcd, numeric_lcm. This area
could use review to be sure that I chose desirable behaviors.
I think the semantics you've chosen for numeric_mod() are reasonable,
and I think they're consistent with POSIX fmod().
Ah, I had not thought to look at fmod(). I see that POSIX treats
x-is-infinite the same as y-is-zero: raise EDOM and return NaN.
I think it's okay to deviate to the extent of throwing an error in
one case and returning NaN in the other, but I added a comment
noting the difference.
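For what it's worth, Python's math.fmod (a thin wrapper over the C library call) exhibits the same POSIX rules, surfacing EDOM as ValueError:

```python
import math

# fmod(x, Inf) is x for finite x, per POSIX:
assert math.fmod(10, math.inf) == 10.0
assert math.fmod(-3.5, math.inf) == -3.5

# fmod(Inf, y) is a domain error: C sets EDOM and returns NaN,
# which Python turns into an exception.
try:
    math.fmod(math.inf, 10)
except ValueError:
    pass
```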
Similar arguments apply to lcm(), so I'd say that both gcd() and lcm()
should return 'NaN' if either input is 'Inf' or 'NaN'.
Works for me; that's easier anyway.
The attached v3 patch fixes these things and also takes care of an
oversight in v2: I'd made numeric() apply typmod restrictions to Inf,
but not numeric_in() or numeric_recv(). I believe the patch itself
is in pretty good shape now, though there are still some issues
elsewhere as noted in the first message in this thread.
Thanks for reviewing!
regards, tom lane
Attachments:
numeric-infinities-3.patch (text/x-diff, +2153 -328)
On Tue, 16 Jun 2020 at 18:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The attached v3 patch fixes these things and also takes care of an
oversight in v2: I'd made numeric() apply typmod restrictions to Inf,
but not numeric_in() or numeric_recv(). I believe the patch itself
is in pretty good shape now, though there are still some issues
elsewhere as noted in the first message in this thread.
I had a look at this, and I think it's mostly in good shape. It looks
like everything from the first message in this thread has been
resolved, except I don't know about the jsonpath stuff, because I
haven't been following that.
I tried to go over all the edge cases and I think they all make sense,
except for a couple of cases which I've listed below, along with a few
other minor comments:
1). I don't think that the way in_range() handles infinities is quite
right. For example:
SELECT in_range('inf'::numeric, 10::numeric, 'inf'::numeric, false, false);
in_range
----------
f
(1 row)
But I think that should return "val >= base + offset", which is "Inf
= Inf", which should be true.
Similarly, I think this should return true:
SELECT in_range('-inf'::numeric, 10::numeric, 'inf'::numeric, true, true);
in_range
----------
f
(1 row)
I think this could use some test coverage.
2). I think numeric_pg_lsn() needs updating -- this should probably be an error:
SELECT pg_lsn('inf'::numeric);
pg_lsn
--------
0/0
(1 row)
3). In the bottom half of numeric.c, there's a section header comment
saying "Local functions follow ... In general, these do not support
NaNs ...". That should probably also mention infinities. There are
also now more functions to mention that are exceptions to that comment
about not handling NaN/Inf, but I think that some of the new
exceptions can be avoided.
4). The comment for set_var_from_str() mentions that it doesn't handle
"NaN", so on the face of it, it ought to also mention that it doesn't
handle "Infinity" either. However, this is only a few lines down from
that "Local functions follow ..." section header comment, which
already covers that, so it seems pointless mentioning NaNs and
infinities again for this function (none of the other local functions
in that section of the file do).
5). It seems a bit odd that numeric_to_double_no_overflow() handles
infinite inputs, but not NaN inputs, while its only caller
numeric_float8_no_overflow() handles NaNs, but not infinities. ISTM
that it would be neater to have all the special-handling in one place
(in the caller). That would also stop numeric_to_double_no_overflow()
being an exception to the preceding section header comment about local
functions not handling Nan/Inf. In fact, I wonder why keep
numeric_to_double_no_overflow() at all? It could just be rolled into
its caller, making it more like numeric_float8().
6). The next function, numericvar_to_double_no_overflow(), has a
comment that just says "As above, but work from a NumericVar", but it
isn't really "as above" anymore, since it doesn't handle infinite
inputs. Depending on what happens to numeric_to_double_no_overflow(),
this function's comment might need some tweaking.
7). The new function numeric_is_integral() feels out of place where it
is, amongst arithmetic functions operating on NumericVar's, because it
operates on a Numeric, and also because it handles NaNs, making it
another exception to the preceding comment about local functions that
don't handle NaNs. Perhaps it would fit in better after
numeric_is_nan() and numeric_is_inf(). Even though it's a local
function, it feels more akin to those functions.
Finally, not really in the scope of this patch, but something I
noticed anyway while looking at edge cases -- float and numeric handle
NaN/0 differently:
SELECT 'nan'::float8 / 0::float8;
ERROR: division by zero
SELECT 'nan'::numeric / 0::numeric;
?column?
----------
NaN
I'm not sure if this is worth worrying about, or which behaviour is
preferable though.
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
I had a look at this, and I think it's mostly in good shape. It looks
like everything from the first message in this thread has been
resolved, except I don't know about the jsonpath stuff, because I
haven't been following that.
Thanks for the careful review! Yeah, Alexander fixed the jsonpath
stuff at df646509f, so I think all my original concerns are cleared,
other than the question of whether to invent isfinite() and isnan()
SQL functions. That seems like follow-on work in any case.
1). I don't think that the way in_range() handles infinities is quite
right. For example:
SELECT in_range('inf'::numeric, 10::numeric, 'inf'::numeric, false, false);
in_range
----------
f
(1 row)
But I think that should return "val >= base + offset", which is "Inf
= Inf", which should be true.
Hmm. I modeled the logic on the float8 in_range code, which does the
same thing:
# SELECT in_range('inf'::float8, 10::float8, 'inf'::float8, false, false);
in_range
----------
f
(1 row)
It does seem like this is wrong per the specification of in_range, though,
so do we have a bug to fix in the float in_range support? If so I'd
be inclined to go correct that first and then adapt the numeric patch
to match.
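The spec's comparison can be sanity-checked directly in IEEE float arithmetic (a toy transcription of the in_range contract):

```python
import math

def in_range(val, base, offset, sub, less):
    """Direct transcription of the in_range specification: compare val
    against base plus-or-minus offset, on the side selected by `less`."""
    bound = base - offset if sub else base + offset
    return val <= bound if less else val >= bound

# IEEE arithmetic gives the answers argued for above:
assert in_range(math.inf, 10.0, math.inf, False, False)   # Inf >= Inf
assert in_range(-math.inf, 10.0, math.inf, True, True)    # -Inf <= -Inf
```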
Similarly, I think this should return true:
SELECT in_range('-inf'::numeric, 10::numeric, 'inf'::numeric, true, true);
Same comment.
I think this could use some test coverage.
Evidently :-(
2). I think numeric_pg_lsn() needs updating -- this should probably be an error:
Oh, that was not there when I produced my patch. Will cover it in the
next version.
I agree with your other comments and will update the patch.
Finally, not really in the scope of this patch, but something I
noticed anyway while looking at edge cases -- float and numeric handle
NaN/0 differently:
SELECT 'nan'::float8 / 0::float8;
ERROR: division by zero
SELECT 'nan'::numeric / 0::numeric;
?column?
----------
NaN
Hmm. It seems like we generally ought to try to follow IEEE 754
for the semantics of operations on NaN, but I don't have a copy of
that spec so I'm not sure which result it specifies for this.
I agree that being inconsistent between the two types is not what
we want.
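For reference, the General Decimal Arithmetic specification (which Python's decimal module implements) makes a quiet-NaN operand take precedence over the division-by-zero check, matching numeric's current behavior:

```python
from decimal import Decimal, DivisionByZero

# A quiet NaN operand propagates before division-by-zero is considered:
result = Decimal('NaN') / Decimal(0)
assert result.is_nan()

# Whereas a non-NaN dividend does trap under the default context:
try:
    Decimal(1) / Decimal(0)
except DivisionByZero:
    pass
```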
regards, tom lane
I wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
I had a look at this, and I think it's mostly in good shape. It looks
like everything from the first message in this thread has been
resolved, except I don't know about the jsonpath stuff, because I
haven't been following that.
Thanks for the careful review!
Here's a v4 that syncs numeric in_range() with the new behavior of
float in_range(), and addresses your other comments too.
regards, tom lane