[PATCH] Fix overflow and underflow in regr_r2()

Started by Chengpeng Yanabout 2 months ago15 messageshackers
Jump to latest
#1Chengpeng Yan
chengpeng_yan@outlook.com

Hi,

While looking at the corr() overflow/underflow discussion [1]/messages/by-id/19340-6fb9f6637f562092@postgresql.org, I noticed
that regr_r2() still computes

(Sxy * Sxy) / (Sxx * Syy)

directly. At very small or very large scales, those products can round
to zero or infinity even when the ratio itself is finite.

For example,

SELECT regr_r2(1e-100 + g * 1e-105,
1e-100 + g * 1e-105)
FROM generate_series(1, 3) g;

returns NaN without the patch, although the inputs are perfectly
correlated and the result should be 1.

corr() already has a stabilized calculation for the same Sxx * Syy
denominator scale. This patch factors that into a helper and lets
regr_r2() use it as a fallback when one of its direct products has
rounded to zero or infinity. Otherwise, regr_r2() keeps the existing
direct formula.

This preserves regr_r2()'s existing SQL-level special cases. The added
tests cover the fallback path and nearby NaN behavior.

Thoughts?

References:
[1]: /messages/by-id/19340-6fb9f6637f562092@postgresql.org

--
Best regards,
Chengpeng Yan

Attachments:

v1-0001-Avoid-overflow-and-underflow-in-regr_r2.patchapplication/octet-stream; name=v1-0001-Avoid-overflow-and-underflow-in-regr_r2.patchDownload+112-30
#2Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Chengpeng Yan (#1)
Re: [PATCH] Fix overflow and underflow in regr_r2()

On Mon, 27 Apr 2026 at 12:18, Chengpeng Yan <chengpeng_yan@outlook.com> wrote:

While looking at the corr() overflow/underflow discussion [1], I noticed
that regr_r2() still computes

(Sxy * Sxy) / (Sxx * Syy)

directly. At very small or very large scales, those products can round
to zero or infinity even when the ratio itself is finite.

Yes, good point.

corr() already has a stabilized calculation for the same Sxx * Syy
denominator scale. This patch factors that into a helper and lets
regr_r2() use it as a fallback when one of its direct products has
rounded to zero or infinity. Otherwise, regr_r2() keeps the existing
direct formula.

The comments need work -- in particular float8_regr_r2() needs a
comment explaining the new overflow/underflow checks, similar to the
comment in float8_corr(). In fact, doing that, I think it's preferable
to just keep this change local to float8_regr_r2(), rather than
refactoring into a helper function for just a few lines of code.

This new check in float8_regr_r2():

+   if (Sxy == 0 && !isnan(Sxx) && !isnan(Syy))
+       PG_RETURN_FLOAT8(0.0);

seems pointless. It's optimising for a special case that will very
rarely occur in practice, and which is handled fine by the general
code. We don't want to slow down the common code path for such rare
special cases.

I noticed that this new overflow test case:

+SELECT regr_r2(1e154::float8 * g, 1e154::float8 * g)
+  FROM generate_series(1, 2) g;
+ regr_r2
+---------
+       1
+(1 row)

only produces 1 because it's run with a reduced extra_float_digits
value. I think it's better to use the test values "1e100 + g * 1e95",
which still trigger the overflow on HEAD, but more reliably produce 1,
regardless of the extra_float_digits setting, making it less likely
that there will be variations between platforms. That's also more
consistent with the other nearby test cases.

Attached is a v2 patch with those changes, plus a little more tidying
up of the regression tests.

Arguably, this is a bug fix, but given the lack of prior complaints, I
think we should apply this to HEAD only, like 6498287696d, meaning it
will only appear in PG19.

Regards,
Dean

Attachments:

v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patchDownload+87-23
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#2)
Re: [PATCH] Fix overflow and underflow in regr_r2()

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

Attached is a v2 patch with those changes, plus a little more tidying
up of the regression tests.

LGTM.

Arguably, this is a bug fix, but given the lack of prior complaints, I
think we should apply this to HEAD only, like 6498287696d, meaning it
will only appear in PG19.

Agreed on no back-patch. It seems sensible to slip this into v19
rather than wait for v20, since the corresponding fix in corr()
is new as of 19. But since that's a judgment call, maybe you should
get the concurrence of the RMT.

regards, tom lane

#4Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#3)
Re: [PATCH] Fix overflow and underflow in regr_r2()

On Sat, 16 May 2026 at 14:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

Attached is a v2 patch with those changes, plus a little more tidying
up of the regression tests.

LGTM.

Thanks for looking.

Arguably, this is a bug fix, but given the lack of prior complaints, I
think we should apply this to HEAD only, like 6498287696d, meaning it
will only appear in PG19.

Agreed on no back-patch. It seems sensible to slip this into v19
rather than wait for v20, since the corresponding fix in corr()
is new as of 19. But since that's a judgment call, maybe you should
get the concurrence of the RMT.

Yes, that was my thinking. Does anyone from the RMT (cc'ed) have any
objection to slipping this into v19?

Regards,
Dean

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: [PATCH] Fix overflow and underflow in regr_r2()

BTW, on the principle of "where else did we make the same mistake",
I looked through the other aggregates using float8_regr_accum.
Most seem okay, but float8_regr_intercept does this:

PG_RETURN_FLOAT8((Sy - Sx * Sxy / Sxx) / N);

Seems to me that expression is also prone to internal
overflow/underflow. Underflow probably isn't a huge issue,
since the result will reduce to Sy/N which is likely to be good
enough. But can we do anything about overflow?

One simple change that might make things better is to compute

PG_RETURN_FLOAT8((Sy - Sx * (Sxy / Sxx)) / N);

on the theory that the sums of products are likely to both be large.

regards, tom lane

#6Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#5)
Re: [PATCH] Fix overflow and underflow in regr_r2()

On Sat, 16 May 2026 at 17:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, on the principle of "where else did we make the same mistake",
I looked through the other aggregates using float8_regr_accum.
Most seem okay, but float8_regr_intercept does this:

PG_RETURN_FLOAT8((Sy - Sx * Sxy / Sxx) / N);

Seems to me that expression is also prone to internal
overflow/underflow. Underflow probably isn't a huge issue,
since the result will reduce to Sy/N which is likely to be good
enough. But can we do anything about overflow?

One simple change that might make things better is to compute

PG_RETURN_FLOAT8((Sy - Sx * (Sxy / Sxx)) / N);

on the theory that the sums of products are likely to both be large.

Hmm, that isn't necessarily better. For example, with this data:

WITH t(x,y) AS (
SELECT 1e-155 + g*1e-160, 1e155 + g*1e150
FROM generate_series(1,10) g
)
SELECT sum(x::float8) sx, sum(y::float8) sy,
regr_sxx(y,x), regr_syy(y,x), regr_sxy(y,x),
regr_intercept(y,x)
FROM t;

sx | sy | regr_sxx |
regr_syy | regr_sxy | regr_intercept
-------------------------+---------------+--------------+------------------------+-----------------------+-------------------------
1.0000550000000001e-154 | 1.000055e+156 | 8.24996e-319 |
8.249999999970085e+301 | 8.249999999965278e-09 |
-5.144448004587567e+149
(1 row)

The current regr_intercept() code works fine, but if you were to
attempt to calculate Sxy / Sxx first, it would overflow.

I think probably the least likely to overflow computation would be Sxy
* (Sx / Sxx), because Sxx is likely to be very large/small whenever Sx
is, so Sx / Sxx seems unlikely to overflow. There may well be examples
disproving that theory too though, so maybe it needs to try multiple
orderings.

Regards,
Dean

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#6)
Re: [PATCH] Fix overflow and underflow in regr_r2()

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On Sat, 16 May 2026 at 17:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:

One simple change that might make things better is to compute
PG_RETURN_FLOAT8((Sy - Sx * (Sxy / Sxx)) / N);
on the theory that the sums of products are likely to both be large.

Hmm, that isn't necessarily better.

True. We could do something like

(Sy - exp(log(Sx) + log(Sxy) - log(Sxx))) / N

but this'd require additional special-case code to deal with zero
or negative Sx and Sxy, so it's feeling rather tedious.

regards, tom lane

#8Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Chengpeng Yan (#1)
Re: [PATCH] Fix overflow and underflow in regr_r2()

On Sat, 16 May 2026 at 19:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On Sat, 16 May 2026 at 17:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:

One simple change that might make things better is to compute
PG_RETURN_FLOAT8((Sy - Sx * (Sxy / Sxx)) / N);
on the theory that the sums of products are likely to both be large.

Hmm, that isn't necessarily better.

True. We could do something like

(Sy - exp(log(Sx) + log(Sxy) - log(Sxx))) / N

but this'd require additional special-case code to deal with zero
or negative Sx and Sxy, so it's feeling rather tedious.

I just had a thought: a simpler (and probably faster and more
accurate) solution would be to use frexp() and ldexp(), which are both
part of C99, so ought to be OK.

I don't think we can ignore underflow, because Sy might be exactly
zero, in which case the second term gives the whole result.

So, in rough terms, I'm thinking of something like this:

offset = Sx * Sxy / Sxx
if (offset == 0 || isinf(offset))
{
double m_Sx, m_Sxy, m_Sxx, m_offset;
int n_Sx, n_Sxy, n_Sxx, n_offset;

m_Sx = frexp(Sx, &n_Sx);
m_Sxy = frexp(Sxy, &n_Sxy);
m_Sxx = frexp(Sxx, &n_Sxx);

m_offset = m_Sx * m_Sxy / m_Sxx;
n_offset = n_Sx + n_Sxy - n_Sxx;
offset = ldexp(m_offset, n_offset);
}

PG_RETURN_FLOAT8((Sy - offset) / N);

Regards,
Dean

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#8)
Re: [PATCH] Fix overflow and underflow in regr_r2()

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

I just had a thought: a simpler (and probably faster and more
accurate) solution would be to use frexp() and ldexp(), which are both
part of C99, so ought to be OK.

Seems like a plan. We're already relying on ldexp() in pg_prng.c,
so I doubt there's a portability issue. Reading the man page for
frexp(), we might want to special-case Inf and NaN inputs to avoid
assuming what it will do with those. But that would only be needed
in the slow path where we're recovering from overflow/underflow.

regards, tom lane

#10Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#9)
Re: [PATCH] Fix overflow and underflow in regr_r2()

On Sat, 16 May 2026 at 22:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

I just had a thought: a simpler (and probably faster and more
accurate) solution would be to use frexp() and ldexp(), which are both
part of C99, so ought to be OK.

Seems like a plan. We're already relying on ldexp() in pg_prng.c,
so I doubt there's a portability issue. Reading the man page for
frexp(), we might want to special-case Inf and NaN inputs to avoid
assuming what it will do with those. But that would only be needed
in the slow path where we're recovering from overflow/underflow.

OK, here's a more complete patch along those lines, intended to apply
on top of the regr_r2() patch.

I did wonder whether the final subtraction step could overflow, before
we divide by N, but I don't think it can. Except for horizontal and
vertical lines, Sxx and Syy grow quadratically compared to Sx and Sy,
so they overflow first, and it doesn't look possible for the intercept
to exceed around 1e169, other than for a horizontal line.

Regards,
Dean

Attachments:

regr_intercept.patchtext/x-patch; charset=US-ASCII; name=regr_intercept.patchDownload+44-2
#11Chengpeng Yan
chengpeng_yan@outlook.com
In reply to: Dean Rasheed (#2)
Re: [PATCH] Fix overflow and underflow in regr_r2()

Hi,

On May 16, 2026, at 17:39, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

corr() already has a stabilized calculation for the same Sxx * Syy
denominator scale. This patch factors that into a helper and lets
regr_r2() use it as a fallback when one of its direct products has
rounded to zero or infinity. Otherwise, regr_r2() keeps the existing
direct formula.

The comments need work -- in particular float8_regr_r2() needs a
comment explaining the new overflow/underflow checks, similar to the
comment in float8_corr(). In fact, doing that, I think it's preferable
to just keep this change local to float8_regr_r2(), rather than
refactoring into a helper function for just a few lines of code.

This new check in float8_regr_r2():

+   if (Sxy == 0 && !isnan(Sxx) && !isnan(Syy))
+       PG_RETURN_FLOAT8(0.0);

seems pointless. It's optimising for a special case that will very
rarely occur in practice, and which is handled fine by the general
code. We don't want to slow down the common code path for such rare
special cases.

I noticed that this new overflow test case:

+SELECT regr_r2(1e154::float8 * g, 1e154::float8 * g)
+  FROM generate_series(1, 2) g;
+ regr_r2
+---------
+       1
+(1 row)

only produces 1 because it's run with a reduced extra_float_digits
value. I think it's better to use the test values "1e100 + g * 1e95",
which still trigger the overflow on HEAD, but more reliably produce 1,
regardless of the extra_float_digits setting, making it less likely
that there will be variations between platforms. That's also more
consistent with the other nearby test cases.

Thanks for the thorough review and feedback — I learned a lot from it!

Attached is a v2 patch with those changes, plus a little more tidying
up of the regression tests.

v2 LGTM. Thanks for the updates and test cleanup.

--
Best regards,
Chengpeng Yan

#12Chengpeng Yan
chengpeng_yan@outlook.com
In reply to: Dean Rasheed (#10)
Re: [PATCH] Fix overflow and underflow in regr_r2()

Hi,

On May 17, 2026, at 17:16, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

OK, here's a more complete patch along those lines, intended to apply
on top of the regr_r2() patch.

Thanks for the regr_intercept.patch. The approach looks good to me.

I only noticed a few small things:

1. The patch file seems to have a format issue and doesn't apply
directly. `git apply` reports:

```
error: git apply: bad git-diff - expected /dev/null on line 2
```

2. `dy` seems a bit hard to understand. Perhaps `offset`, as used in the
earlier sketch, would be clearer?

3. Do we need to add tests for the underflow path, and perhaps for the
Inf/NaN guard?

Other than that, this looks good to me.

--
Best regards,
Chengpeng Yan

#13Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Chengpeng Yan (#12)
Re: [PATCH] Fix overflow and underflow in regr_r2()

On Sat, 23 May 2026 at 03:42, Chengpeng Yan <chengpeng_yan@outlook.com> wrote:

Thanks for the regr_intercept.patch. The approach looks good to me.

Thanks for reviewing, and sorry for the delay getting back to you.

2. `dy` seems a bit hard to understand. Perhaps `offset`, as used in the
earlier sketch, would be clearer?

[Shrug] I think dy is common enough to denote a difference in
y-values, and it seems clear enough, given the large comment above it.

3. Do we need to add tests for the underflow path, and perhaps for the
Inf/NaN guard?

Yeah, I think it makes sense to include a test with underflow, since
that really can lead to a large relative error. I don't think it's
worth testing the Inf/NaN guard, since that's more about avoiding
operating on technically uninitialised variables, and I don't believe
that it actually affects the results.

I've add this test case:

SELECT regr_intercept(y, x) FROM (VALUES (-1e-131, 0), (2e-131,
3e-131)) v(x, y);

Here, directly computing Sx * Sxy / Sxx causes an underflow to zero,
while the correct result should be 1e-131. Since Sy is 3e-131, this
makes a noticeable difference to the final result (without the patch,
it returns an intercept of 1.5e-131, whereas with the patch, it
correctly returns 1e-131).

If there are no objections from the RMT, I'll push both of these (to
HEAD only) in a couple of days or so.

Regards,
Dean

Attachments:

v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patchDownload+87-23
v2-0002-Improve-overflow-underflow-handling-in-regr_inter.patchtext/x-patch; charset=US-ASCII; name=v2-0002-Improve-overflow-underflow-handling-in-regr_inter.patchDownload+51-3
#14Chengpeng Yan
chengpeng_yan@outlook.com
In reply to: Dean Rasheed (#13)
Re: [PATCH] Fix overflow and underflow in regr_r2()

Hi,

On May 28, 2026, at 20:37, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

[Shrug] I think dy is common enough to denote a difference in
y-values, and it seems clear enough, given the large comment above it.

Makes sense, thanks for the clarification.

3. Do we need to add tests for the underflow path, and perhaps for the
Inf/NaN guard?

Yeah, I think it makes sense to include a test with underflow, since
that really can lead to a large relative error. I don't think it's
worth testing the Inf/NaN guard, since that's more about avoiding
operating on technically uninitialised variables, and I don't believe
that it actually affects the results.

I've add this test case:

SELECT regr_intercept(y, x) FROM (VALUES (-1e-131, 0), (2e-131,
3e-131)) v(x, y);

Here, directly computing Sx * Sxy / Sxx causes an underflow to zero,
while the correct result should be 1e-131. Since Sy is 3e-131, this
makes a noticeable difference to the final result (without the patch,
it returns an intercept of 1.5e-131, whereas with the patch, it
correctly returns 1e-131).

If there are no objections from the RMT, I'll push both of these (to
HEAD only) in a couple of days or so.

Regards,
Dean
<v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patch><v2-0002-Improve-overflow-underflow-handling-in-regr_inter.patch>

v2 LGTM.

I ran the regression tests locally on Apple Silicon as well, and all
tests passed.

--
Best regards,
Chengpeng Yan

#15Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Chengpeng Yan (#14)
Re: [PATCH] Fix overflow and underflow in regr_r2()

On Fri, 29 May 2026 at 05:23, Chengpeng Yan <chengpeng_yan@outlook.com> wrote:

On May 28, 2026, at 20:37, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

If there are no objections from the RMT, I'll push both of these (to
HEAD only) in a couple of days or so.

v2 LGTM.

I ran the regression tests locally on Apple Silicon as well, and all
tests passed.

OK, I have pushed both patches. Thanks for reviewing and testing.

Regards,
Dean