BUG #19340: Wrong result from CORR() function
The following bug has been logged on the website:
Bug reference: 19340
Logged by: Oleg Ivanov
Email address: o15611@gmail.com
PostgreSQL version: 18.1
Operating system: all
Description:
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,24) ;
corr
------
(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,25) ;
corr
---------------------
-0.1877294297321991
(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,31) ;
corr
----------------------
-0.03366532289960463
(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,32) ;
corr
----------------------
0.037939234274452456
(1 row)
Another example:
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,36) ;
corr
------
(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,37) ;
corr
---------------------
0.37167954745944803
(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,113) ;
corr
----------------------
0.022884787550167176
(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,114) ;
corr
-----------------------
-0.004981175111303341
(1 row)
In the Oracle Database:
SQL> select corr( 0.09 , 0.09000001 ) FROM (select rownum as id from dual
connect by level <=330);
CORR(0.09,0.09000001)
---------------------
If the arguments are constant, the CORR() function must return 0 or NaN.
Consequences of this bug: statistical functions are used to make business
decisions. Wrong and completely different results can lead to mistakes.
On Tue, 2025-12-02 at 07:57 +0000, PG Bug reporting form wrote:
The following bug has been logged on the website (bug reference 19340):
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,24) ;
corr
------
(1 row)
If the arguments are constant, the CORR() function must return 0 or NaN.
Consequences of this bug: statistical functions are used to make business
decisions. Wrong and completely different results can lead to mistakes.
The documentation is pretty clear about that:
In all cases, null is returned if the computation is meaningless, for example when N is zero.
Yours,
Laurenz Albe
Yes, it must be NULL in all the queries I have provided!
But PostgreSQL corr() returns numbers, which is incorrect.
Oleg Ivanov <o15611@gmail.com> writes:
Yes, it must be NULL in all the queries I have provided!
But PostgreSQL corr() returns numbers, which is incorrect.
Yeah, looks like roundoff error to me. In your example
SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,25) ;
at the end of float8_corr we have
3754 PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
(gdb) i locals
transarray = <optimized out>
transvalues = 0x1b96da8
N = 25
Sxx = 3.2869204384208827e-34
Syy = 9.3266240309214617e-33
Sxy = -3.2869204384208827e-34
where ideally those three values would be zero (and we would have
fallen out with a NULL result at the preceding line).
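For the record, dividing those residuals out per the formula above reproduces
the reported value, which shows the answer is purely a ratio of rounding noise:
    Sxy / sqrt(Sxx * Syy)
      = -3.2869e-34 / sqrt(3.2869e-34 * 9.3266e-33)
      ≈ -3.2869e-34 / 1.7509e-33
      ≈ -0.1877
matching the -0.1877294297321991 in the report.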
It's fundamentally impossible to guarantee exact results with
floating-point arithmetic, so if you are expecting that you need
to readjust your expectations. But having said that, it does
seem a bit sad that we can't detect constant-input cases exactly.
I wonder whether it'd be worth carrying additional state to
check that explicitly (instead of assuming that "if (Sxx == 0 ||
Syy == 0)" will catch it).
You might find the previous discussion interesting:
/messages/by-id/153313051300.1397.9594490737341194671@wrigleys.postgresql.org
regards, tom lane
On Tue, 2025-12-02 at 17:05 +0300, Oleg Ivanov wrote:
Yes, it must be NULL in all the queries I have provided!
But PostgreSQL corr() returns numbers, which is incorrect.
I see.
I guess these are rounding errors, which are to be expected with
floating point numbers. But perhaps we could do better.
Yours,
Laurenz Albe
On Tue, 2 Dec 2025 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It's fundamentally impossible to guarantee exact results with
floating-point arithmetic, so if you are expecting that you need
to readjust your expectations. But having said that, it does
seem a bit sad that we can't detect constant-input cases exactly.
Yes, indeed. I tested the following query:
SELECT n,
(SELECT variance(1.3::float8) FROM generate_series(1, n)),
(SELECT corr(1.3, 1.3) FROM generate_series(1, n))
FROM generate_series(1, 10) g(n);
In v11 (with the old algorithm) this produces
n | variance | corr
----+----------------------+------
1 | |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 3.5527136788005e-16 | 1
6 | 2.368475785867e-16 | 1
7 | 3.38353683695286e-16 | 1
8 | 0 |
9 | 0 |
10 | 0 |
(10 rows)
whereas in HEAD (with the Youngs-Cramer algorithm) it produces
n | variance | corr
----+------------------------+------
1 | |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 0 |
6 | 5.259072701473412e-33 | 1
7 | 4.382560584561177e-33 | 1
8 | 3.756480501052437e-33 | 1
9 | 3.2869204384208825e-33 | 1
10 | 6.817316464872942e-33 | 1
(10 rows)
so the errors in the variance are smaller, but any non-zero error
makes the correlation completely wrong.
I wonder whether it'd be worth carrying additional state to
check that explicitly (instead of assuming that "if (Sxx == 0 ||
Syy == 0)" will catch it).
I wondered the same thing. It's not nice to have to do that, but
clearly the existing test for constant inputs is no good. The question
is, do we really want to spend extra cycles on every query just to
catch this odd corner case?
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
On Tue, 2 Dec 2025 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder whether it'd be worth carrying additional state to
check that explicitly (instead of assuming that "if (Sxx == 0 ||
Syy == 0)" will catch it).
I wondered the same thing. It's not nice to have to do that, but
clearly the existing test for constant inputs is no good. The question
is, do we really want to spend extra cycles on every query just to
catch this odd corner case?
I experimented with the attached patch, which is very incomplete;
I just carried it far enough to be able to run performance checks on
the modified code, and so all the binary statistics aggregates except
corr() are broken. I observe about 2% slowdown on this test case:
SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,100000000);
I think that any real-world usage is going to expend more effort
obtaining the input data than this test does, so 2% should be a
conservative upper bound on the cost. Seems to me that getting
NULL-or-not right is probably worth a percent or so.
If anyone feels differently, another idea could be to use a
separate state transition function for corr() that skips the
accumulation steps that corr() doesn't use. But I agree with
the pre-existing decision to use just one transition function
for all the binary aggregates.
If this seems like a reasonable approach, I'll see about finishing
out the patch.
regards, tom lane
Attachments:
wip-detect-constant-input-exactly.patch (text/x-diff, +48/-17)
On Tue, 2 Dec 2025 at 20:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I experimented with the attached patch, which is very incomplete;
I just carried it far enough to be able to run performance checks on
the modified code, and so all the binary statistics aggregates except
corr() are broken. I observe about 2% slowdown on this test case:
I played around with having just 2 extra array elements, constX and
constY equal to the common value if all the values are the same, and
NaN otherwise. For me, that was slightly faster, which I put down to
floating point comparison being faster than converting back and forth
between floating point and integer. Either way, it seems like a
difference that no one is likely to notice.
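In sketch form, the accumulator check might look like this (the state names
are illustrative, with constX/constY seeded from the first input pair):
    if (newvalX != constX || isnan(newvalX))
        constX = get_float8_nan();   /* X varied (or went NaN) */
    if (newvalY != constY || isnan(newvalY))
        constY = get_float8_nan();   /* likewise for Y */
and the final functions test isnan() on those slots to detect horizontal
and vertical lines exactly.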
Doing it that way does lead to one difference though: all-NaN input
leads to a NaN result, whereas your patch produces NULL for that case.
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
I played around with having just 2 extra array elements, constX and
constY equal to the common value if all the values are the same, and
NaN otherwise.
Hmm.
Doing it that way does lead to one difference though: all-NaN input
leads to a NaN result, whereas your patch produces NULL for that case.
Yeah, I did it as I did precisely because I wanted all-NaN-input to be
seen as a constant. But you could make an argument that NaN is not
really a fixed value but has more kinship to the "we don't know what
the value is" interpretation of SQL NULL. In that case your proposal
is semantically reasonable on the grounds that maybe the NaNs aren't
really all equal, and I agree it ought to be a little faster than
mine.
regards, tom lane
On Tue, 2 Dec 2025 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
I played around with having just 2 extra array elements, constX and
constY equal to the common value if all the values are the same, and
NaN otherwise.
Hmm.
Doing it that way does lead to one difference though: all-NaN input
leads to a NaN result, whereas your patch produces NULL for that case.
Yeah, I did it as I did precisely because I wanted all-NaN-input to be
seen as a constant. But you could make an argument that NaN is not
really a fixed value but has more kinship to the "we don't know what
the value is" interpretation of SQL NULL.
I think that's more consistent with the general policy that most
(all?) math functions have, where if any input is NaN, the result is
NaN.
Regards,
Dean
I tried the attached realization of your idea, and it does seem
very marginally faster than what I did; but it's like a 1.8%
slowdown instead of 2%. Might be different on a different
machine of course. But I think we should choose on the basis
of which semantics we like better, rather than such a tiny
performance difference.
I'm coming around to the conclusion that your way is better,
though. It seems good that "any NaN in the input results in
NaN output", which your way does and mine doesn't.
regards, tom lane
Attachments:
wip-detect-constant-input-exactly-2.patch (text/x-diff, +41/-11)
I wrote:
I'm coming around to the conclusion that your way is better,
though. It seems good that "any NaN in the input results in
NaN output", which your way does and mine doesn't.
Poking further at this, I found that my v2 patch fails that principle
in one case:
regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
corr
------
(1 row)
We see that Y is constant and therefore return NULL, despite the
other NaN input.
I think we can fix that along these lines:
@@ -3776,8 +3776,12 @@ float8_corr(PG_FUNCTION_ARGS)
     if (N < 1.0)
         PG_RETURN_NULL();
 
-    /* per spec, return NULL for horizontal and vertical lines */
-    if (!isnan(commonX) || !isnan(commonY))
+    /*
+     * per spec, return NULL for horizontal and vertical lines; but not if the
+     * result would otherwise be NaN
+     */
+    if ((!isnan(commonX) || !isnan(commonY)) &&
+        (!isnan(Sxx) && !isnan(Syy)))
         PG_RETURN_NULL();
 
     /* at this point, Sxx and Syy cannot be zero or negative */
(don't think it should be necessary to also check Sxy)
BTW, HEAD is inconsistent: it will return NaN for this example, but
only because it's confused by roundoff error into thinking that Y
isn't constant. With few enough inputs, it produces NULL too:
regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,3) g;
corr
------
(1 row)
regards, tom lane
On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Poking further at this, I found that my v2 patch fails that principle
in one case:
regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
corr
------
(1 row)
We see that Y is constant and therefore return NULL, despite the
other NaN input.
I think we can fix that along these lines:
@@ -3776,8 +3776,12 @@ float8_corr(PG_FUNCTION_ARGS)
     if (N < 1.0)
         PG_RETURN_NULL();
 
-    /* per spec, return NULL for horizontal and vertical lines */
-    if (!isnan(commonX) || !isnan(commonY))
+    /*
+     * per spec, return NULL for horizontal and vertical lines; but not if the
+     * result would otherwise be NaN
+     */
+    if ((!isnan(commonX) || !isnan(commonY)) &&
+        (!isnan(Sxx) && !isnan(Syy)))
         PG_RETURN_NULL();
 
     /* at this point, Sxx and Syy cannot be zero or negative */
I think that would be more readable as 2 separate "if" statements,
something like:
/* if any input is NaN, return NaN */
if (isnan(Sxx) || isnan(Syy))
    PG_RETURN_FLOAT8(get_float8_nan());

/* per spec, return NULL for horizontal and vertical lines */
if (!isnan(commonX) || !isnan(commonY))
    PG_RETURN_NULL();
so that we're more explicit about the NaN-handling semantics.
Also, given that any NaN input results in NaN output, regardless of
whether or not the inputs are all the same, the values for commonX and
commonY don't matter if any input is NaN. I think that allows the
accumulator function's tests to be simplified (no need to test if new
values are NaN).
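Concretely, the simplified test would be just
    if (newvalX != commonX)
        commonX = get_float8_nan();
relying on IEEE 754 semantics, under which "x != y" is true whenever either
operand is NaN (how far to trust compilers on that point comes up below).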
Another case to consider is this:
SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;
Here Sxx and Syy end up being zero, so the current code returns NULL,
but with the patch it returns NaN (because Sxy is also zero). It's not
quite obvious what to do in this case (the correct answer is 1, but
there's no way of reliably computing that with double precision
arithmetic). My first thought is that we should probably try to limit
NaN results to NaN inputs, and so it would be better to continue to
return NULL for cases like this where either Sxx or Syy are zero, even
though the inputs weren't quite constant.
Then there's this:
SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
product Sxx * Syy underflows to zero. So in both HEAD and with the
patch, this returns Infinity. That's not good, given that the
correlation coefficient is supposed to lie in the range [-1,1].
The correct answer in this case is also 1, which could be achieved by
taking the square roots of Sxx and Syy separately, before multiplying,
but it might also be sensible to clamp the result to the range [-1,1].
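A sketch of that combination (illustrative, reusing the names from
float8_corr above):
    /* separate square roots: no Sxx * Syy product to under- or overflow */
    result = Sxy / (sqrt(Sxx) * sqrt(Syy));
    /* roundoff can still push the ratio marginally outside [-1, 1] */
    if (result > 1.0)
        result = 1.0;
    else if (result < -1.0)
        result = -1.0;
    PG_RETURN_FLOAT8(result);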
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Poking further at this, I found that my v2 patch fails that principle
in one case:
regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
corr
------
(1 row)
We see that Y is constant and therefore return NULL, despite the
other NaN input.
I think that would be more readable as 2 separate "if" statements,
Actually, in the light of morning I think this behavior is probably
fine. This example treads on two different edge-cases, one where
we're supposed to return NULL and one where we're supposed to return
NaN, and it's not that open-and-shut which rule should win. A
counterexample here is float8_covar_samp: that returns NULL if there
are less than two input values, and will still do so if there is one
input that is NaN. Should we change that? I don't think so.
The fact that HEAD returns NULL for this corr() case as long as
there is not enough input to create a roundoff problem also suggests
to me that that's the behavior to stick with.
So I'm now thinking that the NULL-output rule should win in cases
where they are both applicable. However, I don't know if we want
to take that so far as to say that constant-NaN input should
produce a NULL; the argument that NaN isn't really a constant
still has some force in my mind. What do you think?
Another case to consider is this:
SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;
Here Sxx and Syy end up being zero, so the current code returns NULL,
but with the patch it returns NaN (because Sxy is also zero). It's not
quite obvious what to do in this case (the correct answer is 1, but
there's no way of reliably computing that with double precision
arithmetic). My first thought is that we should probably try to limit
NaN results to NaN inputs, and so it would be better to continue to
return NULL for cases like this where either Sxx or Syy are zero, even
though the inputs weren't quite constant.
Good example. I had wondered whether to retain the final test for zero
Sxx/Syy, and this shows that we do need that (along with a comment
explaining that roundoff error could produce zeroes even though we
know the inputs weren't constant).
Then there's this:
SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
product Sxx * Syy underflows to zero. So in both HEAD and with the
patch, this returns Infinity. That's not good, given that the
correlation coefficient is supposed to lie in the range [-1,1].
The correct answer in this case is also 1, which could be achieved by
taking the square roots of Sxx and Syy separately, before multiplying,
but it might also be sensible to clamp the result to the range [-1,1].
I like the idea of taking the square roots separately. I believe
sqrt() is a hardware operation on pretty much any machine people still
care about, and we're doing this part only once per aggregation.
So the cost seems fairly minimal.
regards, tom lane
Attached is a fleshed-out patch proposal that fixes the related
aggregates and adds test cases.
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
Also, given that any NaN input results in NaN output, regardless of
whether or not the inputs are all the same, the values for commonX and
commonY don't matter if any input is NaN. I think that allows the
accumulator function's tests to be simplified (no need to test if new
values are NaN).
I'm not convinced about that. I have it as
if (newvalX != commonX || isnan(newvalX))
    commonX = get_float8_nan();
and I think you're saying we could just write
if (newvalX != commonX)
    commonX = get_float8_nan();
My concern is that if newvalX is NaN but commonX isn't, then
I believe that the non-NaN-aware "!=" test is supposed to return
false, which'd cause us to not update commonX to NaN as required.
Maybe we could make it work by writing
if (!(newvalX == commonX))
    commonX = get_float8_nan();
but I don't have a lot of faith in C compilers getting that right.
Then there's this:
SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
The correct answer in this case is also 1, which could be achieved by
taking the square roots of Sxx and Syy separately, before multiplying,
but it might also be sensible to clamp the result to the range [-1,1].
Poking at this, I soon found a test case where even with the separate
sqrt() calls we'd produce a result slightly outside [-1, 1] (running
this test over more values of x is sufficient). So now I think we
should do both the separate sqrt and the clamp.
regards, tom lane
Attachments:
v1-detect-constant-input-exactly.patch (text/x-diff, +235/-69)
I wrote:
Poking at this, I soon found a test case where even with the separate
sqrt() calls we'd produce a result slightly outside [-1, 1] (running
this test over more values of x is sufficient). So now I think we
should do both the separate sqrt and the clamp.
Per CI results, on some platforms the roundoff error is different from
what I observe, producing a value just less than 1 rather than just
more. That doesn't invalidate needing the clamp, but it does mean
that we can't use that test case just like that. I'm inclined to
remove the change of extra_float_digits, but keep the test case.
regards, tom lane
On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Attached is a fleshed-out patch proposal that fixes the related
aggregates and adds test cases.
Looking at float8_regr_accum(), I think it would be preferable to
arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
values are seen. I.e., something like this:
if (newvalX != commonX || isnan(newvalX))
    commonX = get_float8_nan();
if (newvalY != commonY || isnan(newvalY))
    commonY = get_float8_nan();

if (isnan(commonX) || isnan(commonY))
{
    tmpX = newvalX * N - Sx;
    tmpY = newvalY * N - Sy;
    scale = 1.0 / (N * transvalues[0]);

    if (isnan(commonX))
        Sxx += tmpX * tmpX * scale;
    if (isnan(commonY))
        Syy += tmpY * tmpY * scale;
    if (isnan(commonX) && isnan(commonY))
        Sxy += tmpX * tmpY * scale;

    ... overflow check ...
}
This would mean that float8_corr(), float8_regr_r2(),
float8_regr_slope(), and float8_regr_intercept() would not need to
look at commonX or commonY, and could simply rely on Sxx == 0 or Syy
== 0 to detect horizontal and vertical lines.
Aside from making the code simpler, this would guarantee that the
aggregate functions regr_sxx() and regr_syy() would return exactly
zero for all-constant X and Y inputs respectively, and that
regr_sxy(), covar_pop(), and covar_samp() would return exactly zero if
either the X or the Y inputs were all constant.
Something else that occurred to me was that float8_regr_avgx() and
float8_regr_avgy() might as well make use of commonX and commonY,
since we're calculating them, so they would return exact averages if
all the X or Y values were the same, rather than results with possible
rounding errors.
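For example, float8_regr_avgx() could end with something like this sketch
(names as in the excerpt above):
    /* if X never varied, commonX is the exact average */
    if (!isnan(commonX))
        PG_RETURN_FLOAT8(commonX);
    PG_RETURN_FLOAT8(Sx / N);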
I also wonder if it would be worth doing something similar for the
single-variable aggregates so that var_pop(), var_samp(),
stddev_pop(), and stddev_samp() would all return exactly zero, and
avg() would return the exact common value, if all the inputs were
constant.
Regards,
Dean
On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Poking at this, I soon found a test case where even with the separate
sqrt() calls we'd produce a result slightly outside [-1, 1] (running
this test over more values of x is sufficient). So now I think we
should do both the separate sqrt and the clamp.
I'm starting to have doubts about having 2 sqrt() calls. The problem
is that it seems to produce a noticeable reduction in accuracy in
quite a few cases. This is especially noticeable with fully-correlated
data. For example:
SELECT n, (SELECT corr(x, x) FROM generate_series(1, n) x)
FROM generate_series(1, 10) g(n);
n | corr
----+--------------------
1 |
2 | 0.9999999999999998
3 | 0.9999999999999998
4 | 0.9999999999999998
5 | 0.9999999999999998
6 | 1
7 | 0.9999999999999999
8 | 1
9 | 0.9999999999999999
10 | 1
(10 rows)
Now I'm not sure that the current code can be expected to get cases
like this exactly right 100% of the time, but it's pretty close. For
example, if I do this:
WITH t1 AS (
SELECT n, random() * 1000 AS r FROM generate_series(1, 1000000) n
), t2 AS (
SELECT corr(r, r) FROM t1 GROUP BY n % 10000
)
SELECT count(*), count(*) FILTER (WHERE corr != 1) FROM t2;
on HEAD it produced corr = 1 every time I ran it, whereas the patch
gives rounding errors roughly 25% of the time, which seems likely to
be noticed.
Perhaps we should only use 2 sqrt()'s if the product Sxx * Syy overflows.
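Something like this sketch, say, which covers both underflow to zero and
overflow to Infinity (Sxx and Syy are known to be nonzero at this point):
    prod = Sxx * Syy;
    if (prod > 0.0 && isfinite(prod))
        denom = sqrt(prod);             /* the common, more accurate path */
    else
        denom = sqrt(Sxx) * sqrt(Syy);  /* product misbehaved; split it */
    PG_RETURN_FLOAT8(Sxy / denom);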
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
Looking at float8_regr_accum(), I think it would be preferable to
arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
values are seen. I.e., something like this:
That seems like a good idea. I was initially worried that the extra
isnan() checks would slow down aggregation noticeably in the normal
case where we soon discover that the inputs aren't all equal. They
don't seem to though. For me, the attached v2 clocks in at only 0.9%
slower than HEAD, IOW actually faster than v1; which I suspect just
proves that what we're trying to measure here is comparable to the
noise threshold.
I took your other suggestions too, except for:
I also wonder if it would be worth doing something similar for the
single-variable aggregates so that var_pop(), var_samp(),
stddev_pop(), and stddev_samp() would all return exactly zero, and
avg() would return the exact common value, if all the inputs were
constant.
I'm less excited about this, because for all those aggregates,
you have the option of using the numeric variant if you're
dissatisfied with the accuracy of the float8 variant. Also,
given that the per-input work is substantially less, the overhead
of tracking the common input value would probably be noticeably
greater. If somebody else wants to investigate that, I won't
stand in the way, but I don't want to do it.
regards, tom lane
Attachments:
v2-detect-constant-input-exactly.patch (text/x-diff, +256/-84)
I wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
Looking at float8_regr_accum(), I think it would be preferable to
arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
values are seen. I.e., something like this:
That seems like a good idea. I was initially worried that the extra
isnan() checks would slow down aggregation noticeably in the normal
case where we soon discover that the inputs aren't all equal.
BTW, re-reading the patch, I now think we should drop the initial
if (isnan(commonX) || isnan(commonY))
test, instead bulling ahead with computing tmpX/tmpY/scale, and
only skip the updates of Sxx/Syy/Sxy when we have constant inputs.
Using that initial test is optimizing for constant inputs at the
expense of non-constant inputs, which seems like the wrong way
to bet.
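In other words, roughly this (a sketch; transvalues[0] holds the
pre-increment N, as in the existing accumulator):
    if (transvalues[0] > 0.0)
    {
        tmpX = newvalX * N - Sx;
        tmpY = newvalY * N - Sy;
        scale = 1.0 / (N * transvalues[0]);
        if (isnan(commonX))
            Sxx += tmpX * tmpX * scale;
        if (isnan(commonY))
            Syy += tmpY * tmpY * scale;
        if (isnan(commonX) && isnan(commonY))
            Sxy += tmpX * tmpY * scale;
    }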
regards, tom lane