Aggregate Function corr does not always return the correct value

Started by Maxim Orlov5 months ago4 messages
#1Maxim Orlov
orlovmg@gmail.com

Hi!

One of the clients complained as to why the query for calculating the
correlation coefficient with the CORR function yielded such weird
results. After a little analysis, it was discovered that they were
calculating the correlation coefficient for two sets, one of which is
more or less random and the other of which is simply a set of constant
values (0.09 if that matters). As a result, they were attaining
unexpected results. However, as far as I am aware, they should have
received NULL because it is impossible to calculate the standard
deviation for such a set.

It turns out that for some values, the function does not return NULL.
Here is a reproducing of the problem:

==============================
postgres=# WITH dataset AS (SELECT x, 0.125 AS y FROM generate_series(0, 5)
AS x) SELECT corr(x, y) FROM dataset;
corr
------

(1 row)

postgres=# WITH dataset AS (SELECT x, 0.1 AS y FROM generate_series(0, 5)
AS x) SELECT corr(x, y) FROM dataset;
corr
--------------------
0.6546536707079771
(1 row)

==============================

I'm not sure how to fix that yet, but the issue is that the Sxx and Syy
variables in the float8_corr function are very close to 0, but not zero,
so we can't return NULL.

--
Best regards,
Maxim Orlov.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Maxim Orlov (#1)
Re: Aggregate Function corr does not always return the correct value

Maxim Orlov <orlovmg@gmail.com> writes:

One of the clients complained as to why the query for calculating the
correlation coefficient with the CORR function yielded such weird
results. After a little analysis, it was discovered that they were
calculating the correlation coefficient for two sets, one of which is
more or less random and the other of which is simply a set of constant
values (0.09 if that matters). As a result, they were attaining
unexpected results. However, as far as I am aware, they should have
received NULL because it is impossible to calculate the standard
deviation for such a set.

[ shrug... ] Calculations with float8 are inherently inexact, so
it's unsurprising that we sometimes fail to detect that the input
is exactly a horizontal or vertical line. I don't think there is
anything to be done here that wouldn't end in making things worse.

regards, tom lane

#3DINESH  NAIR
Dinesh_Nair@iitmpravartak.net
In reply to: Tom Lane (#2)
Re: Aggregate Function corr does not always return the correct value

Hi,

Try casting it to numeric or use an expression that avoids floating-point rounding off :

WITH dataset AS (
SELECT x, CAST(0.125 AS numeric) AS y
FROM generate_series(0, 5) AS x
)
SELECT corr(x, y) FROM dataset;

Thanks & Regards

Dinesh Nair

________________________________
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Tuesday, August 26, 2025 11:04 PM
To: Maxim Orlov <orlovmg@gmail.com>
Cc: Postgres hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Aggregate Function corr does not always return the correct value

Caution: This email was sent from an external source. Please verify the sender’s identity before clicking links or opening attachments.

Maxim Orlov <orlovmg@gmail.com> writes:

One of the clients complained as to why the query for calculating the
correlation coefficient with the CORR function yielded such weird
results. After a little analysis, it was discovered that they were
calculating the correlation coefficient for two sets, one of which is
more or less random and the other of which is simply a set of constant
values (0.09 if that matters). As a result, they were attaining
unexpected results. However, as far as I am aware, they should have
received NULL because it is impossible to calculate the standard
deviation for such a set.

[ shrug... ] Calculations with float8 are inherently inexact, so
it's unsurprising that we sometimes fail to detect that the input
is exactly a horizontal or vertical line. I don't think there is
anything to be done here that wouldn't end in making things worse.

regards, tom lane

#4Ranier Vilela
ranier.vf@gmail.com
In reply to: Tom Lane (#2)
Re: Aggregate Function corr does not always return the correct value

Em ter., 26 de ago. de 2025 às 14:34, Tom Lane <tgl@sss.pgh.pa.us> escreveu:

Maxim Orlov <orlovmg@gmail.com> writes:

One of the clients complained as to why the query for calculating the
correlation coefficient with the CORR function yielded such weird
results. After a little analysis, it was discovered that they were
calculating the correlation coefficient for two sets, one of which is
more or less random and the other of which is simply a set of constant
values (0.09 if that matters). As a result, they were attaining
unexpected results. However, as far as I am aware, they should have
received NULL because it is impossible to calculate the standard
deviation for such a set.

[ shrug... ] Calculations with float8 are inherently inexact, so
it's unsurprising that we sometimes fail to detect that the input
is exactly a horizontal or vertical line. I don't think there is
anything to be done here that wouldn't end in making things worse.

With the below checking

if (Sxx == 0.0 && Syy == 0.0)
PG_RETURN_NULL();

This test returns NaN
WITH dataset AS (SELECT x, 0.125 AS y FROM generate_series(0, 5) AS x)
SELECT corr(x, y) FROM dataset;

But I can't say if this answer (NaN) makes things worse.

best regards,
Ranier Vilela