BUG #14648: counts for queries using array unnesting is incorrect

Started by Sebastion Calbazaalmost 9 years ago7 messagesbugs
Jump to latest
#1Sebastion Calbaza
sebastian.calbaza@hgdata.com

The following bug has been logged on the website:

Bug reference: 14648
Logged by: Sebastian Calbaza
Email address: sebastian.calbaza@hgdata.com
PostgreSQL version: 9.6.2
Operating system: Ubuntu 14.04 AWS
Description:

Below is a query that we are using to calculate some counts:
* first version of the query lacks ```unnest(ids) as id,```, but second
one has it
* ```companies``` count value is incorrect for the second one, first
query has the correct value

```
mydb=# select count(company) as available,count(distinct
matchedCompany) as matchedCompanies,count(distinct company) as companies
from
(
SELECT F.urlx as company,
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url as matchedCompany
from
(select unnest(urls) as urlx from
hg_data_discovery_2017_04_10.GroupedFirmographics
where
(
(TRUE
AND TRUE
AND revenueRangeMin >= 1
AND employeesRangeMin >= 1
AND revenueRangeMax <= 1783792664
AND employeesRangeMax <= 4999000
)
OR
FALSE
)
) as I
inner join (
select unnest(urls) as urlx from
hg_data_discovery_2017_04_10.GroupedInstallsWithoutDateSignalScore where
productId IN (562) and signalScoreId IN (1,2,3)
) as F using(urlx)
left outer join CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg on
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url=F.urlx where TRUE and TRUE
) as P
;
available | matchedcompanies | companies
-----------+------------------+-----------
496493 | 28503 | 495799
(1 row)

Time: 7974.053 ms
mydb=# select count(id) as people, count(company) as
available,count(distinct matchedCompany) as matchedCompanies,count(distinct
company) as companies from
(
SELECT unnest(ids) as id, F.urlx as company,
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url as matchedCompany
from
(select unnest(urls) as urlx from
hg_data_discovery_2017_04_10.GroupedFirmographics
where
(
(TRUE
AND TRUE
AND revenueRangeMin >= 1
AND employeesRangeMin >= 1
AND revenueRangeMax <= 1783792664
AND employeesRangeMax <= 4999000
)
OR
FALSE
)
) as I
inner join (
select unnest(urls) as urlx from
hg_data_discovery_2017_04_10.GroupedInstallsWithoutDateSignalScore where
productId IN (562) and signalScoreId IN (1,2,3)
) as F using(urlx)
left outer join CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg on
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url=F.urlx where TRUE and TRUE
) as P
;
people | available | matchedcompanies | companies
--------+-----------+------------------+-----------
689905 | 689905 | 28503 | 28503

```

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sebastion Calbaza (#1)
Re: BUG #14648: counts for queries using array unnesting is incorrect

sebastian.calbaza@hgdata.com writes:

Below is a query that we are using to calculate some counts:
* first version of the query lacks ```unnest(ids) as id,```, but second
one has it
* ```companies``` count value is incorrect for the second one, first
query has the correct value

My first suggestion would be to see if updating to 9.6.3 fixes it.
If not, please try to create a self-contained test case. These
queries are unreadable, and without the underlying data, nobody
else can even tell whether the answers are wrong or not.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Sebastion Calbaza
sebastian.calbaza@hgdata.com
In reply to: Tom Lane (#2)
Re: BUG #14648: counts for queries using array unnesting is incorrect

Thanks for replying.... I know it's complicated to follow it (even if
formatted ) , but probably I did not emphasize enough that by using
unnest(ids) in the second query, the distinct count for company is not
computed correctly anymore.
Focusing on the used select clauses is important, they are just distinct
counts, it would have been logical to stay the same for both queries,
ignoring the unnest(ids)(the data set is in the milions of rows so you need
to take my word that the first query return the correct values :) ).

Anyway, will try to test it with latest version, also will try to provide
some sql test case for this, but again... There surely is an issue with the
unnesting arrays (at least in the relase that I'm using)

Seb

On Fri, May 12, 2017 at 5:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

sebastian.calbaza@hgdata.com writes:

Below is a query that we are using to calculate some counts:
* first version of the query lacks ```unnest(ids) as id,```, but

second

one has it
* ```companies``` count value is incorrect for the second one, first
query has the correct value

My first suggestion would be to see if updating to 9.6.3 fixes it.
If not, please try to create a self-contained test case. These
queries are unreadable, and without the underlying data, nobody
else can even tell whether the answers are wrong or not.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane

#4David G. Johnston
david.g.johnston@gmail.com
In reply to: Sebastion Calbaza (#3)
Re: BUG #14648: counts for queries using array unnesting is incorrect

On Fri, May 12, 2017 at 7:55 AM, Sebastion Calbaza <
sebastian.calbaza@hgdata.com> wrote:

Thanks for replying.... I know it's complicated to follow it (even if
formatted ) , but probably I did not emphasize enough that by using
unnest(ids) in the second query, the distinct count for company is not
computed correctly anymore.
Focusing on the used select clauses is important, they are just distinct
counts, it would have been logical to stay the same for both queries,
ignoring the unnest(ids)(the data set is in the milions of rows so you need
to take my word that the first query return the correct values :) ).

​In the first query the number of distinct companies ​is greater than the
number of matched companies. In the second the numbers are equal. If the
only difference between the two is the "unnest(ids)" then its presence is
causing every unmatched company to be discarded from the result. Since
"SELECT unnest(null::text[])" is the empty set it would remove the
corresponding row from your subquery output. You might try writing the
following which will convert the empty set to a NULL and thus not discard
records.

(SELECT unnest(ids)) AS id, F.urlx as company ...

David J.

#5Sebastion Calbaza
sebastian.calbaza@hgdata.com
In reply to: David G. Johnston (#4)
Re: BUG #14648: counts for queries using array unnesting is incorrect

You are correct, this is how it behaves.
For my dev mind (probably for others ) this was pretty non intuitive. In
the end I used array_length and sum to compute the count.

Still, is this the intended behaviour (I guess so from a mathematical
point of view ?? )? If it is I can argue that is not too intuitive and
adding a doc note near the unnest function description would be good.

Thanks a lot,
Sebi C.

On Fri, May 12, 2017 at 6:46 PM, David G. Johnston <
david.g.johnston@gmail.com> wrote:

Show quoted text

On Fri, May 12, 2017 at 7:55 AM, Sebastion Calbaza <
sebastian.calbaza@hgdata.com> wrote:

Thanks for replying.... I know it's complicated to follow it (even if
formatted ) , but probably I did not emphasize enough that by using
unnest(ids) in the second query, the distinct count for company is not
computed correctly anymore.
Focusing on the used select clauses is important, they are just distinct
counts, it would have been logical to stay the same for both queries,
ignoring the unnest(ids)(the data set is in the milions of rows so you need
to take my word that the first query return the correct values :) ).

​In the first query the number of distinct companies ​is greater than the
number of matched companies. In the second the numbers are equal. If the
only difference between the two is the "unnest(ids)" then its presence is
causing every unmatched company to be discarded from the result. Since
"SELECT unnest(null::text[])" is the empty set it would remove the
corresponding row from your subquery output. You might try writing the
following which will convert the empty set to a NULL and thus not discard
records.

(SELECT unnest(ids)) AS id, F.urlx as company ...

David J.

#6David G. Johnston
david.g.johnston@gmail.com
In reply to: Sebastion Calbaza (#5)
Re: BUG #14648: counts for queries using array unnesting is incorrect

On Sun, May 14, 2017 at 11:29 PM, Sebastion Calbaza <
sebastian.calbaza@hgdata.com> wrote:

You are correct, this is how it behaves.
For my dev mind (probably for others ) this was pretty non intuitive. In
the end I used array_length and sum to compute the count.

Still, is this the intended behaviour (I guess so from a mathematical
point of view ?? )? If it is I can argue that is not too intuitive and
adding a doc note near the unnest function description would be good.

Yes, its intentional, and it isn't specific to just the unnest function so
documenting it just there doesn't seem correct. I don't know where it is
documented but I suspect that even just reading the documentation for this
would be of minimal help - I think its likely best learned after
experiencing the aforementioned problem.

David J.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#6)
Re: BUG #14648: counts for queries using array unnesting is incorrect

"David G. Johnston" <david.g.johnston@gmail.com> writes:

Yes, its intentional, and it isn't specific to just the unnest function so
documenting it just there doesn't seem correct. I don't know where it is
documented but I suspect that even just reading the documentation for this
would be of minimal help - I think its likely best learned after
experiencing the aforementioned problem.

https://www.postgresql.org/docs/current/static/xfunc-sql.html#XFUNC-SQL-FUNCTIONS-RETURNING-SET

The last example in section 36.4.8 covers this specifically.

I've felt more than once that having these sorts of details about function
semantics underneath the "extending SQL" chapter isn't right, but I'm not
sure what organization would be better. A lot of the examples would be
tough to do without use of custom-made functions.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs