BUG #14648: counts for queries using array unnesting is incorrect
The following bug has been logged on the website:
Bug reference: 14648
Logged by: Sebastian Calbaza
Email address: sebastian.calbaza@hgdata.com
PostgreSQL version: 9.6.2
Operating system: Ubuntu 14.04 AWS
Description:
Below is a query that we are using to calculate some counts:
* first version of the query lacks ```unnest(ids) as id,```, but second
one has it
* ```companies``` count value is incorrect for the second one, first
query has the correct value
```
mydb=# select count(company) as available,count(distinct
matchedCompany) as matchedCompanies,count(distinct company) as companies
from
(
SELECT F.urlx as company,
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url as matchedCompany
from
(select unnest(urls) as urlx from
hg_data_discovery_2017_04_10.GroupedFirmographics
where
(
(TRUE
AND TRUE
AND revenueRangeMin >= 1
AND employeesRangeMin >= 1
AND revenueRangeMax <= 1783792664
AND employeesRangeMax <= 4999000
)
OR
FALSE
)
) as I
inner join (
select unnest(urls) as urlx from
hg_data_discovery_2017_04_10.GroupedInstallsWithoutDateSignalScore where
productId IN (562) and signalScoreId IN (1,2,3)
) as F using(urlx)
left outer join CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg on
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url=F.urlx where TRUE and TRUE
) as P
;
available | matchedcompanies | companies
-----------+------------------+-----------
496493 | 28503 | 495799
(1 row)
Time: 7974.053 ms
mydb=# select count(id) as people, count(company) as
available,count(distinct matchedCompany) as matchedCompanies,count(distinct
company) as companies from
(
SELECT unnest(ids) as id, F.urlx as company,
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url as matchedCompany
from
(select unnest(urls) as urlx from
hg_data_discovery_2017_04_10.GroupedFirmographics
where
(
(TRUE
AND TRUE
AND revenueRangeMin >= 1
AND employeesRangeMin >= 1
AND revenueRangeMax <= 1783792664
AND employeesRangeMax <= 4999000
)
OR
FALSE
)
) as I
inner join (
select unnest(urls) as urlx from
hg_data_discovery_2017_04_10.GroupedInstallsWithoutDateSignalScore where
productId IN (562) and signalScoreId IN (1,2,3)
) as F using(urlx)
left outer join CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg on
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url=F.urlx where TRUE and TRUE
) as P
;
people | available | matchedcompanies | companies
--------+-----------+------------------+-----------
689905 | 689905 | 28503 | 28503
```
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
sebastian.calbaza@hgdata.com writes:
Below is a query that we are using to calculate some counts:
* first version of the query lacks ```unnest(ids) as id,```, but second
one has it
* ```companies``` count value is incorrect for the second one, first
query has the correct value
My first suggestion would be to see if updating to 9.6.3 fixes it.
If not, please try to create a self-contained test case. These
queries are unreadable, and without the underlying data, nobody
else can even tell whether the answers are wrong or not.
https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Thanks for replying.... I know it's complicated to follow it (even if
formatted ) , but probably I did not emphasize enough that by using
unnest(ids) in the second query, the distinct count for company is not
computed correctly anymore.
Focusing on the used select clauses is important, they are just distinct
counts, it would have been logical to stay the same for both queries,
ignoring the unnest(ids)(the data set is in the milions of rows so you need
to take my word that the first query return the correct values :) ).
Anyway, will try to test it with latest version, also will try to provide
some sql test case for this, but again... There surely is an issue with the
unnesting arrays (at least in the relase that I'm using)
Seb
On Fri, May 12, 2017 at 5:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
sebastian.calbaza@hgdata.com writes:
Below is a query that we are using to calculate some counts:
* first version of the query lacks ```unnest(ids) as id,```, butsecond
one has it
* ```companies``` count value is incorrect for the second one, first
query has the correct valueMy first suggestion would be to see if updating to 9.6.3 fixes it.
If not, please try to create a self-contained test case. These
queries are unreadable, and without the underlying data, nobody
else can even tell whether the answers are wrong or not.https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
regards, tom lane
On Fri, May 12, 2017 at 7:55 AM, Sebastion Calbaza <
sebastian.calbaza@hgdata.com> wrote:
Thanks for replying.... I know it's complicated to follow it (even if
formatted ) , but probably I did not emphasize enough that by using
unnest(ids) in the second query, the distinct count for company is not
computed correctly anymore.
Focusing on the used select clauses is important, they are just distinct
counts, it would have been logical to stay the same for both queries,
ignoring the unnest(ids)(the data set is in the milions of rows so you need
to take my word that the first query return the correct values :) ).
In the first query the number of distinct companies is greater than the
number of matched companies. In the second the numbers are equal. If the
only difference between the two is the "unnest(ids)" then its presence is
causing every unmatched company to be discarded from the result. Since
"SELECT unnest(null::text[])" is the empty set it would remove the
corresponding row from your subquery output. You might try writing the
following which will convert the empty set to a NULL and thus not discard
records.
(SELECT unnest(ids)) AS id, F.urlx as company ...
David J.
You are correct, this is how it behaves.
For my dev mind (probably for others ) this was pretty non intuitive. In
the end I used array_length and sum to compute the count.
Still, is this the intended behaviour (I guess so from a mathematical
point of view ?? )? If it is I can argue that is not too intuitive and
adding a doc note near the unnest function description would be good.
Thanks a lot,
Sebi C.
On Fri, May 12, 2017 at 6:46 PM, David G. Johnston <
david.g.johnston@gmail.com> wrote:
Show quoted text
On Fri, May 12, 2017 at 7:55 AM, Sebastion Calbaza <
sebastian.calbaza@hgdata.com> wrote:Thanks for replying.... I know it's complicated to follow it (even if
formatted ) , but probably I did not emphasize enough that by using
unnest(ids) in the second query, the distinct count for company is not
computed correctly anymore.
Focusing on the used select clauses is important, they are just distinct
counts, it would have been logical to stay the same for both queries,
ignoring the unnest(ids)(the data set is in the milions of rows so you need
to take my word that the first query return the correct values :) ).In the first query the number of distinct companies is greater than the
number of matched companies. In the second the numbers are equal. If the
only difference between the two is the "unnest(ids)" then its presence is
causing every unmatched company to be discarded from the result. Since
"SELECT unnest(null::text[])" is the empty set it would remove the
corresponding row from your subquery output. You might try writing the
following which will convert the empty set to a NULL and thus not discard
records.(SELECT unnest(ids)) AS id, F.urlx as company ...
David J.
On Sun, May 14, 2017 at 11:29 PM, Sebastion Calbaza <
sebastian.calbaza@hgdata.com> wrote:
You are correct, this is how it behaves.
For my dev mind (probably for others ) this was pretty non intuitive. In
the end I used array_length and sum to compute the count.Still, is this the intended behaviour (I guess so from a mathematical
point of view ?? )? If it is I can argue that is not too intuitive and
adding a doc note near the unnest function description would be good.
Yes, its intentional, and it isn't specific to just the unnest function so
documenting it just there doesn't seem correct. I don't know where it is
documented but I suspect that even just reading the documentation for this
would be of minimal help - I think its likely best learned after
experiencing the aforementioned problem.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
Yes, its intentional, and it isn't specific to just the unnest function so
documenting it just there doesn't seem correct. I don't know where it is
documented but I suspect that even just reading the documentation for this
would be of minimal help - I think its likely best learned after
experiencing the aforementioned problem.
https://www.postgresql.org/docs/current/static/xfunc-sql.html#XFUNC-SQL-FUNCTIONS-RETURNING-SET
The last example in section 36.4.8 covers this specifically.
I've felt more than once that having these sorts of details about function
semantics underneath the "extending SQL" chapter isn't right, but I'm not
sure what organization would be better. A lot of the examples would be
tough to do without use of custom-made functions.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs