BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crash with "ERROR: unknown error"
The following bug has been logged on the website:
Bug reference: 14471
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.1
Operating system: Debian jessie
Description:
I checked stackoverflow and I found several questions complaining about
problems with "NOT IN" causing problems in selects.
Looks like PostgreSQL 9.6 now very probably has some bug connected with "NOT
IN" command. We have select which actually crashes after several hours (!)
of run with message "ERROR: unknown error". Although according to explain
plan it should be done in ~40 minutes.
It is also possible that problem happens only when working with partitioned
tables. We have several sets of partitioned tables. Partitions are by months
- data are used for billing calculations.
Our select which crashes pg 9.6 is like this:
select <some columns>
from table_1_partition t1
inner join table_2 t2 --also partitioned table
on ....
where (t1.col1, t2.col1, t1.col2) NOT IN
(select col1, col2, col3
from table_3 --also partitioned table
group by col1, col2, col3)
When I remove NOT IN part everything works OK. Of course I replaced this
problematic part with NOT EXISTS construct. But "unknown error" crash with
"NOT IN" command is very unfortunate.
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
josef.machytka@gmail.com writes:
Looks like PostgreSQL 9.6 now very probably has some bug connected with "NOT
IN" command. We have select which actually crashes after several hours (!)
of run with message "ERROR: unknown error". Although according to explain
plan it should be done in ~40 minutes.
This is an interesting report, but without enough information to replicate
the problem, we're unlikely to be able to help you. A fragment of a
query, with zero information about the underlying tables, is far from
enough.
The only occurrences of the string "unknown error" that I can find in the
source code are in dblink and postgres_fdw (both reflecting cases where
the remote server did not return an error message, itself a "shouldn't
happen" situation). If you were using either, you didn't say so; but if
you were, maybe taking a look in the remote server's log would be useful.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Yes, I am sorry, dblink is involved - I just did not see it as significant.
We start several processes in parallel to speed up whole billing
calculation otherwise it would take 10+ hours to calculate everything in
serial.
Ok, so at least "unknown error" is explained.
But problem with "NOT IN" remains.
When I replaced "NOT IN" with "NOT EXISTS" query ended after ~3 hours
without any problems. Even over dblink.
Partitions have from 1M to 3M records each and now we use partitions for
present year only.
So it is not small task but also not something really big (we work with
much more data when we calculate statistics from web metrics).
And I run billing on 2 different GCE instances with pg 9.6.1 to have
comparison. Behaviour was the same.
Unfortunately data are confident so it would be quite hard to give you some
access to test it even with anonymous data because even structure of the
query contains a lot of know-how.
Therefore I could sent only so crazy looking skeleton of the query. Sorry
about it.
On 21 December 2016 at 16:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
josef.machytka@gmail.com writes:
Looks like PostgreSQL 9.6 now very probably has some bug connected with
"NOT
IN" command. We have select which actually crashes after several hours
(!)
of run with message "ERROR: unknown error". Although according to
explain
plan it should be done in ~40 minutes.
This is an interesting report, but without enough information to replicate
the problem, we're unlikely to be able to help you. A fragment of a
query, with zero information about the underlying tables, is far from
enough.The only occurrences of the string "unknown error" that I can find in the
source code are in dblink and postgres_fdw (both reflecting cases where
the remote server did not return an error message, itself a "shouldn't
happen" situation). If you were using either, you didn't say so; but if
you were, maybe taking a look in the remote server's log would be useful.regards, tom lane
Josef Machytka <josef.machytka@gmail.com> writes:
Yes, I am sorry, dblink is involved - I just did not see it as significant.
We start several processes in parallel to speed up whole billing
calculation otherwise it would take 10+ hours to calculate everything in
serial.
Ok, so at least "unknown error" is explained.
Well, we have a theory about where it came from, but still not enough
information to improve the behavior. Did you look to see what happened
on the remote server?
But problem with "NOT IN" remains.
When I replaced "NOT IN" with "NOT EXISTS" query ended after ~3 hours
without any problems. Even over dblink.
You do know that NOT IN and NOT EXISTS behave quite differently with
respect to nulls? I'm suspicious that the real problem here is that
your query is just wrong when written with NOT IN, and it specifies
some unreasonable amount of computation. Possibly something is running
out of memory and not dealing with the case very well, leading to the
unhelpful error message.
FWIW, just about every bug report I've ever seen about NOT IN boiled
down to the complainant's subquery returning one or more nulls and
the complainant not understanding what will happen if it does.
Unfortunately, that's not a bug, it's the behavior required by the
SQL standard.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
OK, this sounds very interesting.
We already know about many different problems with NULL values because we
use heavily different GROUP BYs and WINDOW functions so we replace NULL's
everywhere with 'unknown' or similar.
But maybe there is some problem in data import. I will check data.
Thanks and I will let you know what I have found.
On 21 December 2016 at 17:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Josef Machytka <josef.machytka@gmail.com> writes:
Yes, I am sorry, dblink is involved - I just did not see it as
significant.
We start several processes in parallel to speed up whole billing
calculation otherwise it would take 10+ hours to calculate everything in
serial.
Ok, so at least "unknown error" is explained.Well, we have a theory about where it came from, but still not enough
information to improve the behavior. Did you look to see what happened
on the remote server?But problem with "NOT IN" remains.
When I replaced "NOT IN" with "NOT EXISTS" query ended after ~3 hours
without any problems. Even over dblink.You do know that NOT IN and NOT EXISTS behave quite differently with
respect to nulls? I'm suspicious that the real problem here is that
your query is just wrong when written with NOT IN, and it specifies
some unreasonable amount of computation. Possibly something is running
out of memory and not dealing with the case very well, leading to the
unhelpful error message.FWIW, just about every bug report I've ever seen about NOT IN boiled
down to the complainant's subquery returning one or more nulls and
the complainant not understanding what will happen if it does.
Unfortunately, that's not a bug, it's the behavior required by the
SQL standard.regards, tom lane
Hi again.
So I checked all logs and it turned out one of databases involved vent into
recovery mode because some of its connections was killed.
Probably out of memory killer did it. So it caused chain reaction. And
billing query was canceled "due to administration command" and it was shown
through dblink as "unknown error".
Which looked quite horrible...
Regarding NOT IN - you were right there are NULL values.
Best regards
On 21 December 2016 at 17:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Josef Machytka <josef.machytka@gmail.com> writes:
Yes, I am sorry, dblink is involved - I just did not see it as
significant.
We start several processes in parallel to speed up whole billing
calculation otherwise it would take 10+ hours to calculate everything in
serial.
Ok, so at least "unknown error" is explained.Well, we have a theory about where it came from, but still not enough
information to improve the behavior. Did you look to see what happened
on the remote server?But problem with "NOT IN" remains.
When I replaced "NOT IN" with "NOT EXISTS" query ended after ~3 hours
without any problems. Even over dblink.You do know that NOT IN and NOT EXISTS behave quite differently with
respect to nulls? I'm suspicious that the real problem here is that
your query is just wrong when written with NOT IN, and it specifies
some unreasonable amount of computation. Possibly something is running
out of memory and not dealing with the case very well, leading to the
unhelpful error message.FWIW, just about every bug report I've ever seen about NOT IN boiled
down to the complainant's subquery returning one or more nulls and
the complainant not understanding what will happen if it does.
Unfortunately, that's not a bug, it's the behavior required by the
SQL standard.regards, tom lane
Josef Machytka <josef.machytka@gmail.com> writes:
So I checked all logs and it turned out one of databases involved vent into
recovery mode because some of its connections was killed.
Probably out of memory killer did it. So it caused chain reaction. And
billing query was canceled "due to administration command" and it was shown
through dblink as "unknown error".
OK. In that case I'm betting that what dblink saw was a network error on
its next attempt to send a command. I've not dug into the code to verify
this positively, but I think the behavior will be improved by Joe Conway's
recent patch,
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2f802d95b4904dbed3dfdca1b3a607cd085d2e20
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On 12/23/2016 01:08 AM, Josef Machytka wrote:
Hi again.
So I checked all logs and it turned out one of databases involved vent
into recovery mode because some of its connections was killed.
Probably out of memory killer did it. So it caused chain reaction. And
billing query was canceled "due to administration command" and it was
shown through dblink as "unknown error".
Which looked quite horrible...
FYI, I committed a fix for this since the initial report.
Before:
----------
test=# -- kill the remote side manually
test=# select * from dblink('dbname=test','select pg_sleep(30)');
ERROR: unknown error
CONTEXT: Error occurred on dblink connection named "unnamed": could not
execute query.
After:
----------
test=# -- kill the remote side manually
test=# select * from dblink('dbname=test','select pg_sleep(30)');
ERROR: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
CONTEXT: Error occurred on dblink connection named "unnamed": could not
execute query.
Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development