IS NOT DISTINCT FROM + Indexing

Started by Jonathan S. Katzover 11 years ago10 messages
#1Jonathan S. Katz
jonathan.katz@excoventures.com

Hi,

I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an indexable operation in a B-tree index, as it is effectively testing for equality albeit with some "magic" for NULLs? Here is an example of what I mean, running tests on 9.3.4:

-- create a table of integers
CREATE TABLE numbers AS
SELECT x FROM generate_series(1,1000000) x;

-- create a b-tree index
CREATE INDEX numbers_x_idx ON numbers (x);

-- find x = 500
SELECT * FROM numbers WHERE x = 500;
x
-----
500
(1 row)

-- query plan
EXPLAIN SELECT * FROM numbers WHERE x = 500;
QUERY PLAN
----------------------------------------------------------------------------------
Index Only Scan using numbers_x_idx on numbers (cost=0.42..8.44 rows=1 width=4)
Index Cond: (x = 500)
(2 rows)

-- now find x IS NOT DISTINCT FROM 500
SELECT * FROM numbers WHERE x IS NOT DISTINCT FROM 500;
x
-----
500
(1 row)

-- but the query plan is...
EXPLAIN SELECT * FROM numbers WHERE x IS NOT DISTINCT FROM 500;
QUERY PLAN
-----------------------------------------------------------
Seq Scan on numbers (cost=0.00..16925.00 rows=1 width=4)
Filter: (NOT (x IS DISTINCT FROM 500))

With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the index?

Thanks,

Jonathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Peter Geoghegan
pg@heroku.com
In reply to: Jonathan S. Katz (#1)
Re: IS NOT DISTINCT FROM + Indexing

On Mon, Jul 21, 2014 at 4:16 PM, Jonathan S. Katz
<jonathan.katz@excoventures.com> wrote:

With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the index?

FWIW this works:

postgres=# explain analyze select * from orders where orderid in (5, null);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using orders_pkey on orders (cost=0.29..12.60 rows=1
width=60) (actual time=0.019..0.021 rows=1 loops=1)
Index Cond: (orderid = ANY ('{5,NULL}'::integer[]))
Planning time: 0.100 ms
Execution time: 0.416 ms
(4 rows)

I think that it would almost be a Simple Matter of Programming to make
IS NOT DISTINCT FROM indexable. Under the hood, IS DISTINCT FROM isn't
very different to using the equality operator:

/*
* DistinctExpr - expression node for "x IS DISTINCT FROM y"
*
* Except for the nodetag, this is represented identically to an OpExpr
* referencing the "=" operator for x and y.
* We use "=", not the more obvious "<>", because more datatypes have "="
* than "<>". This means the executor must invert the operator result.
* Note that the operator function won't be called at all if either input
* is NULL, since then the result can be determined directly.
*/
typedef OpExpr DistinctExpr;

We're already inverting the equals operator. But that isn't
necessarily how a B-Tree index represents equality (that is, a
particular B-Tree operator class could have a non-'=' operator that it
thinks of as equality-ish - in general that could even be the default
B-Tree opclass and there may not be an equals operator). The fact that
most types think of the '=' equals operator as equality is just a
convention, and so technically IS DISTINCT FROM doesn't invert B-Tree
operation 3. See "31.14. Interfacing Extensions To Indexes" for
details. The equals operator '=' isn't really supposed to be magic, it
just is in some places.

Right now the executor is directly inverting the equality operator to
make this work (and has done so since long before NULLs were
indexable). This is a bit of a kludge. I guess it just works that way
because there is no convenient place to insert the special inversion
of the operator, and the special NULL handling that currently appears
within ExecEvalDistinct().

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#2)
Re: IS NOT DISTINCT FROM + Indexing

On 2014-07-21 16:51:32 -0700, Peter Geoghegan wrote:

On Mon, Jul 21, 2014 at 4:16 PM, Jonathan S. Katz
<jonathan.katz@excoventures.com> wrote:

With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the index?

FWIW this works:

postgres=# explain analyze select * from orders where orderid in (5, null);

I rather doubt it will. x in (y1, ... yn) is essentially expanded to x =
y1 OR x = y2, ... OR x = yn. I.e. the NULL comparison will be done using
normal equality comparison and thus not return a row with a NULL
orderid. Am I missing something?

I think that it would almost be a Simple Matter of Programming to make
IS NOT DISTINCT FROM indexable. Under the hood, IS DISTINCT FROM isn't
very different to using the equality operator:

But yea, it probably wouldn't take very much for that.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Peter Geoghegan
pg@heroku.com
In reply to: Andres Freund (#3)
Re: IS NOT DISTINCT FROM + Indexing

On Mon, Jul 21, 2014 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:

I rather doubt it will. x in (y1, ... yn) is essentially expanded to x =
y1 OR x = y2, ... OR x = yn. I.e. the NULL comparison will be done using
normal equality comparison and thus not return a row with a NULL
orderid. Am I missing something?

I was a little bit incautious in my use of words. The point is that a
scanKey could easily represent a ScalarArrayOpExpr with NULLs and
non-NULLs.
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#1)
Re: IS NOT DISTINCT FROM + Indexing

"Jonathan S. Katz" <jonathan.katz@excoventures.com> writes:

I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an
indexable operation in a B-tree index,

The short reason why not is that it's not an operator (where "operator"
is defined as "something with a pg_operator entry"), and all our indexing
infrastructure is built around the notion that indexable clauses are of
the form "indexed_column indexable_operator comparison_value".

You could certainly imagine ways to fix that, but nobody's put in the
probably-nontrivial effort required to do so. The btree code itself
would likely be the easiest part to fix, as it sort of thinks nulls
are real values already.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Jonathan S. Katz
jonathan.katz@excoventures.com
In reply to: Tom Lane (#5)
Re: IS NOT DISTINCT FROM + Indexing

On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Jonathan S. Katz" <jonathan.katz@excoventures.com> writes:

I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an
indexable operation in a B-tree index,

The short reason why not is that it's not an operator (where "operator"
is defined as "something with a pg_operator entry"), and all our indexing
infrastructure is built around the notion that indexable clauses are of
the form "indexed_column indexable_operator comparison_value".

You could certainly imagine ways to fix that, but nobody's put in the
probably-nontrivial effort required to do so. The btree code itself
would likely be the easiest part to fix, as it sort of thinks nulls
are real values already.

What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS NOT DISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on the surface. It sounds like the IS NULL work is in the btree code?

Even if it is trivial, it would be tough for me personally to hack on without some hand-holding. I did want to ask about it because it can be useful in simplifying some queries when you have do deal with NULLs (though in reality I tend to use IS DISTINCT FROM much more, though in things like triggers) and would be useful with exclusion constraints (though with those it sounds like it would have to be an operator?).

If it is a small project someone is interested, I would be happy to contribute by testing.

Jonathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#6)
Re: IS NOT DISTINCT FROM + Indexing

"Jonathan S. Katz" <jonathan.katz@excoventures.com> writes:

On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The short reason why not is that it's not an operator (where "operator"
is defined as "something with a pg_operator entry"), and all our indexing
infrastructure is built around the notion that indexable clauses are of
the form "indexed_column indexable_operator comparison_value".

What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS NOT DISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on the surface. It sounds like the IS NULL work is in the btree code?

We hacked in IS [NOT] NULL as a potentially indexable construct, but the
key thing that made that possible without major redesign is that IS [NOT]
NULL is datatype independent, so there's no need to identify any
particular underlying operator or opclass. I'm not sure what we'd do to
handle IS [NOT] DISTINCT FROM, but that particular approach ain't gonna
cut it.

Another point is that people are unlikely to be satisfied with planner
optimization for IS NOT DISTINCT FROM that doesn't support it as a join
clause (i.e., tab1.col1 IS NOT DISTINCT FROM tab2.col2); which is an issue
that doesn't arise for IS [NOT] NULL, as it has only one argument. So
that brings you into not just indexability but hashing and merging
support. I hasten to say that that doesn't necessarily have to happen
in a version-zero patch; but trying to make IS NOT DISTINCT FROM into
a first-class construct is a big project.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Jonathan S. Katz
jonathan.katz@excoventures.com
In reply to: Tom Lane (#7)
Re: IS NOT DISTINCT FROM + Indexing

On Jul 22, 2014, at 12:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Jonathan S. Katz" <jonathan.katz@excoventures.com> writes:

On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The short reason why not is that it's not an operator (where "operator"
is defined as "something with a pg_operator entry"), and all our indexing
infrastructure is built around the notion that indexable clauses are of
the form "indexed_column indexable_operator comparison_value".

What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS NOT DISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on the surface. It sounds like the IS NULL work is in the btree code?

We hacked in IS [NOT] NULL as a potentially indexable construct, but the
key thing that made that possible without major redesign is that IS [NOT]
NULL is datatype independent, so there's no need to identify any
particular underlying operator or opclass. I'm not sure what we'd do to
handle IS [NOT] DISTINCT FROM, but that particular approach ain't gonna
cut it.

Another point is that people are unlikely to be satisfied with planner
optimization for IS NOT DISTINCT FROM that doesn't support it as a join
clause (i.e., tab1.col1 IS NOT DISTINCT FROM tab2.col2); which is an issue
that doesn't arise for IS [NOT] NULL, as it has only one argument. So
that brings you into not just indexability but hashing and merging
support. I hasten to say that that doesn't necessarily have to happen
in a version-zero patch; but trying to make IS NOT DISTINCT FROM into
a first-class construct is a big project.

Well that definitely answers "how hard would it be." - before embarking on something laborious (as even just indexing is nontrivial), I think it would be good to figure out how people are using IS [NOT] DISTINCT FROM and if there is interest in having it be indexable, let alone used in a JOIN optimization. It could become a handy tool to simplify the SQL in queries that are returning a lot of NULL / NOT NULL data mixed together.

Jonathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jonathan S. Katz (#8)
Re: IS NOT DISTINCT FROM + Indexing

Jonathan S. Katz wrote:

Well that definitely answers "how hard would it be." - before
embarking on something laborious (as even just indexing is
nontrivial), I think it would be good to figure out how people are
using IS [NOT] DISTINCT FROM and if there is interest in having it be
indexable, let alone used in a JOIN optimization. It could become a
handy tool to simplify the SQL in queries that are returning a lot of
NULL / NOT NULL data mixed together.

FWIW my project (abandoned for now, but I intend to get back to it
someday) to catalog NOT NULL constraints in pg_constraint had me
rewriting them into some kind of IS DISTINCT FROM NULL expression. (It
was IS NOT NULL initially until somebody pointed out that that wouldn't
work for composite type columns). I'm not sure how that interacts with
what you're doing here, maybe not at all; I thought I'd mention it.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Kevin Grittner
kgrittn@ymail.com
In reply to: Jonathan S. Katz (#8)
Re: IS NOT DISTINCT FROM + Indexing

Jonathan S. Katz <jonathan.katz@excoventures.com> wrote:

before embarking on something laborious (as even just indexing
is nontrivial), I think it would be good to figure out how people
are using IS [NOT] DISTINCT FROM and if there is interest in
having it be indexable, let alone used in a JOIN optimization.
It could become a handy tool to simplify the SQL in queries that
are returning a lot of NULL / NOT NULL data mixed together.

To prevent subtle inconsistencies, I think we would need to limit
support to data types with a btree opclass which uses "=" as the
equality operator on indexes using that opclass (either by default
or explicitly).  That limitation would still allow a lot of useful
cases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers