Wrapping a where clause to preserve rows with nulls

Started by Adrian Garcia Badaraccoover 1 year ago6 messagesgeneral
Jump to latest
#1Adrian Garcia Badaracco
adrian@adriangb.com

I have a query where I have to run a where clause generated by another
system (i.e., I can't modify that where clause. The where clause may return
`null`, but I actually want to keep rows that return `null` (and rows that
return `true` but not rows that return `false`).

I thought it would be as simple as wrapping in `(...) is not false` but
that seems to prevent index usage.

For example, let's say that given the table:

CREATE TABLE test_index (value INTEGER);
CREATE INDEX idx_value ON test_index(value);

And the predicate `value = 5000`, if I run the original query that excludes
rows where the predicate is null it uses the index:

SELECT *
FROM test_index
WHERE value = 5000;

But as soon as I tack on an `IS NOT FALSE` the index is not used:

SELECT *
FROM test_index
WHERE (value = 5000) IS NOT FALSE;

This was surprising to me. I was hoping this might be able to use the index.

Is there any way to include the rows where the predicate evaluates to null
while still using an index?

#2Adrian Garcia Badaracco
adrian@adriangb.com
In reply to: Adrian Garcia Badaracco (#1)
Re: Wrapping a where clause to preserve rows with nulls

I'll note that the clause is arbitrary in the sense that I don't generate
it and cannot edit it but it's basically a bunch of boolean comparisons
chained i.e. `col_a >= 1 and col_b <=5 and col_c ...` so I can in general
add an index on say col_a and it does get used.

On Wed, Dec 18, 2024 at 9:47 PM Adrian Garcia Badaracco <adrian@adriangb.com>
wrote:

Show quoted text

I have a query where I have to run a where clause generated by another
system (i.e., I can't modify that where clause. The where clause may return
`null`, but I actually want to keep rows that return `null` (and rows that
return `true` but not rows that return `false`).

I thought it would be as simple as wrapping in `(...) is not false` but
that seems to prevent index usage.

For example, let's say that given the table:

CREATE TABLE test_index (value INTEGER);
CREATE INDEX idx_value ON test_index(value);

And the predicate `value = 5000`, if I run the original query that
excludes rows where the predicate is null it uses the index:

SELECT *
FROM test_index
WHERE value = 5000;

But as soon as I tack on an `IS NOT FALSE` the index is not used:

SELECT *
FROM test_index
WHERE (value = 5000) IS NOT FALSE;

This was surprising to me. I was hoping this might be able to use the
index.

Is there any way to include the rows where the predicate evaluates to null
while still using an index?

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Adrian Garcia Badaracco (#1)
Re: Wrapping a where clause to preserve rows with nulls

On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:

Is there any way to include the rows where the predicate evaluates to null
while still using an index?

That seems quite unlikely. Your definition of equality is incompatible
with the system’s standard definition while requiring that the standard
equals operator be used in the query. Null values and non-null values are
not considered equal, or, put another way, always considered distinct from
each other. A btree index, which handles =, can’t be told to behave
differently and so cannot fulfill your desire to produce rows where the
stored value is null; it can only produce those equal to 5000.

David J.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#3)
Re: Wrapping a where clause to preserve rows with nulls

"David G. Johnston" <david.g.johnston@gmail.com> writes:

On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:

Is there any way to include the rows where the predicate evaluates to null
while still using an index?

... A btree index, which handles =, can’t be told to behave
differently and so cannot fulfill your desire to produce rows where the
stored value is null; it can only produce those equal to 5000.

Not in a single scan, no. But multiple scans are possible:

regression=# create table t (id int unique);
CREATE TABLE
regression=# explain select * from t where id = 5000 or id is null;
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4)
Recheck Cond: ((id IS NULL) OR (id = 5000))
-> BitmapOr (cost=8.42..8.42 rows=14 width=0)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13 width=0)
Index Cond: (id IS NULL)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1 width=0)
Index Cond: (id = 5000)
(7 rows)

The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:

WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...

where each variable mentioned in original-clause is allowed
to also be NULL. Or perhaps what is wanted is

WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)

??

regards, tom lane

#5Adrian Garcia Badaracco
adrian@adriangb.com
In reply to: Tom Lane (#4)
Re: Wrapping a where clause to preserve rows with nulls

Thank you for the great idea Tom. While yes I can't modify the original
WHERE clause I do think I'll be able to introspect it or get the system
generating it to tell me which columns it references and then add an OR x
is NULL OR y is NULL ...

For context, just in case it's interesting, I store Parquet statistics in a
Postgres table and run the output of this thing on them:
https://github.com/apache/datafusion/blob/f92442ea8e8944c78f8e40d6648d049ff8e335ec/datafusion/physical-optimizer/src/pruning.rs#L146-L456
Hence why I can't really control the WHERE clause (at least not without
re-implementing a bunch of finicky error prone code).

On Wed, Dec 18, 2024 at 10:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

"David G. Johnston" <david.g.johnston@gmail.com> writes:

On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:

Is there any way to include the rows where the predicate evaluates to

null

while still using an index?

... A btree index, which handles =, can’t be told to behave
differently and so cannot fulfill your desire to produce rows where the
stored value is null; it can only produce those equal to 5000.

Not in a single scan, no. But multiple scans are possible:

regression=# create table t (id int unique);
CREATE TABLE
regression=# explain select * from t where id = 5000 or id is null;
QUERY PLAN

------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4)
Recheck Cond: ((id IS NULL) OR (id = 5000))
-> BitmapOr (cost=8.42..8.42 rows=14 width=0)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13
width=0)
Index Cond: (id IS NULL)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1
width=0)
Index Cond: (id = 5000)
(7 rows)

The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:

WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...

where each variable mentioned in original-clause is allowed
to also be NULL. Or perhaps what is wanted is

WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)

??

regards, tom lane

#6Adrian Garcia Badaracco
adrian@adriangb.com
In reply to: Adrian Garcia Badaracco (#5)
Re: Wrapping a where clause to preserve rows with nulls

Well, there is a wrinkle: if the predicate returns `false` but one of the
columns is null then the whole thing ends up `true` when I'd want it to be
`false`. Say col_a = [1] and col_b = [null]:

WHERE (col_a < 1 AND col_b > 1) OR col_a IS NULL OR col_b IS NULL -> WHERE
(false AND null) OR false OR true -> WHERE false OR false OR true -> true.

That's still a pretty good solution for now.

On Wed, Dec 18, 2024 at 10:41 PM Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:

Show quoted text

Thank you for the great idea Tom. While yes I can't modify the original
WHERE clause I do think I'll be able to introspect it or get the system
generating it to tell me which columns it references and then add an OR x
is NULL OR y is NULL ...

For context, just in case it's interesting, I store Parquet statistics in
a Postgres table and run the output of this thing on them:
https://github.com/apache/datafusion/blob/f92442ea8e8944c78f8e40d6648d049ff8e335ec/datafusion/physical-optimizer/src/pruning.rs#L146-L456
Hence why I can't really control the WHERE clause (at least not without
re-implementing a bunch of finicky error prone code).

On Wed, Dec 18, 2024 at 10:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"David G. Johnston" <david.g.johnston@gmail.com> writes:

On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:

Is there any way to include the rows where the predicate evaluates to

null

while still using an index?

... A btree index, which handles =, can’t be told to behave
differently and so cannot fulfill your desire to produce rows where the
stored value is null; it can only produce those equal to 5000.

Not in a single scan, no. But multiple scans are possible:

regression=# create table t (id int unique);
CREATE TABLE
regression=# explain select * from t where id = 5000 or id is null;
QUERY PLAN

------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4)
Recheck Cond: ((id IS NULL) OR (id = 5000))
-> BitmapOr (cost=8.42..8.42 rows=14 width=0)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13
width=0)
Index Cond: (id IS NULL)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1
width=0)
Index Cond: (id = 5000)
(7 rows)

The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:

WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...

where each variable mentioned in original-clause is allowed
to also be NULL. Or perhaps what is wanted is

WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)

??

regards, tom lane