Wrapping a where clause to preserve rows with nulls
I have a query where I have to run a where clause generated by another
system (i.e., I can't modify that where clause. The where clause may return
`null`, but I actually want to keep rows that return `null` (and rows that
return `true` but not rows that return `false`).
I thought it would be as simple as wrapping in `(...) is not false` but
that seems to prevent index usage.
For example, let's say that given the table:
CREATE TABLE test_index (value INTEGER);
CREATE INDEX idx_value ON test_index(value);
And the predicate `value = 5000`, if I run the original query that excludes
rows where the predicate is null it uses the index:
SELECT *
FROM test_index
WHERE value = 5000;
But as soon as I tack on an `IS NOT FALSE` the index is not used:
SELECT *
FROM test_index
WHERE (value = 5000) IS NOT FALSE;
This was surprising to me. I was hoping this might be able to use the index.
Is there any way to include the rows where the predicate evaluates to null
while still using an index?
I'll note that the clause is arbitrary in the sense that I don't generate
it and cannot edit it but it's basically a bunch of boolean comparisons
chained i.e. `col_a >= 1 and col_b <=5 and col_c ...` so I can in general
add an index on say col_a and it does get used.
On Wed, Dec 18, 2024 at 9:47 PM Adrian Garcia Badaracco <adrian@adriangb.com>
wrote:
Show quoted text
I have a query where I have to run a where clause generated by another
system (i.e., I can't modify that where clause. The where clause may return
`null`, but I actually want to keep rows that return `null` (and rows that
return `true` but not rows that return `false`).I thought it would be as simple as wrapping in `(...) is not false` but
that seems to prevent index usage.For example, let's say that given the table:
CREATE TABLE test_index (value INTEGER);
CREATE INDEX idx_value ON test_index(value);And the predicate `value = 5000`, if I run the original query that
excludes rows where the predicate is null it uses the index:SELECT *
FROM test_index
WHERE value = 5000;But as soon as I tack on an `IS NOT FALSE` the index is not used:
SELECT *
FROM test_index
WHERE (value = 5000) IS NOT FALSE;This was surprising to me. I was hoping this might be able to use the
index.Is there any way to include the rows where the predicate evaluates to null
while still using an index?
On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:
Is there any way to include the rows where the predicate evaluates to null
while still using an index?
That seems quite unlikely. Your definition of equality is incompatible
with the system’s standard definition while requiring that the standard
equals operator be used in the query. Null values and non-null values are
not considered equal, or, put another way, always considered distinct from
each other. A btree index, which handles =, can’t be told to behave
differently and so cannot fulfill your desire to produce rows where the
stored value is null; it can only produce those equal to 5000.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:Is there any way to include the rows where the predicate evaluates to null
while still using an index?
... A btree index, which handles =, can’t be told to behave
differently and so cannot fulfill your desire to produce rows where the
stored value is null; it can only produce those equal to 5000.
Not in a single scan, no. But multiple scans are possible:
regression=# create table t (id int unique);
CREATE TABLE
regression=# explain select * from t where id = 5000 or id is null;
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4)
Recheck Cond: ((id IS NULL) OR (id = 5000))
-> BitmapOr (cost=8.42..8.42 rows=14 width=0)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13 width=0)
Index Cond: (id IS NULL)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1 width=0)
Index Cond: (id = 5000)
(7 rows)
The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:
WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...
where each variable mentioned in original-clause is allowed
to also be NULL. Or perhaps what is wanted is
WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)
??
regards, tom lane
Thank you for the great idea Tom. While yes I can't modify the original
WHERE clause I do think I'll be able to introspect it or get the system
generating it to tell me which columns it references and then add an OR x
is NULL OR y is NULL ...
For context, just in case it's interesting, I store Parquet statistics in a
Postgres table and run the output of this thing on them:
https://github.com/apache/datafusion/blob/f92442ea8e8944c78f8e40d6648d049ff8e335ec/datafusion/physical-optimizer/src/pruning.rs#L146-L456
Hence why I can't really control the WHERE clause (at least not without
re-implementing a bunch of finicky error prone code).
On Wed, Dec 18, 2024 at 10:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
"David G. Johnston" <david.g.johnston@gmail.com> writes:
On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:Is there any way to include the rows where the predicate evaluates to
null
while still using an index?
... A btree index, which handles =, can’t be told to behave
differently and so cannot fulfill your desire to produce rows where the
stored value is null; it can only produce those equal to 5000.Not in a single scan, no. But multiple scans are possible:
regression=# create table t (id int unique);
CREATE TABLE
regression=# explain select * from t where id = 5000 or id is null;
QUERY PLAN------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4)
Recheck Cond: ((id IS NULL) OR (id = 5000))
-> BitmapOr (cost=8.42..8.42 rows=14 width=0)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13
width=0)
Index Cond: (id IS NULL)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1
width=0)
Index Cond: (id = 5000)
(7 rows)The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...
where each variable mentioned in original-clause is allowed
to also be NULL. Or perhaps what is wanted isWHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)
??
regards, tom lane
Well, there is a wrinkle: if the predicate returns `false` but one of the
columns is null then the whole thing ends up `true` when I'd want it to be
`false`. Say col_a = [1] and col_b = [null]:
WHERE (col_a < 1 AND col_b > 1) OR col_a IS NULL OR col_b IS NULL -> WHERE
(false AND null) OR false OR true -> WHERE false OR false OR true -> true.
That's still a pretty good solution for now.
On Wed, Dec 18, 2024 at 10:41 PM Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:
Show quoted text
Thank you for the great idea Tom. While yes I can't modify the original
WHERE clause I do think I'll be able to introspect it or get the system
generating it to tell me which columns it references and then add an OR x
is NULL OR y is NULL ...For context, just in case it's interesting, I store Parquet statistics in
a Postgres table and run the output of this thing on them:
https://github.com/apache/datafusion/blob/f92442ea8e8944c78f8e40d6648d049ff8e335ec/datafusion/physical-optimizer/src/pruning.rs#L146-L456
Hence why I can't really control the WHERE clause (at least not without
re-implementing a bunch of finicky error prone code).On Wed, Dec 18, 2024 at 10:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:Is there any way to include the rows where the predicate evaluates to
null
while still using an index?
... A btree index, which handles =, can’t be told to behave
differently and so cannot fulfill your desire to produce rows where the
stored value is null; it can only produce those equal to 5000.Not in a single scan, no. But multiple scans are possible:
regression=# create table t (id int unique);
CREATE TABLE
regression=# explain select * from t where id = 5000 or id is null;
QUERY PLAN------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4)
Recheck Cond: ((id IS NULL) OR (id = 5000))
-> BitmapOr (cost=8.42..8.42 rows=14 width=0)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13
width=0)
Index Cond: (id IS NULL)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1
width=0)
Index Cond: (id = 5000)
(7 rows)The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...
where each variable mentioned in original-clause is allowed
to also be NULL. Or perhaps what is wanted isWHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)
??
regards, tom lane