How to hint 2 coulms IS NOT DISTINCT FROM each other
Hi
I was wondering if there is a way to hint that two columns in two different tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if table_a.key = 'test' THEN table_b.key = 'test' .
The equals operator already does this but it does not handle NULLS very well (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and doesn't establish the same inference rules as equals.
Example:
CREATE TABLE a (
id INTEGER PRIMARY KEY,
key VARCHAR,
value VARCHAR
);
CREATE INDEX ON a (key);
INSERT INTO a
VALUES (1, 'test', 'test'), (2, 'foo', 'bar'), (3, null, null), (4, 'baz', 'qoz');
CREATE VIEW view_a AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
);
CREATE VIEW view_a_eq AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key = table_b.key
);
CREATE VIEW view_a_distinct AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key IS NOT DISTINCT FROM table_b.key
);
EXPLAIN SELECT * FROM view_a WHERE key = 'test';
QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=12.69..34.42 rows=4 width=100)
Hash Cond: (table_b.id = table_a.id)
-> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36)
-> Hash (cost=12.64..12.64 rows=4 width=68)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
We only get index scan on table_a
EXPLAIN SELECT * FROM view_a_eq WHERE key = 'test';
QUERY PLAN
------------------------------------------------------------------------------------
Nested Loop (cost=8.36..25.53 rows=1 width=100)
Join Filter: (table_a.id = table_b.id)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
-> Materialize (cost=4.18..12.66 rows=4 width=36)
-> Bitmap Heap Scan on a table_b (cost=4.18..12.64 rows=4 width=36)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
We get index scan on both tables and the where clause is pushed all the way down
EXPLAIN SELECT * FROM view_a_distinct WHERE key = 'test';
QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=12.69..34.43 rows=1 width=100)
Hash Cond: (table_b.id = table_a.id)
Join Filter: (NOT ((table_a.key)::text IS DISTINCT FROM (table_b.key)::text))
-> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36)
-> Hash (cost=12.64..12.64 rows=4 width=68)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
Same as the first example
In these examples it really doesn't matter which plan is used. But for larger view it might be a great hint for the optimizer to know that since we are joining on a primary key we could hint that all other columns for the table is actually NOT DISTINCT from each other. This will result in the planner being able to push the WHERE condition down into the other joined tables.
It works well for the = operator, but it is easy to see the it fails as soon as we asks about nulls.
SELECT * FROM view_a_eq WHERE key IS NULL;
id | key | value | b_key
----+-----+-------+-------
(0 rows)
Maybe my examples are too simple, but I hope you can verify the same will hold for a much larger table with 2 different views on top of them.
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
Hi
I was wondering if there is a way to hint that two columns in two different
tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
table_a.key = 'test' THEN table_b.key = 'test' .The equals operator already does this but it does not handle NULLS very well
(or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
doesn't establish the same inference rules as equals.
The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.
Suggestion for getting help, put a large-ish aka production sized
amount of data into your db, run your queries with explain analyze and
feed them to https://explain.depesz.com/ and post the links here along
with the slow queries. A lot of times the fix is non-obvious if you're
coming from another db with a different set of troubleshooting skills
for slow queries.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Scott Marlowe <scott.marlowe@gmail.com> writes:
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
I was wondering if there is a way to hint that two columns in two different
tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
table_a.key = 'test' THEN table_b.key = 'test' .The equals operator already does this but it does not handle NULLS very well
(or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
doesn't establish the same inference rules as equals.
The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.
The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL. You'd
be best off to rethink the representation. We've not seen enough info
about your requirements to suggest just how, though.
regards, tom lane
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Fri, Oct 28, 2016 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Scott Marlowe <scott.marlowe@gmail.com> writes:
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
I was wondering if there is a way to hint that two columns in two different
tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
table_a.key = 'test' THEN table_b.key = 'test' .The equals operator already does this but it does not handle NULLS very well
(or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
doesn't establish the same inference rules as equals.The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL. You'd
be best off to rethink the representation. We've not seen enough info
about your requirements to suggest just how, though.
Well, ISTM OP is simply asking, "why can't IS [NOT] DISTINCT FROM be
optimized"? It ought to be, at least in some cases. Internally
indexes handle nulls so you should be able to implement them to
satisfy those kinds of scans. I guess that's an easy thing to say
though.
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Fri, Oct 28, 2016 at 1:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Fri, Oct 28, 2016 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Scott Marlowe <scott.marlowe@gmail.com> writes:
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
I was wondering if there is a way to hint that two columns in two different
tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
table_a.key = 'test' THEN table_b.key = 'test' .The equals operator already does this but it does not handle NULLS very well
(or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
doesn't establish the same inference rules as equals.The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL. You'd
be best off to rethink the representation. We've not seen enough info
about your requirements to suggest just how, though.Well, ISTM OP is simply asking, "why can't IS [NOT] DISTINCT FROM be
optimized"? It ought to be, at least in some cases. Internally
indexes handle nulls so you should be able to implement them to
satisfy those kinds of scans. I guess that's an easy thing to say
though.
hm. ISTM "a IS NOT DISTINCT FROM b"...could be rewritten into
((a IS NULL AND b IS NULL) OR (a = b)) ... which is an indexable expression.
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 10/28/2016 9:29 AM, Kim Rose Carlsen wrote:
JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key = table_b.key
Anyways, to use an index for that join, you'd need a composite index on
id *AND* key, not two separate indexes.
--
john r pierce, recycling bits in santa cruz
The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL. You'd
be best off to rethink the representation. We've not seen enough info
about your requirements to suggest just how, though.
Sometimes I do wake up in the night scared and afraid that I have used NULLs the wrong way
my whole life. I usually use NULLs to denote a value has not been provided. In my dreams I fear
I should have used empty string instead but if that is true, then I no longer know what is right and
wrong :).
In this specific case, its a resource management table. If a resource is not allocated to any
customers, then customer_id is set to null else the customer_id is set.
________________________________
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Friday, October 28, 2016 8:17:01 PM
To: Scott Marlowe
Cc: Kim Rose Carlsen; pgsql-general@postgresql.org
Subject: Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
Scott Marlowe <scott.marlowe@gmail.com> writes:
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
I was wondering if there is a way to hint that two columns in two different
tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
table_a.key = 'test' THEN table_b.key = 'test' .The equals operator already does this but it does not handle NULLS very well
(or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
doesn't establish the same inference rules as equals.
The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.
The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL. You'd
be best off to rethink the representation. We've not seen enough info
about your requirements to suggest just how, though.
regards, tom lane
JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key = table_b.key
Anyways, to use an index for that join, you'd need a composite index on id *AND* key, not two separate indexes.
Its not as much as for using the index, but to be able to push the where clause inside both JOINED tables.
The = operator already does this. It gives the planner the option to join the table in using either id = id or key = key. It can deduce that if I have a WHERE condition with table_a.key = 'Something', then table_b.key must also be 'Something'. It can then decide to filter table_b on key. When using IS NOT DISTINCT FROM, the planner is not considering the same options. Now its like it doesn't know table_a.key is same the same as table_b.key.
I would somehow expect the IS NOT DISTINCT FROM operator to do the same. As it establish the same rules.. If a = 'test' and a IS NOT DISTINCT FROM b then b = 'test' also
-
Kim Carlsen
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
Hi
I was wondering if there is a way to hint that two columns in two different
tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
table_a.key = 'test' THEN table_b.key = 'test' .The equals operator already does this but it does not handle NULLS very well
(or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
doesn't establish the same inference rules as equals.The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.Suggestion for getting help, put a large-ish aka production sized
amount of data into your db, run your queries with explain analyze and
feed them to https://explain.depesz.com/ and post the links here along
with the slow queries. A lot of times the fix is non-obvious if you're
coming from another db with a different set of troubleshooting skills
for slow queries.
The problem is how to reduce the problem into its core, without introducing
all the unnecessary.
Maybe simplifying the problem, also makes it impossible to say where I go
wrong. It might be that I try to push too much logic into the SQL layer
and Im adding too many layers of abstraction to accomplish what I want.
So let me try and elaborate a little more.
I have couple a tables describing resources (circuits) and allocation
of resources to customers and products.
First layer is a view called view_circuit. This view (left) join any table
the circuit table reference through a foreign key (it gives exactly the same
rows and columns as circuit table + some extra information like customer_id).
Second layer is 2 views
1) a view describing if the circuit is active or inactive, lets call it
view_circuit_product_main
2) a view describing line_speed about the circuit, lets call it
view_circuit_product
These views use aggregations (both GROUP BY and SELECT DISTINCT ON (...))
if this has any relevance.
Third layer
Next step is to add a view that tells both (joins the two views together
on circuit_id). lets call the new view view_circuit_with_status
This view is defined as
CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
);
SELECT * FROM view_circuit_with_status WHERE customer_id = 1;
Since customer_id is exposed through view_circuit the planner assumes view_circuit.customer_id = 1 and from there attempts to join
view_circuit_product_main and view_circuit_product using circuit_id.
This is not running optimal.
However if we change our query to allow the inference rule to take place, the query is executed very fast.
SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND s_customer_id = 1 AND p_customer_id = 1;
If a circuit is not assigned to any customers customer_id is set to NULL. This is the reason I can't use = operator. If I do use = then I can't find circuit which are unassigned, but the query do run effective.
I can see this still ends up being quite abstract, but the point is it would be quite beneficial if IS NOT DISTINCT used the same rules as = operator.
I have attached the 2 query plans
Bad plan: https://explain.depesz.com/s/SZN
Good plan: https://explain.depesz.com/s/61Ro
-
Kim Carlsen
Do you use potatoes for long posts here?
On Fri, Oct 28, 2016 at 2:54 PM, Kim Rose Carlsen <krc@hiper.dk> wrote:
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
Hi
I was wondering if there is a way to hint that two columns in two
different
tables IS NOT DISTINCT FROM each other. So that the optimizer may assume
if
table_a.key = 'test' THEN table_b.key = 'test' .The equals operator already does this but it does not handle NULLS very
well
(or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
doesn't establish the same inference rules as equals.The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.Suggestion for getting help, put a large-ish aka production sized
amount of data into your db, run your queries with explain analyze and
feed them to https://explain.depesz.com/ and post the links here along
with the slow queries. A lot of times the fix is non-obvious if you're
coming from another db with a different set of troubleshooting skills
for slow queries.The problem is how to reduce the problem into its core, without introducing
all the unnecessary.Maybe simplifying the problem, also makes it impossible to say where I go
wrong. It might be that I try to push too much logic into the SQL layer
and Im adding too many layers of abstraction to accomplish what I want.
So let me try and elaborate a little more.I have couple a tables describing resources (circuits) and allocation
of resources to customers and products.First layer is a view called view_circuit. This view (left) join any table
the circuit table reference through a foreign key (it gives exactly the same
rows and columns as circuit table + some extra information like
customer_id).Second layer is 2 views
1) a view describing if the circuit is active or inactive, lets call it
view_circuit_product_main
2) a view describing line_speed about the circuit, lets call it
view_circuit_productThese views use aggregations (both GROUP BY and SELECT DISTINCT ON (...))
if this has any relevance.Third layer
Next step is to add a view that tells both (joins the two views together
on circuit_id). lets call the new view view_circuit_with_statusThis view is defined as
CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
);SELECT * FROM view_circuit_with_status WHERE customer_id = 1;
Since customer_id is exposed through view_circuit the planner assumes
view_circuit.customer_id = 1 and from there attempts to join
view_circuit_product_main and view_circuit_product using circuit_id.
This is not running optimal.However if we change our query to allow the inference rule to take place,
the query is executed very fast.SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND
s_customer_id = 1 AND p_customer_id = 1;If a circuit is not assigned to any customers customer_id is set to NULL.
This is the reason I can't use = operator. If I do use = then I can't find
circuit which are unassigned, but the query do run effective.I can see this still ends up being quite abstract, but the point is it would
be quite beneficial if IS NOT DISTINCT used the same rules as = operator.I have attached the 2 query plans
Bad plan: https://explain.depesz.com/s/SZN
Good plan: https://explain.depesz.com/s/61Ro
try this :-D
create or replace function indf(anyelement, anyelement) returns anyelement as
$$
select $1 = $2 or ($1 is null and $2 is null);
$$ language sql;
CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND indf(r.customer_id, s.customer_id)
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND indf(r.customer_id, s.customer_id)
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Fri, Oct 28, 2016 at 12:54 PM, Kim Rose Carlsen <krc@hiper.dk> wrote:
CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
);SELECT * FROM view_circuit_with_status WHERE customer_id = 1;
A couple of random thoughts:
1.
TABLE circuit (circuit_id)
TABLE customer (customer_id)
TABLE circuit_customer (circuit_id, customer_id)
2.
INSERT INTO customer (0, 'Customer number standing in for customer not
assigned');
While IS DISTINCT FROM could very well use some more love I surmise the
reason it hasn't gotten it is because most data model set things up so that
it never needs to be used.
David J.
try this :-D
create or replace function indf(anyelement, anyelement) returns anyelement as
$$
select $1 = $2 or ($1 is null and $2 is null);
$$ language sql;CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND indf(r.customer_id, s.customer_id)
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND indf(r.customer_id, s.customer_id)merlin
This doesn't do much good. This doesn't tell the planner that the 3 customer_ids are actually of same value, and it therefore can't filter them as it sees fit.
On Sat, Oct 29, 2016 at 6:55 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
try this :-D
create or replace function indf(anyelement, anyelement) returns anyelement
as
$$
select $1 = $2 or ($1 is null and $2 is null);
$$ language sql;CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND indf(r.customer_id, s.customer_id)
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND indf(r.customer_id, s.customer_id)merlin
This doesn't do much good. This doesn't tell the planner that the 3
customer_ids are actually of same value, and it therefore can't filter them
as it sees fit.
You do know you can index on a function, and the planner then keeps
stats on it when you run analyze right?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
This doesn't do much good. This doesn't tell the planner that the 3
customer_ids are actually of same value, and it therefore can't filter them
as it sees fit.
You do know you can index on a function, and the planner then keeps
stats on it when you run analyze right?
Yes, but I don't think it will make any difference. I don't think I can solve this with
an index lookup. I think my savior is the inference that the 2 columns are of
same value and the planner are free to choose which order to do the filter and join
with this extra information.
I have tried creating a function called
zero_if_null(int) : int that just select COALESCE($1, 0)
and adding a index on (zero_if_null(customer_id)) on table that contains customer_id. The only thing I get from is the planner now only knows how to compare customer_id, but it still doesn't know that they are of same value, only I know that and I want to declare it for the planner.
I could probably rewrite the whole view in one query, and then fix it with a proper index. But I think I will loose alot of readability.
I could also change the structure to save an explicit state, instead of a calculated state. But then I get some redundancy I need to make sure always stays the same.
In the end one of these will probably be the solution.
I guess the question is more or less,
why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are there any alternatives? And a plausible use case for when it would be useful.
On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
This doesn't do much good. This doesn't tell the planner that the 3
customer_ids are actually of same value, and it therefore can't filter
them
as it sees fit.You do know you can index on a function, and the planner then keeps
stats on it when you run analyze right?Yes, but I don't think it will make any difference. I don't think I can
solve this with
an index lookup. I think my savior is the inference that the 2 columns are
of
same value and the planner are free to choose which order to do the filter
and join
with this extra information.I have tried creating a function called
zero_if_null(int) : int that just select COALESCE($1, 0)
and adding a index on (zero_if_null(customer_id)) on table that contains
customer_id. The only thing I get from is the planner now only knows how to
compare customer_id, but it still doesn't know that they are of same value,
only I know that and I want to declare it for the planner.I could probably rewrite the whole view in one query, and then fix it with a
proper index. But I think I will loose alot of readability.I could also change the structure to save an explicit state, instead of a
calculated state. But then I get some redundancy I need to make sure always
stays the same.In the end one of these will probably be the solution.
I guess the question is more or less,
why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are
there any alternatives? And a plausible use case for when it would be
useful.
Well, the *behavior* is mandated by the sql standard. Our
implementation is slow however. I'm surprised the attached function
didn't help, it can be inlined and I was able to get bitmap or which
is pretty good. As I said upthread I think INDF could theoretically
run as fast as equality -- it just doesn't today.
As your joins are written you could probably convert this by reserving
a customer_id to the work that you're trying to do with null, say, 0,
or -1. This is a somewhat dubious practice but seems a better fit for
your use case. I don't think INDF is good in this usage.
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
I have tried creating a function called
zero_if_null(int) : int that just select COALESCE($1, 0)
and adding a index on (zero_if_null(customer_id)) on table that contains
customer_id. The only thing I get from is the planner now only knows how to
compare customer_id, but it still doesn't know that they are of same value,
only I know that and I want to declare it for the planner.
Well, the *behavior* is mandated by the sql standard. Our
implementation is slow however.
Sorry I'm not following, what behavior is mandated by the sql standard?
I'm surprised the attached function
didn't help, it can be inlined and I was able to get bitmap or which
is pretty good. As I said upthread I think INDF could theoretically
run as fast as equality -- it just doesn't today.
It might be harsh to say that it doesn't help at all. I does half the running time,
but I need it to run an order of magnitude faster. Here is the plan with the
empty_if_null (customer_id is actually varchar)
https://explain.depesz.com/s/M1LV with empty_if_null + functional index
https://explain.depesz.com/s/eOL with indf
As your joins are written you could probably convert this by reserving
a customer_id to the work that you're trying to do with null, say, 0,
or -1. This is a somewhat dubious practice but seems a better fit for
your use case. I don't think INDF is good in this usage.merlin
This will work well, I think.
But I'm not sure I can mentally accept an unfilled value should not be
null (eg. 0, '', '0000-01-01'). But I can see how the equals operator will
work well with this.
It might raise another problem, that the nulls are generated through LEFT
JOINS where now rows are defined. Then the 0 or -1 value need to be
a computed value. Won't this throw of index lookups? (I might be
more confused in this area).
On Mon, Oct 31, 2016 at 9:28 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
I have tried creating a function called
zero_if_null(int) : int that just select COALESCE($1, 0)
and adding a index on (zero_if_null(customer_id)) on table that contains
customer_id. The only thing I get from is the planner now only knows how
to
compare customer_id, but it still doesn't know that they are of same
value,
only I know that and I want to declare it for the planner.Well, the *behavior* is mandated by the sql standard. Our
implementation is slow however.Sorry I'm not following, what behavior is mandated by the sql standard?
The semantics of IS DISTINCT FROM, basically, equality with special
consideration for nulls.
But I'm not sure I can mentally accept an unfilled value should not be
null (eg. 0, '', '0000-01-01'). But I can see how the equals operator will
work well with this.
Yeah, this is a dubious tactic and I would normally only consider
using it for surrogate identifiers.
It might raise another problem, that the nulls are generated through LEFT
JOINS where now rows are defined. Then the 0 or -1 value need to be
a computed value. Won't this throw of index lookups? (I might be
more confused in this area).
Not following this. BTW, if you want a fast plan over the current
data without consideration of aesthetics, try this:
CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id, s.customer_id
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND r.customer_id, s.customer_id
UNION ALL SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id IS NULL
AND s.customer_id IS NULL
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
It might raise another problem, that the nulls are generated through LEFT
JOINS where no rows are defined. Then the 0 or -1 value need to be
a computed value. Won't this throw off index lookups? (I might be
more confused in this area).Not following this.
The nulls are generated by something like this
SELECT c.circuit_id,
cc.customer_id
FROM circuit AS c
LEFT JOIN circuit_customer AS cc
ON c.circuit_id = cc.circuit_id
To make a magic '0' customer we would be required to use
COALESCE(cc.customer_id, '0')
I dont think the optimizer will do anything clever with the '0' we have
computed from null.
I could ofc. by default assign all unassigned circuits to '0' in
circuit_customer. I'm not a fan though.
BTW, if you want a fast plan over the current
data without consideration of aesthetics, try this:CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id, s.customer_id
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND r.customer_id, s.customer_id
UNION ALL SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id IS NULL
AND s.customer_id IS NULL
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id>
I will have to figure something out, but this specific case is still problematic
since we would like to filter this view using different criteria's, like circuit_no,
products or customers.
But with all these detours, I assume that a change to IS NOT DISTINCT FROM,
is difficult or not wanted?
On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
It might raise another problem, that the nulls are generated through LEFT
JOINS where no rows are defined. Then the 0 or -1 value need to be
a computed value. Won't this throw off index lookups? (I might be
more confused in this area).Not following this.
The nulls are generated by something like this
SELECT c.circuit_id,
cc.customer_id
FROM circuit AS c
LEFT JOIN circuit_customer AS cc
ON c.circuit_id = cc.circuit_idTo make a magic '0' customer we would be required to use
COALESCE(cc.customer_id, '0')
I dont think the optimizer will do anything clever with the '0' we have
computed from null.
It would if you explicitly indexed it as such;
CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0'));
I could ofc. by default assign all unassigned circuits to '0' in
circuit_customer. I'm not a fan though.
hm, why not? null generally means 'unknown' and that's why it fails
any equality test.
BTW, if you want a fast plan over the current
data without consideration of aesthetics, try this:CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id, s.customer_id
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND r.customer_id, s.customer_id
UNION ALL SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id IS NULL
AND s.customer_id IS NULL
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id>I will have to figure something out, but this specific case is still
problematic
since we would like to filter this view using different criteria's, like
circuit_no,
products or customers.
the above is logically equivalent to IS NOT DISTINCT FROM; you should
be able to query it as you would have done the original view.
But with all these detours, I assume that a change to IS NOT DISTINCT FROM,
is difficult or not wanted?
Well, not exactly. In your case you are trying to treat null as a
specific value and pass it through join operations.
TBH, this is a pretty dubious approach: null is not supposed to be
equal to anything and any join vs null should come up empty --
logically at least. INDF works around this of course but it's not a
recommended approach (my usage is generally restricted to, "has this
value changed since yesterday? etc").
I'm not an expert backend structures for indexing and optimization but
I know enough to suspect that optimizing INDF might cause
implementation headaches in various places, as do other irregular
syntactical approaches in SQL. I think minimally optimizing INDF
would require converting it to an operator on par with '=' which is a
pretty large infrastructure change for an edge optimization case.
The fact that there are solid optimization strategies already on the
table (UNION ALL, expr index COALESCE()) does not help.
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 4 November 2016 at 14:41, Merlin Moncure <mmoncure@gmail.com> wrote:
On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
The nulls are generated by something like this
SELECT c.circuit_id,
cc.customer_id
FROM circuit AS c
LEFT JOIN circuit_customer AS cc
ON c.circuit_id = cc.circuit_idTo make a magic '0' customer we would be required to use
COALESCE(cc.customer_id, '0')
I dont think the optimizer will do anything clever with the '0' we have
computed from null.It would if you explicitly indexed it as such;
CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0'));
Merlin, it's a LEFT JOIN. There probably are no NULLs in the
circuit_customer.customer_id column, so that COALESCE isn't going to
achieve anything at all.
I haven't been following this particular discussion in detail, so
unfortunately I can't contribute more than that remark at the moment.
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general