Self-join optimisation

Started by Hywel Carveralmost 5 years ago6 messages
#1Hywel Carver
hywel@skillerwhale.com

Hi,

I asked this question in the Postgres Slack, and was recommended to ask
here instead.

A few times, I've been in a situation where I want to join a table to
itself on its primary key. That typically happens because I have some kind
of summary view, which I then want to join to the original table (using its
primary key) to flesh out the summary data with other columns. That's
executed as a join, which surprised me. But in this case, I could extend
the view to have all of the columns of the original table to avoid the join.

But there's another case that's harder to solve this way: combining views
together. Here's a trivial example:

CREATE TABLE users (id BIGINT PRIMARY KEY, varchar name);
CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);

EXPLAIN SELECT * FROM only_some_users
INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;

Hash Join (cost=29.23..43.32 rows=90 width=144)

Hash Cond: (users.id = users_1.id)

-> Bitmap Heap Scan on users (cost=6.24..19.62 rows=270 width=72)

Recheck Cond: (id < 10)

-> Bitmap Index Scan on users_pkey (cost=0.00..6.18 rows=270
width=0)
Index Cond: (id < 10)

-> Hash (cost=19.62..19.62 rows=270 width=72)

-> Bitmap Heap Scan on users users_1 (cost=6.24..19.62 rows=270
width=72)
Recheck Cond: (id > 3)

-> Bitmap Index Scan on users_pkey (cost=0.00..6.18
rows=270 width=0)
Index Cond: (id > 3)

Is there a reason why Postgres doesn't have an optimisation built in to
optimise this JOIN? What I'm imagining is that a join between two aliases
for the same table on its primary key could be optimised by treating them
as the same table. I think the same would be true for self-joins on any
non-null columns covered by a uniqueness constraint.

If this is considered a desirable change, I'd be keen to work on it (with
some guidance).

Thanks,

Hywel
<https://files.slack.com/files-pri/TMKTMS7PB-F01E0TSQH3P/sw_horizontal_colour__1_.png&gt;

#2Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Hywel Carver (#1)
Re: Self-join optimisation

On Thu, 11 Mar 2021 at 15:15, Hywel Carver <hywel@skillerwhale.com> wrote:

Hi,

I asked this question in the Postgres Slack, and was recommended to ask here instead.

A few times, I've been in a situation where I want to join a table to itself on its primary key. That typically happens because I have some kind of summary view, which I then want to join to the original table (using its primary key) to flesh out the summary data with other columns. That's executed as a join, which surprised me. But in this case, I could extend the view to have all of the columns of the original table to avoid the join.

But there's another case that's harder to solve this way: combining views together. Here's a trivial example:

CREATE TABLE users (id BIGINT PRIMARY KEY, varchar name);
CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);

EXPLAIN SELECT * FROM only_some_users
INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;

Hash Join (cost=29.23..43.32 rows=90 width=144)
Hash Cond: (users.id = users_1.id)
-> Bitmap Heap Scan on users (cost=6.24..19.62 rows=270 width=72)
Recheck Cond: (id < 10)
-> Bitmap Index Scan on users_pkey (cost=0.00..6.18 rows=270 width=0)
Index Cond: (id < 10)
-> Hash (cost=19.62..19.62 rows=270 width=72)
-> Bitmap Heap Scan on users users_1 (cost=6.24..19.62 rows=270 width=72)
Recheck Cond: (id > 3)
-> Bitmap Index Scan on users_pkey (cost=0.00..6.18 rows=270 width=0)
Index Cond: (id > 3)

Is there a reason why Postgres doesn't have an optimisation built in to optimise this JOIN? What I'm imagining is that a join between two aliases for the same table on its primary key could be optimised by treating them as the same table. I think the same would be true for self-joins on any non-null columns covered by a uniqueness constraint.

If this is considered a desirable change, I'd be keen to work on it (with some guidance).

There's currently a patch registered in the commitfest that could fix
this for you, called "Remove self join on a unique column" [0]https://commitfest.postgresql.org/31/1712/, thread at /messages/by-id/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru.

With regards,

Matthias van de Meent

[0]: https://commitfest.postgresql.org/31/1712/, thread at /messages/by-id/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru
/messages/by-id/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru

#3Hywel Carver
hywel@skillerwhale.com
In reply to: Matthias van de Meent (#2)
Re: Self-join optimisation

Great! It looks like it's been in commitfest status for a few years. Is
there anything someone like me (outside the pgsql-hackers community) can do
to help it get reviewed this time around?

On Thu, Mar 11, 2021 at 2:32 PM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

Show quoted text

On Thu, 11 Mar 2021 at 15:15, Hywel Carver <hywel@skillerwhale.com> wrote:

Hi,

I asked this question in the Postgres Slack, and was recommended to ask

here instead.

A few times, I've been in a situation where I want to join a table to

itself on its primary key. That typically happens because I have some kind
of summary view, which I then want to join to the original table (using its
primary key) to flesh out the summary data with other columns. That's
executed as a join, which surprised me. But in this case, I could extend
the view to have all of the columns of the original table to avoid the join.

But there's another case that's harder to solve this way: combining

views together. Here's a trivial example:

CREATE TABLE users (id BIGINT PRIMARY KEY, varchar name);
CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);

EXPLAIN SELECT * FROM only_some_users
INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;

Hash Join (cost=29.23..43.32 rows=90 width=144)
Hash Cond: (users.id = users_1.id)
-> Bitmap Heap Scan on users (cost=6.24..19.62 rows=270 width=72)
Recheck Cond: (id < 10)
-> Bitmap Index Scan on users_pkey (cost=0.00..6.18 rows=270

width=0)

Index Cond: (id < 10)
-> Hash (cost=19.62..19.62 rows=270 width=72)
-> Bitmap Heap Scan on users users_1 (cost=6.24..19.62

rows=270 width=72)

Recheck Cond: (id > 3)
-> Bitmap Index Scan on users_pkey (cost=0.00..6.18

rows=270 width=0)

Index Cond: (id > 3)

Is there a reason why Postgres doesn't have an optimisation built in to

optimise this JOIN? What I'm imagining is that a join between two aliases
for the same table on its primary key could be optimised by treating them
as the same table. I think the same would be true for self-joins on any
non-null columns covered by a uniqueness constraint.

If this is considered a desirable change, I'd be keen to work on it

(with some guidance).

There's currently a patch registered in the commitfest that could fix
this for you, called "Remove self join on a unique column" [0].

With regards,

Matthias van de Meent

[0] https://commitfest.postgresql.org/31/1712/, thread at

/messages/by-id/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru

#4Justin Pryzby
pryzby@telsasoft.com
In reply to: Matthias van de Meent (#2)
Re: Self-join optimisation

On Thu, Mar 11, 2021 at 03:32:16PM +0100, Matthias van de Meent wrote:

On Thu, 11 Mar 2021 at 15:15, Hywel Carver <hywel@skillerwhale.com> wrote:

I asked this question in the Postgres Slack, and was recommended to ask here instead.

A few times, I've been in a situation where I want to join a table to itself on its primary key. That typically happens because I have some kind of summary view, which I then want to join to the original table (using its primary key) to flesh out the summary data with other columns. That's executed as a join, which surprised me. But in this case, I could extend the view to have all of the columns of the original table to avoid the join.

But there's another case that's harder to solve this way: combining views together. Here's a trivial example:

CREATE TABLE users (id BIGINT PRIMARY KEY, varchar name);
CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);

EXPLAIN SELECT * FROM only_some_users
INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;

Hash Join (cost=29.23..43.32 rows=90 width=144)
Hash Cond: (users.id = users_1.id)
-> Bitmap Heap Scan on users (cost=6.24..19.62 rows=270 width=72)
Recheck Cond: (id < 10)
-> Bitmap Index Scan on users_pkey (cost=0.00..6.18 rows=270 width=0)
Index Cond: (id < 10)
-> Hash (cost=19.62..19.62 rows=270 width=72)
-> Bitmap Heap Scan on users users_1 (cost=6.24..19.62 rows=270 width=72)
Recheck Cond: (id > 3)
-> Bitmap Index Scan on users_pkey (cost=0.00..6.18 rows=270 width=0)
Index Cond: (id > 3)

Is there a reason why Postgres doesn't have an optimisation built in to optimise this JOIN? What I'm imagining is that a join between two aliases for the same table on its primary key could be optimised by treating them as the same table. I think the same would be true for self-joins on any non-null columns covered by a uniqueness constraint.

If this is considered a desirable change, I'd be keen to work on it (with some guidance).

There's currently a patch registered in the commitfest that could fix
this for you, called "Remove self join on a unique column" [0].

Maybe you'd want to test the patch and send a review.
https://commitfest.postgresql.org/32/1712/

--
Justin

#5Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Hywel Carver (#3)
Re: Self-join optimisation

On 3/11/21 3:39 PM, Hywel Carver wrote:

Great! It looks like it's been in commitfest status for a few years. Is
there anything someone like me (outside the pgsql-hackers community) can
do to help it get reviewed this time around?

Well, you could do a review, or at least test it with the queries your
application is actually running. And explain why your application is
doing queries like this, and why it can't be changed to changed to not
generate such queries.

The first couple of messages from the patch thread [1]/messages/by-id/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru (particularly the
messages from May 2018) are a good explanation why patches like this are
tricky to get through.

The basic assumption is that such queries are a bit silly, and it'd be
probably easier to modify the application not to generate them instead
of undoing the harm in the database planner. The problem is this makes
the planner more expensive for everyone, including people who carefully
write "good" queries.

And we don't want to do that, so we need to find a way to make this
optimization very cheap (essentially "free" if not applicable), but
that's difficult because there may be cases where the self-joins are
intentional, and undoing them would make the query slower. And doing
good decision requires enough information, but this decision needs to
happen quite early in the planning, when we have very little info.

So while it seems like a simple optimization, it's actually quite tricky
to get right.

(Of course, there are cases where you may get such queries even if you
try writing good SQL, say when joining views etc.)

regards

[1]: /messages/by-id/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru
/messages/by-id/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Hywel Carver
hywel@skillerwhale.com
In reply to: Tomas Vondra (#5)
Re: Self-join optimisation

Hi, thanks for your replies. I've tested the patch and it works for me in
the cases where I'd use it.

And explain why your application is doing queries like this, and why it

can't be changed to changed to not generate such queries.

Reading the thread, it looks like some of the requests for this feature are
coming from people using ORMs that generate bad queries. That's not been my
experience - I've always been able to find a way to construct the right
query through the ORM or just write correct SQL. When I've wanted this
feature has always been in relation to combining views.

For example, I was recently helping out a company that runs a booking
system for leisure activities, and their database has a view for how many
staff are available on a given day to supervise a given activity (e.g.
surfing), and a view for how much equipment is available on a given day
(e.g. how many surfboards). They also have a view for the equipment
requirements for a given activity (e.g. some boat trips require a minimum
of 2 boats and 4 oars). When they want to make bookings, they have to
combine data from these views, and the tables that create them.

It would definitely be possible to write one view that had all of this data
in (and maintain the other views too, which are needed elsewhere in the
site). And it could be made wide to have all of the rows from the source
tables. But it would, to me, feel like much better code to keep the
separate decomposed views and join them together for the booking query.
Right now, that query's performance suffers in a way that this feature
would fix. So the current choices are: accept worse performance with
decomposed views, write one very large and performant view but duplicate
some of the logic, or use their ORM to generate the SQL that they'd
normally put in a view.

On Thu, Mar 11, 2021 at 10:50 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

Show quoted text

On 3/11/21 3:39 PM, Hywel Carver wrote:

Great! It looks like it's been in commitfest status for a few years. Is
there anything someone like me (outside the pgsql-hackers community) can
do to help it get reviewed this time around?

Well, you could do a review, or at least test it with the queries your
application is actually running. And explain why your application is
doing queries like this, and why it can't be changed to changed to not
generate such queries.

The first couple of messages from the patch thread [1] (particularly the
messages from May 2018) are a good explanation why patches like this are
tricky to get through.

The basic assumption is that such queries are a bit silly, and it'd be
probably easier to modify the application not to generate them instead
of undoing the harm in the database planner. The problem is this makes
the planner more expensive for everyone, including people who carefully
write "good" queries.

And we don't want to do that, so we need to find a way to make this
optimization very cheap (essentially "free" if not applicable), but
that's difficult because there may be cases where the self-joins are
intentional, and undoing them would make the query slower. And doing
good decision requires enough information, but this decision needs to
happen quite early in the planning, when we have very little info.

So while it seems like a simple optimization, it's actually quite tricky
to get right.

(Of course, there are cases where you may get such queries even if you
try writing good SQL, say when joining views etc.)

regards

[1]

/messages/by-id/64486b0b-0404-e39e-322d-0801154901f3@postgrespro.ru

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company