Wasteful nested loop join when there is `limit` in the query

Started by WU Yan · about 1 year ago · 3 messages · general
#1 WU Yan <4wuyan@gmail.com>

Hello everyone, I am still learning postgres planner and performance
optimization, so please kindly point out if I missed something obvious.

I've noticed that postgres joins all rows in two tables, even though
there's a `limit` in the query. It means lots of joined rows are discarded
eventually, and the performance can be better if postgres planner can do
`limit` before the `join`.

Here's my example. I am running postgres 17.3 in a container:
```sh
docker run --rm -e POSTGRES_PASSWORD=mysecretpassword -d -p 5432:5432 \
    postgres:17.3
```
```
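
To run the SQL below against that container, a connection command along these lines should work (assuming `psql` is installed on the host):
```sh
psql "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
```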

```sql
create table department(
    id int primary key,
    name text);

create table employee(
    id int primary key,
    name text,
    department_id int);

insert into department values
    (1, 'sales'),
    (2, 'support'),
    (3, 'research'),
    (4, 'management'),
    (5, 'hr');

insert into employee values
    (101, 'alice', 1),
    (102, 'bob', 2),
    (103, 'ccc', 3),
    (104, 'ddd', 4),
    (105, 'eee', 999);

analyze department;
analyze employee;

set enable_mergejoin = off;
set enable_hashjoin = off;

explain analyze
select *
from employee left outer join department
    on employee.department_id = department.id
order by employee.name limit 2;
```

In this simple setup, I want to see the top 2 employees by name, and check
their department info. Here's the query plan:
```
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2.49..2.49 rows=2 width=23) (actual time=0.037..0.038 rows=2 loops=1)
   ->  Sort  (cost=2.49..2.50 rows=5 width=23) (actual time=0.036..0.037 rows=2 loops=1)
         Sort Key: employee.name
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop Left Join  (cost=0.00..2.44 rows=5 width=23) (actual time=0.014..0.021 rows=5 loops=1)
               Join Filter: (employee.department_id = department.id)
               Rows Removed by Join Filter: 11
               ->  Seq Scan on employee  (cost=0.00..1.05 rows=5 width=12) (actual time=0.006..0.007 rows=5 loops=1)
               ->  Materialize  (cost=0.00..1.07 rows=5 width=11) (actual time=0.001..0.001 rows=3 loops=5)
                     ->  Seq Scan on department  (cost=0.00..1.05 rows=5 width=11) (actual time=0.001..0.002 rows=5 loops=1)
 Planning Time: 0.248 ms
 Execution Time: 0.062 ms
(12 rows)
```

Note `loops=5` in the plan. My understanding is that all 5 rows in the
employee table are joined, although we only need 2. Joining before limiting
is necessary when we do not know whether the join will multiply or
eliminate rows. In this setup, however, it is a left outer join on a
primary key, so each employee row produces exactly one output row. A
similar guarantee holds for `a inner join b on a.x = b.x` when `a.x` is a
non-null foreign key referencing `b.x` and `b.x` has a unique constraint.
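
To make the inner-join case concrete, here is a sketch with hypothetical tables `a` and `b` (note that `a.x` also needs `not null`, otherwise an inner join could still eliminate rows):

```sql
create table b(
    x int primary key,  -- unique, so each matching row of "a" joins to at most one row of "b"
    payload text);
create table a(
    id int primary key,
    x int not null references b(x));  -- not null + foreign key: exactly one match per row

-- The join output has exactly as many rows as "a", so in principle the
-- limit could be applied to "a" before joining:
select *
from a inner join b on a.x = b.x
order by a.id limit 10;
```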

At the moment, my workaround is to rewrite the query with a CTE, which
means writing `order by ... limit ...` twice. With it, `loops=2` is
observed:
```
postgres=# explain analyze
with t as (
    select * from employee order by employee.name limit 2
)
select *
from t left outer join department on t.department_id = department.id
order by t.name limit 2;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.10..2.32 rows=2 width=23) (actual time=0.045..0.049 rows=2 loops=1)
   ->  Nested Loop Left Join  (cost=1.10..2.32 rows=2 width=23) (actual time=0.044..0.047 rows=2 loops=1)
         Join Filter: (employee.department_id = department.id)
         Rows Removed by Join Filter: 1
         ->  Limit  (cost=1.10..1.10 rows=2 width=12) (actual time=0.032..0.033 rows=2 loops=1)
               ->  Sort  (cost=1.10..1.11 rows=5 width=12) (actual time=0.032..0.033 rows=2 loops=1)
                     Sort Key: employee.name
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Seq Scan on employee  (cost=0.00..1.05 rows=5 width=12) (actual time=0.015..0.016 rows=5 loops=1)
         ->  Materialize  (cost=0.00..1.07 rows=5 width=11) (actual time=0.004..0.004 rows=2 loops=2)
               ->  Seq Scan on department  (cost=0.00..1.05 rows=5 width=11) (actual time=0.004..0.005 rows=2 loops=1)
 Planning Time: 0.196 ms
 Execution Time: 0.103 ms
(13 rows)
```

I wonder whether there is a better way to influence the planner to push the
`limit` down to the `employee` table first, or whether this optimization
strategy is simply not in the planner yet. The demo is trivial, but in some
real cases one might select the top 100 out of a million rows. Sorting a
million rows is slow but necessary and acceptable; a wasteful nested loop
join over millions of rows, however, can make performance much worse. Even
worse, the wasteful join can happen multiple times when more than two
tables are joined.
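
For the multi-table case, the same workaround generalizes: apply `order by ... limit ...` to the driving table in a CTE (or subquery) first, then join the remaining tables. A sketch, where the `office` table and `department.office_id` column are hypothetical:

```sql
-- Limit the driving table first, so only the surviving rows are joined.
with top_employees as (
    select * from employee order by name limit 100
)
select *
from top_employees t
left outer join department on t.department_id = department.id
left outer join office on department.office_id = office.id  -- hypothetical table/column
order by t.name limit 100;
```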

Thank you

Best regards
Yan

#2 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: WU Yan (#1)
Re: Wasteful nested loop join when there is `limit` in the query

WU Yan <4wuyan@gmail.com> writes:
> Hello everyone, I am still learning postgres planner and performance
> optimization, so please kindly point out if I missed something obvious.

An index on employee.name would likely help here. Even if we had
an optimization for pushing LIMIT down through a join (which you
are right, we don't) it could not push the LIMIT through a sort step.
So you need presorted output from the scan of "employee". I think
this example would behave better with that. You may also need to
test with non-toy amounts of data to get the plan you think is
better: an example with only half a dozen rows is going to be
swamped by startup costs.

regards, tom lane

#3 WU Yan <4wuyan@gmail.com>
In reply to: Tom Lane (#2)
Re: Wasteful nested loop join when there is `limit` in the query

Thank you for your help, Tom.

You are right. I added an index on employee.name (by making it unique), and
then postgres can visit the employee table in pre-sorted order and exit
early, without joining the remaining rows.

Just sharing the tweak I did to the example, if anyone else is interested
in a quick test. I also populated 1 million rows so the example is no
longer a toy demo.

```sql
drop table if exists department;
drop table if exists employee;

create table department(
    id int primary key,
    name text);

create table employee(
    id int primary key,
    name text unique,
    department_id int);

insert into department (id, name)
select i+1, 'department' || i+1
from generate_series(0, 9) as i;

insert into employee (id, name, department_id)
select i+1, 'name' || i+1, i % 10 + 1
from generate_series(0, 999999) as i;

analyze department;
analyze employee;

explain analyze
select *
from employee left outer join department
    on employee.department_id = department.id
order by employee.name limit 10;
```

And here is the plan:
```
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..1.36 rows=10 width=34) (actual time=0.017..0.030 rows=10 loops=1)
   ->  Nested Loop Left Join  (cost=0.57..78630.06 rows=1000000 width=34) (actual time=0.016..0.028 rows=10 loops=1)
         ->  Index Scan using employee_name_key on employee  (cost=0.42..54855.68 rows=1000000 width=18) (actual time=0.008..0.015 rows=10 loops=1)
         ->  Memoize  (cost=0.15..0.16 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops=10)
               Cache Key: employee.department_id
               Cache Mode: logical
               Hits: 6  Misses: 4  Evictions: 0  Overflows: 0  Memory Usage: 1kB
               ->  Index Scan using department_pkey on department  (cost=0.14..0.15 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops=4)
                     Index Cond: (id = employee.department_id)
 Planning Time: 0.189 ms
 Execution Time: 0.045 ms
(11 rows)
```

Personally, I still hope that someday postgres can push a `limit` node down
together with its `sort` node when certain conditions are met, so that
there is no need to add an index :D

Thank you again for your help!

On Mon, 17 Feb 2025 at 18:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
