Functionally dependent columns in SELECT DISTINCT

Started by Willow Charginover 1 year ago6 messagesgeneral

postgresql@wchargin.com

over 1 year ago

Hello! Postgres lets us omit columns from a GROUP BY clause if they are
functionally dependent on a grouped key, which is a nice quality-of-life
feature. I'm wondering if a similar relaxation could be permitted for
the SELECT DISTINCT list?

I have a query where I want to find the most recent few items from a
table that match some complex condition, where the condition involves
joining other tables. Here's an example, with two approaches:

-- Store some data: an "item" has one or more "parts".
CREATE TABLE items(id int PRIMARY KEY, create_time timestamptz);
CREATE TABLE parts(part_id int PRIMARY KEY, item_id int);
INSERT INTO items(id, create_time)
SELECT i, now() - make_interval(secs => i)
FROM generate_series(1, 1000000) s(i);
INSERT INTO parts(item_id, part_id)
SELECT items.id, 2 * items.id + delta
FROM items, (VALUES(0), (1)) delta(delta);
CREATE INDEX ON items(create_time DESC);
CREATE INDEX ON parts(item_id);
ANALYZE items, parts;

-- Suppose we want to find the most recent few items with a part
-- whose part ID is threeven. Two approaches:

-- SELECT DISTINCT: fast, but superfluous column:
EXPLAIN ANALYZE
SELECT DISTINCT items.id, create_time
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC
LIMIT 5;
-- 4ms:
-- parallel index scan on items_create_time_idx
-- -> nested loop index scan parts_item_id_idx
-- -> incremental sort -> gather merge -> unique -> limit

-- GROUP BY: slow, but functional dependency recognized:
EXPLAIN ANALYZE
SELECT items.id
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
GROUP BY items.id
ORDER BY create_time DESC
LIMIT 5;
-- 400ms:
-- parallel seq scan on parts
-- -> parallel hash join on item_id via seq scan on items
-- -> sort -> group -> gather merge -> group -> sort -> limit

These timings are Postgres 14.5 on a Linux i7-1165G7. With Postgres 16.3
on an Apple M3 Pro, the shape is the same: the GROUP BY is about 300ms,
and the SELECT DISTINCT is way faster still, at 0.07ms. (It declines to
parallelize, which seems to help.)

I want to use the faster approach, and it works without issue so far.
But that extra column in the SELECT list is a bit inconvenient.

My questions are:

- Do I understand right that these kinds of queries are equivalent?

- If so, does the SQL standard permit Postgres to recognize functional
dependency in this case, so that users may omit the order column
column from the `SELECT DISTINCT` list? (I don't have a copy of the
standard to check myself.)

- Might future versions of Postgres allow this?

thanks!
~Willow

Thomas Kellerer

shammat@gmx.net

over 1 year ago

In reply to: Willow Chargin (#1)

Re: Functionally dependent columns in SELECT DISTINCT

Willow Chargin schrieb am 13.09.2024 um 07:20:

Hello! Postgres lets us omit columns from a GROUP BY clause if they are
functionally dependent on a grouped key, which is a nice quality-of-life
feature. I'm wondering if a similar relaxation could be permitted for
the SELECT DISTINCT list?

I have a query where I want to find the most recent few items from a
table that match some complex condition, where the condition involves
joining other tables. Here's an example, with two approaches:

What about using DISTINCT ON () ?
SELECT DISTINCT ON (items.id) items.*
FROM items
JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY items.id,items.create_time DESC
LIMIT 5;

This gives me this plan: https://explain.depesz.com/s/QHr6 on 16.2 (Windows, i7-1260P)

Willow Chargin

postgresql@wchargin.com

over 1 year ago

In reply to: Thomas Kellerer (#2)

Re: Functionally dependent columns in SELECT DISTINCT

On Thu, Sep 12, 2024 at 11:13 PM <shammat@gmx.net> wrote:

What about using DISTINCT ON () ?
SELECT DISTINCT ON (items.id) items.*
FROM items
JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY items.id,items.create_time DESC
LIMIT 5;

This gives me this plan: https://explain.depesz.com/s/QHr6 on 16.2 (Windows, i7-1260P)

Ordering by items.id changes the answer, though. In the example I gave,
items.id and items.create_time happened to be in the same order, but
that needn't hold. In reality I really do want the ID columns of the
*most recent* items.

You can see the difference if you build the test dataset a bit
differently:

INSERT INTO items(id, create_time)
SELECT i, now() - make_interval(secs => random() * 1e6)
FROM generate_series(1, 1000000) s(i);

We want the returned create_times to be all recent, and the IDs now
should look roughly random.

David G. Johnston

david.g.johnston@gmail.com

over 1 year ago

In reply to: Willow Chargin (#3)

Re: Functionally dependent columns in SELECT DISTINCT

On Friday, September 13, 2024, Willow Chargin <postgresql@wchargin.com>
wrote:

In reality I really do want the ID columns of the
*most recent* items.

Use a window function to rank them and pull out rank=1, or use a lateral
subquery to surgically (fetch first 1) retrieve the first row when sorted
by recency descending.

David J.

Willow Chargin

postgresql@wchargin.com

over 1 year ago

In reply to: David G. Johnston (#4)

Re: Functionally dependent columns in SELECT DISTINCT

Thanks both for your suggestions so far.

On Fri, Sep 13, 2024 at 8:43 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:

On Friday, September 13, 2024, Willow Chargin <postgresql@wchargin.com> wrote:

In reality I really do want the ID columns of the
*most recent* items.

Use a window function to rank them and pull out rank=1

Hmm, like this? noting that it's rank<=5, not rank=1:

-- 1. rank all item-part combinations, densely since an item may
have multiple parts
-- 2. limit by rank, still retaining multiple copies of each item
-- 3. de-duplicate IDs
SELECT DISTINCT id FROM (
SELECT id, dense_rank FROM (
SELECT
items.id,
dense_rank() OVER (ORDER BY create_time DESC)
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
) q
WHERE dense_rank <= 5
) q

I've done this before, but my experience is that it's usually far slower
because the rank is computed eagerly even for rows that don't match the
rank bound. And indeed here it takes 20% longer than even the slower
GROUP BY from before: https://explain.depesz.com/s/mQIi

or use a lateral subquery to surgically (fetch first 1) retrieve the first row when sorted by recency descending.

I'm not sure that I see how to apply this when I need top-k, not top-1.

David G. Johnston

david.g.johnston@gmail.com

over 1 year ago

In reply to: Willow Chargin (#5)

Re: Functionally dependent columns in SELECT DISTINCT

or use a lateral subquery to surgically (fetch first 1) retrieve the

first row when sorted by recency descending.

I'm not sure that I see how to apply this when I need top-k, not top-1.

Fetch first k

It's just a modern limit clause.

David J.