SELECT DISTINCT <constant> scans the table?

Started by Markus Demleitnerover 4 years ago3 messagesgeneral

Markus Demleitner

msdemlei@ari.uni-heidelberg.de

over 4 years ago

Dear list,

This feels like a FAQ, but neither the postgres docs nor web searches
got me a single step towards some sort of insight.

Maximally stripped down, my problem is that

select distinct 300 from <bigtable>

seqscans <bigtable> (at least in PostgreSQL 11.14). To me, it seems
obvious that this ought be be just one row containing 300 once
Postgres has established that <bigtable> is nonempty.

Why do things not work like this? Am I missing something major?

The reason I'm interested in this is of course a bit more involved.
I have a view that looks somewhat like this:

CREATE VIEW a_view AS (
SELECT 'abc' as coll, ...
FROM table1
UNION
SELECT 'def' as coll, ...
FROM table2
UNION
SELECT coll, ...
FROM table3
...)

and so on for perhaps 50 tables; where, as for table3 in this
example, the coll column is not simply a constant, there is an index
on the source column (these then are actually fast).

I'd now like to be able to say

SELECT DISTINCT coll FROM a_view

-- which takes forever once some of the tables involved are
sufficiently large.

I'd assume in an ideal world the query would be essentally
instantaneous -- merging the constants and doing a few index scans
for the tables that have non-constant coll. In reality, it's not.
The tables with constant coll are all completely scanned, either
sequentially or index-only.

Assuming this is the expected behaviour: Is there perhaps some trick
I could use to make postgres use the information that there are
constants in the select clauses?

Thanks,

Markus

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Markus Demleitner (#1)

Re: SELECT DISTINCT <constant> scans the table?

Markus Demleitner <msdemlei@ari.uni-heidelberg.de> writes:

Maximally stripped down, my problem is that

select distinct 300 from <bigtable>

seqscans <bigtable> (at least in PostgreSQL 11.14). To me, it seems
obvious that this ought be be just one row containing 300 once
Postgres has established that <bigtable> is nonempty.

Why do things not work like this? Am I missing something major?

That seems like the sort of optimization that we absolutely should
not spend cycles looking for. If it were a trivial change consuming
no detectable number of planning cycles, maybe it would be worth the
development and maintenance effort; though I'd be dubious about the
value. But the fact that it'd have to be transformed into something
testing whether the table is nonempty makes it fairly nontrivial.
I doubt it's worth the development cost plus the cycles imposed
on every other query.

regards, tom lane

Markus Demleitner

msdemlei@ari.uni-heidelberg.de

over 4 years ago

In reply to: Tom Lane (#2)

Re: SELECT DISTINCT <constant> scans the table?

Dear Tom,

On Thu, Dec 16, 2021 at 03:47:57PM -0500, Tom Lane wrote:

Markus Demleitner <msdemlei@ari.uni-heidelberg.de> writes:

Maximally stripped down, my problem is that

select distinct 300 from <bigtable>

seqscans <bigtable> (at least in PostgreSQL 11.14). To me, it seems
obvious that this ought be be just one row containing 300 once
Postgres has established that <bigtable> is nonempty.

That seems like the sort of optimization that we absolutely should
not spend cycles looking for. If it were a trivial change consuming
no detectable number of planning cycles, maybe it would be worth the
development and maintenance effort; though I'd be dubious about the
value. But the fact that it'd have to be transformed into something
testing whether the table is nonempty makes it fairly nontrivial.
I doubt it's worth the development cost plus the cycles imposed
on every other query.

I certainly understand that reasoning for this particular example.
However, in my actual use case, the one with the view consisting of a
large union containing constants from the original mail,

CREATE VIEW a_view AS (
SELECT 'abc' as coll, ...
FROM table1 -- with perhaps 1e6 rows
UNION
SELECT 'def' as coll, ...
FROM table2 -- with perhaps another 1e7 rows
UNION
SELECT coll, ... -- with an index on table3.coll
FROM table3
...)

being able to factor out constants would make a difference of
milliseconds versus a long time (~ a minute in my case, with about
1e8 total rows) when running SELECT DISTINCT coll FROM a_view.

Is there, perhaps, a non-obvious way to give the planner a nudge to
exploit the constant-ness of coll in table1 and table2?

Thanks,

Markus