Query planner skipping index depending on DISTINCT parameter order

Started by Дилян Палаузовover 8 years ago13 messagesbugs
Jump to latest
#1Дилян Палаузов
dilyan.palauzov@aegee.org

Hello,

I have this database in Pg 9.6.5, the schema comes from spamassassin:

┌────────────┬─────────┬──────────────────────────────┐
│ Column │ Type │ Modifiers │
├────────────┼─────────┼──────────────────────────────┤
│ id │ integer │ not null default 0 │
│ token │ bytea │ not null default '\x'::bytea │
│ spam_count │ integer │ not null default 0 │
│ ham_count │ integer │ not null default 0 │
│ atime │ integer │ not null default 0 │
└────────────┴─────────┴──────────────────────────────┘
Indexes:
"bayes_token_pkey" PRIMARY KEY, btree (id, token)
"bayes_token_idx1" btree (token)
"the_index" btree (id, token)

with 261000 rows. According to my understanding, "SELECT DISTINCT ON(token, id) token FROM bayes_token;" and "SELECT DISTINCT ON(id, token) token FROM bayes_token;" are supposed to deliver the same results, as only the "token" and "id" columns are exchanged in DISTINCT, so the query optimizer is supposed to generate the same query plan. But it does not:

EXPLAIN ANALYZE SELECT DISTINCT ON(token, id) token FROM bayes_token;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique (cost=32624.56..34588.73 rows=261889 width=10) (actual time=154.947..264.439 rows=261935 loops=1) │
│ -> Sort (cost=32624.56..33279.28 rows=261889 width=10) (actual time=154.945..215.378 rows=261935 loops=1) │
│ Sort Key: token, id │
│ Sort Method: external merge Disk: 5624kB │
│ -> Seq Scan on bayes_token (cost=0.00..4579.89 rows=261889 width=10) (actual time=0.015..35.282 rows=261935 loops=1) │
│ Planning time: 0.085 ms │
│ Execution time: 285.303 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

EXPLAIN ANALYZE SELECT DISTINCT ON(id, token) token FROM bayes_token;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique (cost=0.42..13819.07 rows=262023 width=10) (actual time=0.028..207.695 rows=262018 loops=1) │
│ -> Index Only Scan using the_index on bayes_token (cost=0.42..12508.95 rows=262023 width=10) (actual time=0.026..147.882 rows=262018 loops=1) │
│ Heap Fetches: 261729 │
│ Planning time: 0.086 ms │
│ Execution time: 232.598 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The_index and the SELECTs are invented for the sake of this demonstration.

Regards
Diluan

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Дилян Палаузов
dpa-postgres@aegee.org
In reply to: Дилян Палаузов (#1)
Query planner skipping index depending on DISTINCT parameter order (2)

Post Scriptum to my email from yesterday.

I deleted the_index, so that I have now

spamassassin=> \d bayes_token
Table "public.bayes_token"
┌────────────┬─────────┬──────────────────────────────┐
│ Column │ Type │ Modifiers │
├────────────┼─────────┼──────────────────────────────┤
│ id │ integer │ not null default 0 │
│ token │ bytea │ not null default '\x'::bytea │
│ spam_count │ integer │ not null default 0 │
│ ham_count │ integer │ not null default 0 │
│ atime │ integer │ not null default 0 │
└────────────┴─────────┴──────────────────────────────┘
Indexes:
"bayes_token_pkey" PRIMARY KEY, btree (id, token)
"bayes_token_idx1" btree (token)

and as you can see, there is still a btree index on (id, token). I expect the same query plans, as before deleting the_index, as bayes_token_pkey has the same information as the_index had. This does not happen on my system and I have not tweaked the default configuration files.

spamassassin=> EXPLAIN ANALYZE SELECT DISTINCT(id, token) token FROM bayes_token;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique (cost=21737.73..22555.11 rows=163476 width=32) (actual time=914.275..1206.239 rows=165523 loops=1) │
│ -> Sort (cost=21737.73..22146.42 rows=163476 width=32) (actual time=914.274..1125.674 rows=165523 loops=1) │
│ Sort Key: (ROW(id, token)) │
│ Sort Method: external merge Disk: 6616kB │
│ -> Seq Scan on bayes_token (cost=0.00..3668.76 rows=163476 width=32) (actual time=0.015..50.849 rows=165523 loops=1) │
│ Planning time: 0.079 ms │
│ Execution time: 1216.047 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

What is the difference makes ROW between the one case here
Sort Key: (ROW(id, token))
and in the other below:
Sort Key: token, id

Regards
Dilyan

On 09/14/17 17:38, Дилян Палаузов wrote:

Hello,

I have this database in Pg 9.6.5, the schema comes from spamassassin:

┌────────────┬─────────┬──────────────────────────────┐
│ Column │ Type │ Modifiers │
├────────────┼─────────┼──────────────────────────────┤
│ id │ integer │ not null default 0 │
│ token │ bytea │ not null default '\x'::bytea │
│ spam_count │ integer │ not null default 0 │
│ ham_count │ integer │ not null default 0 │
│ atime │ integer │ not null default 0 │
└────────────┴─────────┴──────────────────────────────┘
Indexes:
"bayes_token_pkey" PRIMARY KEY, btree (id, token)
"bayes_token_idx1" btree (token)
"the_index" btree (id, token)

with 261000 rows. According to my understanding, "SELECT DISTINCT ON(token, id) token FROM bayes_token;" and "SELECT DISTINCT ON(id, token) token FROM bayes_token;" are supposed to deliver the same results, as only the "token" and "id" columns are exchanged in DISTINCT, so the query optimizer is supposed to generate the same query plan. But it does not:

EXPLAIN ANALYZE SELECT DISTINCT ON(token, id) token FROM bayes_token;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique (cost=32624.56..34588.73 rows=261889 width=10) (actual time=154.947..264.439 rows=261935 loops=1) │
│ -> Sort (cost=32624.56..33279.28 rows=261889 width=10) (actual time=154.945..215.378 rows=261935 loops=1) │
│ Sort Key: token, id │
│ Sort Method: external merge Disk: 5624kB │
│ -> Seq Scan on bayes_token (cost=0.00..4579.89 rows=261889 width=10) (actual time=0.015..35.282 rows=261935 loops=1) │
│ Planning time: 0.085 ms │
│ Execution time: 285.303 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

EXPLAIN ANALYZE SELECT DISTINCT ON(id, token) token FROM bayes_token;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique (cost=0.42..13819.07 rows=262023 width=10) (actual time=0.028..207.695 rows=262018 loops=1) │
│ -> Index Only Scan using the_index on bayes_token (cost=0.42..12508.95 rows=262023 width=10) (actual time=0.026..147.882 rows=262018 loops=1) │
│ Heap Fetches: 261729 │
│ Planning time: 0.086 ms │
│ Execution time: 232.598 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The_index and the SELECTs are invented for the sake of this demonstration.

Regards
Diluan

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Дилян Палаузов (#2)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

Hello Dilyan,

You're right - we're currently not able to use the index if it's not
consistent with the DISTINCT ordering. That is, if you have index on
(a,b) and DISTINCT ON (b,a) we fail to leverage the index.

The reason for this simple - if you look at create_distinct_paths [1]https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4725,
which is where the Unique path comes from, you'll see it iterates over
all paths and compares the ordering using pathkeys_is_contained [2]https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4811.

That however only ensures the path matches the expected Unique ordering
(determined by the column list in DISTINCT ON clause), we don't try to
re-shuffle the columns in any way at this point.

So this is more a missing optimization than a bug, I'd guess. But it
seems worthwhile and possibly not extremely difficult to implement, so I
may look into it - but that's PG11 at the earliest.

But, looking at the code in create_distinct_paths, ISTM you can easily
convince the planner to use the index by simply adding a matching ORDER
BY clause. That is

SELECT DISTINCT ON(token, id) token FROM bayes_token
ORDER BY id, token;

should be able to use the index on (id,token).

[1]: https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4725
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4725

[2]: https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4811
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4811

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Дилян Палаузов
dilyan.palauzov@aegee.org
In reply to: Tomas Vondra (#3)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

Hello Tomas,

thanks for your answer.

While adding an ORDER BY uses the index, it does not help, as the user
has to take care of the order of provided columns. Whether the care is
is taken in DISTINCT ON or in ORDER BY is secondary.

The behaviour behind DISTINCT and indexes surprises me, as the query
planner does reorder the columns for SELECT to determine the most
suitable index.

My proposal to reflect this:

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1069,7 +1069,10 @@ SELECT DISTINCT ON (location) location, time, report
     <para>
      Currently, <literal>FOR NO KEY UPDATE</>, <literal>FOR UPDATE</>,
      <literal>FOR SHARE</> and <literal>FOR KEY SHARE</> cannot be
-    specified with <literal>DISTINCT</literal>.
+    specified with <literal>DISTINCT</literal>.  Contrary to
+    <literal>SELECT</>, which reorders its parameters to find a best,
+    matching index, DISTINCT ON constructs an expression, e.g. from the
+    provided rows, and checks then if an index can serve the expression.
     </para>
    </refsect2>

Please ignore my email from Friday, I have forgotten an ON after DISTINCT.

Greetings
Dilyan

On 09/16/2017 10:23 AM, Tomas Vondra wrote:

Hello Dilyan,

You're right - we're currently not able to use the index if it's not
consistent with the DISTINCT ordering. That is, if you have index on
(a,b) and DISTINCT ON (b,a) we fail to leverage the index.

The reason for this simple - if you look at create_distinct_paths [1],
which is where the Unique path comes from, you'll see it iterates over
all paths and compares the ordering using pathkeys_is_contained [2].

That however only ensures the path matches the expected Unique ordering
(determined by the column list in DISTINCT ON clause), we don't try to
re-shuffle the columns in any way at this point.

So this is more a missing optimization than a bug, I'd guess. But it
seems worthwhile and possibly not extremely difficult to implement, so I
may look into it - but that's PG11 at the earliest.

But, looking at the code in create_distinct_paths, ISTM you can easily
convince the planner to use the index by simply adding a matching ORDER
BY clause. That is

SELECT DISTINCT ON(token, id) token FROM bayes_token
ORDER BY id, token;

should be able to use the index on (id,token).

[1]
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4725

[2]
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4811

regards

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Дилян Палаузов (#4)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

On 09/17/2017 02:30 AM, Dilyan Palauzov wrote:

Hello Tomas,

thanks for your answer.

While adding an ORDER BY uses the index, it does not help, as the
user has to take care of the order of provided columns. Whether the
care is is taken in DISTINCT ON or in ORDER BY is secondary.

The behaviour behind DISTINCT and indexes surprises me, as the
query planner does reorder the columns for SELECT to determine the
most suitable index.

Well, I agree it's somewhat reasonable optimization. The thing is, the
planner/optimizer does not start with all features on day 1, it gets
improved over time. And no one implemented this bit yet.

My proposal to reflect this:

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1069,7 +1069,10 @@ SELECT DISTINCT ON (location) location, time, report
    <para>
     Currently, <literal>FOR NO KEY UPDATE</>, <literal>FOR UPDATE</>,
     <literal>FOR SHARE</> and <literal>FOR KEY SHARE</> cannot be
-    specified with <literal>DISTINCT</literal>.
+    specified with <literal>DISTINCT</literal>.  Contrary to
+    <literal>SELECT</>, which reorders its parameters to find a best,
+    matching index, DISTINCT ON constructs an expression, e.g. from the
+    provided rows, and checks then if an index can serve the expression.
    </para>
   </refsect2>

I don't think we want to change the docs like this. Notice that the
SELECT documentation does not mention indexes at all, and I'm pretty
sure we don't want to start doing that. The docs are user-level, deal
only explaining properties of the output relation, and not with
implementation-level details like index optimizations.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#5)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On 09/17/2017 02:30 AM, Dilyan Palauzov wrote:

The behaviour behind DISTINCT and indexes surprises me, as the
query planner does reorder the columns for SELECT to determine the
most suitable index.

Well, I agree it's somewhat reasonable optimization. The thing is, the
planner/optimizer does not start with all features on day 1, it gets
improved over time. And no one implemented this bit yet.

For the DISTINCT ON case, the user-visible semantics are actually pretty
tightly tied to ORDER BY, so that it would not be very reasonable to
consider any other orderings than the given column order anyway.

For plain DISTINCT, yeah we could consider other orderings ... but
we're rather unlikely to find an index that matches all the output
columns, regardless of what order they're in. So it's just not that
exciting.

IOW, somebody might get around to this someday, but don't hold your
breath; there's lots of higher-value fruit to be reaching for.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#7Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

On 09/17/2017 07:15 PM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On 09/17/2017 02:30 AM, Dilyan Palauzov wrote:

The behaviour behind DISTINCT and indexes surprises me, as the
query planner does reorder the columns for SELECT to determine the
most suitable index.

Well, I agree it's somewhat reasonable optimization. The thing is, the
planner/optimizer does not start with all features on day 1, it gets
improved over time. And no one implemented this bit yet.

For the DISTINCT ON case, the user-visible semantics are actually pretty
tightly tied to ORDER BY, so that it would not be very reasonable to
consider any other orderings than the given column order anyway.

Tied in what sense? In the docs we explicitly say this:

https://www.postgresql.org/docs/10/static/sql-select.html#sql-distinct

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal. The
DISTINCT ON expressions are interpreted using the same rules as for
ORDER BY (see above). Note that the “first row” of each set is
unpredictable unless ORDER BY is used to ensure that the desired row
appears first.

which in my understanding is that while we use the same rules as ORDER
BY, we don't guarantee any particular ordering (i.e. which row we keep)
unless an explicit ORDER BY clause is used.

So if an ORDER BY is not specified, why couldn't we pick an arbitrary
ordering matching based on available indexes?

For plain DISTINCT, yeah we could consider other orderings ... but
we're rather unlikely to find an index that matches all the output
columns, regardless of what order they're in. So it's just not that
exciting.

Not necessarily. For example if we get the incremental sort in, we might
pick from a much wider set of indexes.

IOW, somebody might get around to this someday, but don't hold your
breath; there's lots of higher-value fruit to be reaching for.

Sure. But the perceived value really depends on the user - what's
worthless for one user may be quite valuable for another one.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#7)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On 09/17/2017 07:15 PM, Tom Lane wrote:

For the DISTINCT ON case, the user-visible semantics are actually pretty
tightly tied to ORDER BY, so that it would not be very reasonable to
consider any other orderings than the given column order anyway.

Tied in what sense? In the docs we explicitly say this:

https://www.postgresql.org/docs/10/static/sql-select.html#sql-distinct

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal. The
DISTINCT ON expressions are interpreted using the same rules as for
ORDER BY (see above). Note that the “first row” of each set is
unpredictable unless ORDER BY is used to ensure that the desired row
appears first.

Right, so the behavior is undefined unless you have an ORDER BY clause
that includes the DISTINCT ON columns plus some more columns. That's
pretty tightly tied in my book.

So if an ORDER BY is not specified, why couldn't we pick an arbitrary
ordering matching based on available indexes?

The case is not of any real-world use, and so I'm unwilling to expend
the large amount of coding effort that would be needed to make the
planner behave this way.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#8)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

On 09/19/2017 05:40 PM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On 09/17/2017 07:15 PM, Tom Lane wrote:

For the DISTINCT ON case, the user-visible semantics are actually pretty
tightly tied to ORDER BY, so that it would not be very reasonable to
consider any other orderings than the given column order anyway.

Tied in what sense? In the docs we explicitly say this:

https://www.postgresql.org/docs/10/static/sql-select.html#sql-distinct

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal. The
DISTINCT ON expressions are interpreted using the same rules as for
ORDER BY (see above). Note that the “first row” of each set is
unpredictable unless ORDER BY is used to ensure that the desired row
appears first.

Right, so the behavior is undefined unless you have an ORDER BY
clause that includes the DISTINCT ON columns plus some more columns.
That's pretty tightly tied in my book.

Ah, OK. I thought you're suggesting we're required to produce the data
sorted by the DISTINCT ON columns. But you meant that ORDER BY clause is
required for well-defined result, which limits our options when picking
an index. Right?

So if an ORDER BY is not specified, why couldn't we pick an
arbitrary ordering matching based on available indexes?

The case is not of any real-world use, and so I'm unwilling to
expend the large amount of coding effort that would be needed to make
the planner behave this way.

I don't think the "no real-world use" is so clear. Imagine for example a
denormalized table where the user knows that for a given ID, the other
columns match too. Then

SELECT DISTINCT ON (id), x, y, z FROM t;
SELECT DISTINCT ON (id), x, y, z FROM t ORDER BY id, x, y, z;
SELECT DISTINCT id, x, y, z, FROM t;

will all produce the same result, except that the first case only needs
compare values in "id". If sorts on "x", "y" and "z" are expensive (say,
because those are text columns with non-C collations), that may be quite
a difference.

That being said, I'm not sure how much code would this be, and how much
overhead would it mean.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#6)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

On 17 September 2017 at 18:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:

For plain DISTINCT, yeah we could consider other orderings ... but
we're rather unlikely to find an index that matches all the output
columns, regardless of what order they're in. So it's just not that
exciting.

I don't follow this part. Are you saying a simple "select distinct a,b
from table" is unlikely to find a matching index on <b,a>? Don't we
already do exactly this for "select a,b from table group by a,b"? I
would have expected the two equivalent SQL statements to use exactly
the same infrastructure and thought it was only a matter of historical
legacy that they didn't.

--
greg

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#11Дилян Палаузов
dilyan.palauzov@aegee.org
In reply to: Bruce Momjian (#10)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

Hello,

for the record, in this table

spamassassin=> \d bayes_token
Table "public.bayes_token"
Column Type Modifiers
────────── ─────── ────────────────────────────
id integer not null default 0
token bytea not null default '\x'::bytea
spam_count integer not null default 0
ham_count integer not null default 0
atime integer not null default 0
Indexes:
"bayes_token_pkey" PRIMARY KEY, btree (id, token)
"bayes_token_idx1" btree (token)

"SELECT id, token FROM bayes_token" and "SELECT id, token FROM bayes_token WHERE id > 0" do Seq Scan and need 35-50 Sec.

But "SELECT id, token FROM bayes_token WHERE id > 1" uses Index Only Scan and 0.04 seconds on pg 9.6.5 .

Greetings
Dilian

On 09/25/17 12:43, Greg Stark wrote:

On 17 September 2017 at 18:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:

For plain DISTINCT, yeah we could consider other orderings ... but
we're rather unlikely to find an index that matches all the output
columns, regardless of what order they're in. So it's just not that
exciting.

I don't follow this part. Are you saying a simple "select distinct a,b
from table" is unlikely to find a matching index on <b,a>? Don't we
already do exactly this for "select a,b from table group by a,b"? I
would have expected the two equivalent SQL statements to use exactly
the same infrastructure and thought it was only a matter of historical
legacy that they didn't.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Дилян Палаузов (#11)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

On 09/25/2017 08:10 PM, Дилян Палаузов wrote:

Hello,

for the record, in this table

spamassassin=> \d bayes_token
           Table "public.bayes_token"
  Column    Type            Modifiers
────────── ─────── ────────────────────────────
id         integer not null default 0
token      bytea   not null default '\x'::bytea
spam_count integer not null default 0
ham_count  integer not null default 0
atime      integer not null default 0
Indexes:
    "bayes_token_pkey" PRIMARY KEY, btree (id, token)
    "bayes_token_idx1" btree (token)

"SELECT id, token FROM bayes_token" and "SELECT id, token FROM
bayes_token WHERE id > 0" do Seq Scan and need 35-50 Sec.

But "SELECT id, token FROM bayes_token WHERE id > 1" uses Index Only
Scan and 0.04 seconds on pg 9.6.5 .

I don't quite see how this is related to the original topic in this
thread, which was discussing DISTINCT ON vs. indexes.

But more importantly, it's really difficult to give you answers without
you showing us EXPLAIN (and ideally EXPLAIN ANALYZE) for the queries.

FWIW, my guess is that there are many rows with (id > 0), and scanning
them using index only scan would be expensive. While there are only very
few rows with (id > 1), so the database uses IOS.

You can try pushing the database towards IOS by disabling sequential and
bitmap scans.

set enable_seqscan = off;
set enable_bitmapscan = off;

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bruce Momjian (#10)
Re: Query planner skipping index depending on DISTINCT parameter order (2)

On 09/25/2017 12:43 PM, Greg Stark wrote:

On 17 September 2017 at 18:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:

For plain DISTINCT, yeah we could consider other orderings ... but
we're rather unlikely to find an index that matches all the output
columns, regardless of what order they're in. So it's just not that
exciting.

I don't follow this part. Are you saying a simple "select distinct a,b
from table" is unlikely to find a matching index on <b,a>? Don't we
already do exactly this for "select a,b from table group by a,b"?

FWIW we don't do that:

create table t (a int, b int);
insert into t select mod(i, 1000), mod(i, 1000)
from generate_series(1,1000000) S(i);
create index on t (a,b);
vacuum analyze t;

explain (costs off) select a, b, count(*) from t group by a , b;
QUERY PLAN
--------------------------------------------
GroupAggregate
Group Key: a, b
-> Index Only Scan using t_a_b_idx on t
(3 rows)

explain (costs off) select a, b, count(*) from t group by b , a;
QUERY PLAN
---------------------------
GroupAggregate
Group Key: b, a
-> Sort
Sort Key: b, a
-> Seq Scan on t
(5 rows)

But yeah, fixing this is probably more worthwhile than the original
DISTINCT ON issue ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs