get_actual_variable_range vs idx_scan/idx_tup_fetch

Started by Marko Tiikkajaabout 11 years ago25 messages
#1Marko Tiikkaja
marko@joh.to

Hi,

This week we had one of the most annoying problems I've ever encountered
with postgres. We had a big index on multiple columns, say, foo(a, b,
c). According to pg_stat_all_indexes the index was being used *all the
time*. However, after looking into our queries more closely, it turns
out that it was only being used to look up statistics for the foo.a
column to estimate merge scan viability during planning. But this took
hours for two people to track down.

So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning. Perhaps the easiest way would be adding a new
column or two into pg_stat_all_indexes, which we would increment in
get_actual_variable_range() when fetching data.

Any thoughts?

.marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#1)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

Marko Tiikkaja <marko@joh.to> writes:

This week we had one of the most annoying problems I've ever encountered
with postgres. We had a big index on multiple columns, say, foo(a, b,
c). According to pg_stat_all_indexes the index was being used *all the
time*. However, after looking into our queries more closely, it turns
out that it was only being used to look up statistics for the foo.a
column to estimate merge scan viability during planning. But this took
hours for two people to track down.

So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning.

Why? Used is used.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#2)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/17/14, 11:47 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

This week we had one of the most annoying problems I've ever encountered
with postgres. We had a big index on multiple columns, say, foo(a, b,
c). According to pg_stat_all_indexes the index was being used *all the
time*. However, after looking into our queries more closely, it turns
out that it was only being used to look up statistics for the foo.a
column to estimate merge scan viability during planning. But this took
hours for two people to track down.

So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning.

Why? Used is used.

Because I don't need a 30GB index on foo(a,b,c) to look up statistics.
If I ever have a problem, I can replace it with a 5GB one on foo(a).

.marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Marko Tiikkaja (#3)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/17/14, 4:49 PM, Marko Tiikkaja wrote:

On 10/17/14, 11:47 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

This week we had one of the most annoying problems I've ever encountered
with postgres. We had a big index on multiple columns, say, foo(a, b,
c). According to pg_stat_all_indexes the index was being used *all the
time*. However, after looking into our queries more closely, it turns
out that it was only being used to look up statistics for the foo.a
column to estimate merge scan viability during planning. But this took
hours for two people to track down.

So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning.

Why? Used is used.

Because I don't need a 30GB index on foo(a,b,c) to look up statistics. If I ever have a problem, I can replace it with a 5GB one on foo(a).

That problem can exist with user queries too. Perhaps it would be better to find a way to count scans that didn't use all the fields in the index.

I do also see value in differentiating planning use from real query processing; not doing that can certainly cause confusion. What I don't know is if the added stats bloat is worth it. If we do go down that road, I think it'd be better to add an indicator to EState. Aside from allowing stats for all planning access, it should make it less likely that someone adds a new access path and forgets to mark it as internal (especially if the added field defaults to an invalid value).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#3)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

Marko Tiikkaja <marko@joh.to> writes:

On 10/17/14, 11:47 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning.

Why? Used is used.

Because I don't need a 30GB index on foo(a,b,c) to look up statistics.
If I ever have a problem, I can replace it with a 5GB one on foo(a).

Well, the index might've been getting used in queries too in a way that
really only involved the first column. I think you're solving the wrong
problem here. The right problem is how to identify indexes that are
being used in a way that doesn't exploit all the columns. Which is not
necessarily wrong in itself --- what you'd want is to figure out when the
last column(s) are *never* used. The existing stats aren't terribly
helpful for that, I agree.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#5)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/17/14, 11:59 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

On 10/17/14, 11:47 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning.

Why? Used is used.

Because I don't need a 30GB index on foo(a,b,c) to look up statistics.
If I ever have a problem, I can replace it with a 5GB one on foo(a).

Well, the index might've been getting used in queries too in a way that
really only involved the first column. I think you're solving the wrong
problem here. The right problem is how to identify indexes that are
being used in a way that doesn't exploit all the columns.

I'm not sure I agree with that. Even if there was some information the
planner could have extracted out of the index by using all columns (thus
appearing "fully used" in these hypothetical new statistics), I still
would've wanted the index gone. But in this particular case, an index
on foo(a) alone was not selective enough and it would have been a bad
choice for practically every query, so I'm not sure what good those
statistics were in the first place.

I think there's a big difference between "this index was used to look up
stuff for planning" and "this index was used to answer queries quickly".
In my mind the first one belongs to the category "this index was
considered", and the latter is "this index was actually useful". But
maybe I'm not seeing the big picture here.

.marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#6)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

Marko Tiikkaja <marko@joh.to> writes:

On 10/17/14, 11:59 PM, Tom Lane wrote:

Well, the index might've been getting used in queries too in a way that
really only involved the first column. I think you're solving the wrong
problem here. The right problem is how to identify indexes that are
being used in a way that doesn't exploit all the columns.

I'm not sure I agree with that. Even if there was some information the
planner could have extracted out of the index by using all columns (thus
appearing "fully used" in these hypothetical new statistics), I still
would've wanted the index gone. But in this particular case, an index
on foo(a) alone was not selective enough and it would have been a bad
choice for practically every query, so I'm not sure what good those
statistics were in the first place.

Those stats were perfectly valid: what the planner is looking for is
accurate minimum and maximum values for the index's leading column, and
that's what it got. You're correct that a narrower index could have given
the same results with a smaller disk footprint, but the planner got the
results it needed from the index you provided for it to work with.

I think there's a big difference between "this index was used to look up
stuff for planning" and "this index was used to answer queries quickly".

I think that's utter nonsense. Even if there were any validity to the
position, it wouldn't be enough to justify doubling the stats footprint
in order to track system-driven accesses separately from query-driven
ones.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#7)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/18/14, 12:15 AM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

I think there's a big difference between "this index was used to look up
stuff for planning" and "this index was used to answer queries quickly".

I think that's utter nonsense.

Well you probably know a bit more about the optimizer than I do. But I
can't see a case where the stats provided by the index would be useful
for choosing between two (or more) plans that don't use the index in the
actual query. If you're saying that there are such cases, then clearly
I don't know something, and my thinking is in the wrong here.

.marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#7)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

On 10/17/14, 11:59 PM, Tom Lane wrote:

Well, the index might've been getting used in queries too in a way that
really only involved the first column. I think you're solving the wrong
problem here. The right problem is how to identify indexes that are
being used in a way that doesn't exploit all the columns.

I'm not sure I agree with that. Even if there was some information the
planner could have extracted out of the index by using all columns (thus
appearing "fully used" in these hypothetical new statistics), I still
would've wanted the index gone. But in this particular case, an index
on foo(a) alone was not selective enough and it would have been a bad
choice for practically every query, so I'm not sure what good those
statistics were in the first place.

Those stats were perfectly valid: what the planner is looking for is
accurate minimum and maximum values for the index's leading column, and
that's what it got. You're correct that a narrower index could have given
the same results with a smaller disk footprint, but the planner got the
results it needed from the index you provided for it to work with.

Uh, why is the optimizer looking at the index on a,b,c and not just the
stats on column a, for example? I am missing something here.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#9)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

Bruce Momjian <bruce@momjian.us> writes:

On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:

Those stats were perfectly valid: what the planner is looking for is
accurate minimum and maximum values for the index's leading column, and
that's what it got. You're correct that a narrower index could have given
the same results with a smaller disk footprint, but the planner got the
results it needed from the index you provided for it to work with.

Uh, why is the optimizer looking at the index on a,b,c and not just the
stats on column a, for example? I am missing something here.

Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#10)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

I wrote:

Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index ("first" in OID order), regardless of how many
lower-order columns that index has got. This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless. However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns? It would only take a
couple extra lines of code, and probably not that many added cycles
considering we're going to do an index access of some sort. But I'm
not sure if it's worth any extra effort --- I think in his example
case, there wasn't any narrower index anyway.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#10)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On Fri, Oct 17, 2014 at 11:03:04PM -0400, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:

Those stats were perfectly valid: what the planner is looking for is
accurate minimum and maximum values for the index's leading column, and
that's what it got. You're correct that a narrower index could have given
the same results with a smaller disk footprint, but the planner got the
results it needed from the index you provided for it to work with.

Uh, why is the optimizer looking at the index on a,b,c and not just the
stats on column a, for example? I am missing something here.

Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

Oh, I had forgotten we did that. It is confusing that there is no way
via EXPLAIN to see the access, making the method of consulting pg_stat_*
and using EXPLAIN unreliable. Should we document this somewhere?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#11)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/18/14, 5:16 AM, Tom Lane wrote:

BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index ("first" in OID order), regardless of how many
lower-order columns that index has got. This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless. However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns? It would only take a
couple extra lines of code, and probably not that many added cycles
considering we're going to do an index access of some sort. But I'm
not sure if it's worth any extra effort --- I think in his example
case, there wasn't any narrower index anyway.

Perhaps accidentally this would have helped in my case, actually, since
I could have created a new, smaller index CONCURRENTLY and then seen
that the usage of the other index stopped increasing. With the "pick
the smallest OID" behaviour that was not possible. Another idea had was
some way to tell the optimizer not to use that particular index for
stats lookups, but probably the use case for such a feature would be a
bit narrow.

All that said, I don't think my struggles justify the change you
described above. Not sure if it's a good idea or not.

.marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Bruce Momjian
bruce@momjian.us
In reply to: Marko Tiikkaja (#13)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:

On 10/18/14, 5:16 AM, Tom Lane wrote:

BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index ("first" in OID order), regardless of how many
lower-order columns that index has got. This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless. However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns? It would only take a
couple extra lines of code, and probably not that many added cycles
considering we're going to do an index access of some sort. But I'm
not sure if it's worth any extra effort --- I think in his example
case, there wasn't any narrower index anyway.

Perhaps accidentally this would have helped in my case, actually,
since I could have created a new, smaller index CONCURRENTLY and
then seen that the usage of the other index stopped increasing.
With the "pick the smallest OID" behaviour that was not possible.
Another idea had was some way to tell the optimizer not to use that
particular index for stats lookups, but probably the use case for
such a feature would be a bit narrow.

Well, if the index is there, why not use it? I thought the problem was
just that you had no visibility into how those statistics were being
accessed. Most people think EXPLAIN shows all accesses, but obviously
now it doesn't.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Marko Tiikkaja
marko@joh.to
In reply to: Bruce Momjian (#14)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/18/14, 4:33 PM, Bruce Momjian wrote:

On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:

Another idea had was some way to tell the optimizer not to use that
particular index for stats lookups, but probably the use case for
such a feature would be a bit narrow.

Well, if the index is there, why not use it? I thought the problem was
just that you had no visibility into how those statistics were being
accessed.

Yes, exactly; if I had had the option to disable the index from the
optimizer's point of view, I'd have seen that it's not used for looking
up any data by any queries, and thus I would have known that I can
safely drop it without slowing down queries. Which was the only thing I
cared about, and where the stats we provide failed me.

.marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Bruce Momjian
bruce@momjian.us
In reply to: Marko Tiikkaja (#15)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On Sat, Oct 18, 2014 at 04:38:37PM +0200, Marko Tiikkaja wrote:

On 10/18/14, 4:33 PM, Bruce Momjian wrote:

On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:

Another idea had was some way to tell the optimizer not to use that
particular index for stats lookups, but probably the use case for
such a feature would be a bit narrow.

Well, if the index is there, why not use it? I thought the problem was
just that you had no visibility into how those statistics were being
accessed.

Yes, exactly; if I had had the option to disable the index from the
optimizer's point of view, I'd have seen that it's not used for
looking up any data by any queries, and thus I would have known that
I can safely drop it without slowing down queries. Which was the
only thing I cared about, and where the stats we provide failed me.

How many other cases do we have where the statistics are getting
incremented and there is no user visibility into the operation?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#16)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

Bruce Momjian <bruce@momjian.us> writes:

How many other cases do we have where the statistics are getting
incremented and there is no user visibility into the operation?

* system catalog accesses
* vacuum/analyze/cluster/etc

The fact that system-initiated accesses get counted in the statistics
is a feature, not a bug.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#15)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

Marko Tiikkaja <marko@joh.to> writes:

On 10/18/14, 4:33 PM, Bruce Momjian wrote:

Well, if the index is there, why not use it? I thought the problem was
just that you had no visibility into how those statistics were being
accessed.

Yes, exactly; if I had had the option to disable the index from the
optimizer's point of view, I'd have seen that it's not used for looking
up any data by any queries, and thus I would have known that I can
safely drop it without slowing down queries. Which was the only thing I
cared about, and where the stats we provide failed me.

This argument is *utterly* wrongheaded, because it assumes that the
planner's use of the index provided no benefit to your queries. If the
planner was touching the index at all then it was planning queries in
which knowledge of the extremal value was relevant to accurate selectivity
estimation. So it's quite likely that without the index you'd have gotten
different and inferior plans, whether or not those plans actually chose to
use the index.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#18)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/18/14, 5:46 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

Yes, exactly; if I had had the option to disable the index from the
optimizer's point of view, I'd have seen that it's not used for looking
up any data by any queries, and thus I would have known that I can
safely drop it without slowing down queries. Which was the only thing I
cared about, and where the stats we provide failed me.

This argument is *utterly* wrongheaded, because it assumes that the
planner's use of the index provided no benefit to your queries. If the
planner was touching the index at all then it was planning queries in
which knowledge of the extremal value was relevant to accurate selectivity
estimation. So it's quite likely that without the index you'd have gotten
different and inferior plans, whether or not those plans actually chose to
use the index.

Maybe. But at the same time that's a big problem: there's no way of
knowing whether the index is actually useful or not when it's used only
by the query planner.

.marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Bruce Momjian
bruce@momjian.us
In reply to: Marko Tiikkaja (#19)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On Sat, Oct 18, 2014 at 06:15:03PM +0200, Marko Tiikkaja wrote:

On 10/18/14, 5:46 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

Yes, exactly; if I had had the option to disable the index from the
optimizer's point of view, I'd have seen that it's not used for looking
up any data by any queries, and thus I would have known that I can
safely drop it without slowing down queries. Which was the only thing I
cared about, and where the stats we provide failed me.

This argument is *utterly* wrongheaded, because it assumes that the
planner's use of the index provided no benefit to your queries. If the
planner was touching the index at all then it was planning queries in
which knowledge of the extremal value was relevant to accurate selectivity
estimation. So it's quite likely that without the index you'd have gotten
different and inferior plans, whether or not those plans actually chose to
use the index.

Maybe. But at the same time that's a big problem: there's no way of
knowing whether the index is actually useful or not when it's used
only by the query planner.

That is a good point. Without an index, the executor is going to do a
sequential scan, while a missing index to the optimizer just means worse
statistics.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#11)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/17/14, 10:16 PM, Tom Lane wrote:

I wrote:

Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index ("first" in OID order), regardless of how many
lower-order columns that index has got. This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless. However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns? It would only take a
couple extra lines of code, and probably not that many added cycles
considering we're going to do an index access of some sort. But I'm
not sure if it's worth any extra effort --- I think in his example
case, there wasn't any narrower index anyway.

The real cost here isn't the number of columns, it's the size of the index, no? So shouldn't we look at relpages instead? For example, you'd certainly want to use an index on (field_we_care_about, smallint_field) over an index on (field_we_care_about, big_honking_text_field).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#21)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

On 10/17/14, 10:16 PM, Tom Lane wrote:

BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index ("first" in OID order), regardless of how many
lower-order columns that index has got. This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless. However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns?

The real cost here isn't the number of columns, it's the size of the index, no? So shouldn't we look at relpages instead? For example, you'd certainly want to use an index on (field_we_care_about, smallint_field) over an index on (field_we_care_about, big_honking_text_field).

Yeah, perhaps. I'd been wondering about adding a tie-breaking rule,
but that's a much simpler way to think about it. OTOH, that approach
could result in some instability in the choice of index: if you've got
both (field_we_care_about, some_int_field) and (field_we_care_about,
some_other_int_field) then it might switch between choosing those two
indexes from day to day depending on basically-chance issues like when
page splits occur. That would probably annoy Marko even more than the
current behavior :-(, because it would scatter the planner's usage
across multiple indexes for no very good reason.

The coding I'd been imagining at first would basically break ties in
column count according to index OID order, so its choices would be stable
as long as you did not add/drop indexes. That seems like a good property
to try to preserve.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#22)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/19/14, 2:09 PM, Tom Lane wrote:

Yeah, perhaps. I'd been wondering about adding a tie-breaking rule,
but that's a much simpler way to think about it. OTOH, that approach
could result in some instability in the choice of index: if you've got
both (field_we_care_about, some_int_field) and (field_we_care_about,
some_other_int_field) then it might switch between choosing those two
indexes from day to day depending on basically-chance issues like when
page splits occur. That would probably annoy Marko even more than the
current behavior:-(, because it would scatter the planner's usage
across multiple indexes for no very good reason.

The coding I'd been imagining at first would basically break ties in
column count according to index OID order, so its choices would be stable
as long as you did not add/drop indexes. That seems like a good property
to try to preserve.

Maybe a good alternative is:

ORDER BY int( table.reltuples / index.relpages / BLKSZ ) DESC, oid

By comparing on average tuple size throwing away the fraction presumably we'd throw away noise from page splits too.

We'd want to use table.reltuples for consistency sake, though theoretically in this case I'd think it should be the same for indexes we care about...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#12)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On 10/18/14, 8:58 AM, Bruce Momjian wrote:

On Fri, Oct 17, 2014 at 11:03:04PM -0400, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:

Those stats were perfectly valid: what the planner is looking for is
accurate minimum and maximum values for the index's leading column, and
that's what it got. You're correct that a narrower index could have given
the same results with a smaller disk footprint, but the planner got the
results it needed from the index you provided for it to work with.

Uh, why is the optimizer looking at the index on a,b,c and not just the
stats on column a, for example? I am missing something here.

Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

Oh, I had forgotten we did that. It is confusing that there is no way
via EXPLAIN to see the access, making the method of consulting pg_stat_*
and using EXPLAIN unreliable. Should we document this somewhere?

I think we should. The common (mis)conception is that pg_stats shows *user-driven* access, not access because of stuff the system is doing.

This is actually a huge problem for anyone who's trying to figure out how useful indexes are; they see usage and thing they have queries that are using the index when in reality they don't.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#20)
1 attachment(s)
Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

On Sat, Oct 18, 2014 at 02:20:45PM -0400, Bruce Momjian wrote:

On Sat, Oct 18, 2014 at 06:15:03PM +0200, Marko Tiikkaja wrote:

On 10/18/14, 5:46 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

Yes, exactly; if I had had the option to disable the index from the
optimizer's point of view, I'd have seen that it's not used for looking
up any data by any queries, and thus I would have known that I can
safely drop it without slowing down queries. Which was the only thing I
cared about, and where the stats we provide failed me.

This argument is *utterly* wrongheaded, because it assumes that the
planner's use of the index provided no benefit to your queries. If the
planner was touching the index at all then it was planning queries in
which knowledge of the extremal value was relevant to accurate selectivity
estimation. So it's quite likely that without the index you'd have gotten
different and inferior plans, whether or not those plans actually chose to
use the index.

Maybe. But at the same time that's a big problem: there's no way of
knowing whether the index is actually useful or not when it's used
only by the query planner.

That is a good point. Without an index, the executor is going to do a
sequential scan, while a missing index to the optimizer just means worse
statistics.

I have applied the attached patch to document that the optimizer can
increase the index usage statistics.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

Attachments:

index.difftext/x-diff; charset=us-asciiDownload
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
new file mode 100644
index afcfb89..71d06ce
*** a/doc/src/sgml/monitoring.sgml
--- b/doc/src/sgml/monitoring.sgml
*************** postgres   27093  0.0  0.0  30096  2752
*** 1382,1389 ****
    </para>
  
    <para>
!    Indexes can be used via either simple index scans or <quote>bitmap</>
!    index scans.  In a bitmap scan
     the output of several indexes can be combined via AND or OR rules,
     so it is difficult to associate individual heap row fetches
     with specific indexes when a bitmap scan is used.  Therefore, a bitmap
--- 1382,1389 ----
    </para>
  
    <para>
!    Indexes can be used by simple index scans, <quote>bitmap</> index scans,
!    and the optimizer.  In a bitmap scan
     the output of several indexes can be combined via AND or OR rules,
     so it is difficult to associate individual heap row fetches
     with specific indexes when a bitmap scan is used.  Therefore, a bitmap
*************** postgres   27093  0.0  0.0  30096  2752
*** 1393,1398 ****
--- 1393,1401 ----
     <structname>pg_stat_all_tables</>.<structfield>idx_tup_fetch</>
     count for the table, but it does not affect
     <structname>pg_stat_all_indexes</>.<structfield>idx_tup_fetch</>.
+    The optimizer also accesses indexes to check for supplied constants
+    whose values are outside the recorded range of the optimizer statistics
+    because the optimizer statistics might be stale.
    </para>
  
    <note>