Stats for inheritance trees

Started by Tom Lanealmost 16 years ago23 messages
#1Tom Lane
Tom Lane
tgl@sss.pgh.pa.us

Following up on the discussion here
http://archives.postgresql.org/message-id/4B3875C6020000250002D97D@gw.wicourts.gov
I'd like to propose making the following changes that would allow saner
planning for queries involving inheritance:

1. Currently the primary key of pg_statistic is (starelid, staattnum)
indicating the table and column the stats entry is for. I propose
adding a bool stainherit to the pkey. "false" means the stats entry
is for just that table column, ie, the traditional interpretation.
"true" means the stats entry covers that column and all its inheritance
children. Such entries could be used directly by the planner in cases
where it currently punts and delivers a default estimate.

2. When ANALYZE is invoked on a table that has inheritance children,
it will perform its normal duties for just that table (creating or
updating entries with stainherit = false) and then perform a second
scan that covers that table and all its children. This will be used
to create or update entries with stainherit = true. It might be
possible to avoid scanning the parent table itself twice, but I won't
contort the code too much to avoid that, since in most practical
applications the parent is empty or small anyway.

3. Ideally autovacuum would know enough to perform ANALYZEs on
inheritance parents after enough churn has occurred in their child
table(s). I am not entirely clear about a good way to do that.
We could have it just directly force an ANALYZE on parent(s) of any
table it has chosen to ANALYZE, but that might be overkill --- in
particular leading to excess ANALYZEs when several children receive
a lot of updates.

Even without a really smart solution to #3, this would be a big step
forward for inheritance queries.

BTW, while at it I'm inclined to add a non-unique index on
pg_inherits.inhparent, so that find_inheritance_children won't have to
seqscan pg_inherits anymore. It's surprising people haven't complained
about that before. The code says

* XXX might be a good idea to create an index on pg_inherits' inhparent
* field, so that we can use an indexscan instead of sequential scan here.
* However, in typical databases pg_inherits won't have enough entries to
* justify an indexscan...

but we've long since learned that people stress databases in odd ways.

Comments?

regards, tom lane

#2Simon Riggs
Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#1)
Re: Stats for inheritance trees

On Mon, 2009-12-28 at 17:41 -0500, Tom Lane wrote:

Following up on the discussion here
http://archives.postgresql.org/message-id/4B3875C6020000250002D97D@gw.wicourts.gov
I'd like to propose making the following changes that would allow saner
planning for queries involving inheritance:

Sounds good.

1. Currently the primary key of pg_statistic is (starelid, staattnum)
indicating the table and column the stats entry is for. I propose
adding a bool stainherit to the pkey. "false" means the stats entry
is for just that table column, ie, the traditional interpretation.
"true" means the stats entry covers that column and all its inheritance
children. Such entries could be used directly by the planner in cases
where it currently punts and delivers a default estimate.

2. When ANALYZE is invoked on a table that has inheritance children,
it will perform its normal duties for just that table (creating or
updating entries with stainherit = false) and then perform a second
scan that covers that table and all its children. This will be used
to create or update entries with stainherit = true. It might be
possible to avoid scanning the parent table itself twice, but I won't
contort the code too much to avoid that, since in most practical
applications the parent is empty or small anyway.

Will there be logic to decide how stainherit should be set? Many columns
in an inherited set have similar values in different children, e.g.
OrderValue, OrderStatus but many columns also have very different values
in different children. e.g. OrderId, OrderPlacementDate,
OrderFulfillmentDate

3. Ideally autovacuum would know enough to perform ANALYZEs on
inheritance parents after enough churn has occurred in their child
table(s). I am not entirely clear about a good way to do that.
We could have it just directly force an ANALYZE on parent(s) of any
table it has chosen to ANALYZE, but that might be overkill --- in
particular leading to excess ANALYZEs when several children receive
a lot of updates.

Even without a really smart solution to #3, this would be a big step
forward for inheritance queries.

BTW, while at it I'm inclined to add a non-unique index on
pg_inherits.inhparent, so that find_inheritance_children won't have to
seqscan pg_inherits anymore. It's surprising people haven't complained
about that before.

They have, we just haven't done anything about it.

The code says

* XXX might be a good idea to create an index on pg_inherits' inhparent
* field, so that we can use an indexscan instead of sequential scan here.
* However, in typical databases pg_inherits won't have enough entries to
* justify an indexscan...

but we've long since learned that people stress databases in odd ways.

--
Simon Riggs www.2ndQuadrant.com

#3Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#2)
Re: Stats for inheritance trees

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, 2009-12-28 at 17:41 -0500, Tom Lane wrote:

2. When ANALYZE is invoked on a table that has inheritance children,
it will perform its normal duties for just that table (creating or
updating entries with stainherit = false) and then perform a second
scan that covers that table and all its children. This will be used
to create or update entries with stainherit = true. It might be
possible to avoid scanning the parent table itself twice, but I won't
contort the code too much to avoid that, since in most practical
applications the parent is empty or small anyway.

Will there be logic to decide how stainherit should be set? Many columns
in an inherited set have similar values in different children, e.g.
OrderValue, OrderStatus but many columns also have very different values
in different children. e.g. OrderId, OrderPlacementDate,
OrderFulfillmentDate

I don't see that that's relevant here. We're trying to estimate the
overall result of a join against an inheritance tree. The fact that
some or even all of the matching rows might come from specific child
tables is not relevant at this level of detail. When it is relevant,
we'll be looking at the child-table stats (and constraints), not at
the parent's.

regards, tom lane

#4Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Stats for inheritance trees

I wrote:

Following up on the discussion here
http://archives.postgresql.org/message-id/4B3875C6020000250002D97D@gw.wicourts.gov
I'd like to propose making the following changes that would allow saner
planning for queries involving inheritance:

I've committed the easy aspects of this (still need to work on autovacuum).
I ran into one unexpected loose end: what shall we do with ALTER TABLE
SET STATISTICS DISTINCT?

As committed, any manually set value of attdistinct will be applied to
both the relation-local and inherited stats for the column. From a
logical standpoint this is clearly the Wrong Thing, because it's quite
possible that different values would be needed. On the other hand,
I'm not sure how much it matters in practice. In the typical cases
where you need to force a value, you're probably going to force a
fractional value, and those would be likely be OK for both.

The only "real" fix I could see would be to create an additional
pg_attribute column and a separate command to set it. But it really
doesn't seem worth that much trouble.

Or we could think about some heuristics, like applying the manual
value to inherited stats only if it's fractional (negative).
But I'm afraid any rule like that would get in the way as often
as it would help.

Thoughts?

regards, tom lane

#5Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Stats for inheritance trees

I wrote:

3. Ideally autovacuum would know enough to perform ANALYZEs on
inheritance parents after enough churn has occurred in their child
table(s). I am not entirely clear about a good way to do that.
We could have it just directly force an ANALYZE on parent(s) of any
table it has chosen to ANALYZE, but that might be overkill --- in
particular leading to excess ANALYZEs when several children receive
a lot of updates.

I've been looking at this for a bit, and I think the only reasonable
way to do it is to make the pgstats mechanism track the need for
ANALYZE on a parent table. A hack like I suggested above would make
the autovacuum.c code even messier than it already is, and it seems
inevitable that we'd get duplicate analyze actions from different
autovac workers.

Now, I don't really want to add Yet Another per-table counter to pgstats
for this. The stats are big enough already. However, the existing
mechanism for triggering ANALYZE looks pretty bogus to me as I look at
it now: there's a last_anl_tuples value with a very hazy definition,
and what's worse it's being computed off numbers that may be only crude
estimates from ANALYZE. What I propose doing is to replace that counter
with a "changes_since_analyze" counter, which can be managed very
simply:

* when a tabstat message comes in, increment changes_since_analyze by
the sum of t_tuples_inserted + t_tuples_updated + t_tuples_deleted;

* when an analyze report message comes in, reset changes_since_analyze
to zero.

This gives us a number that is actually pretty credible and can still
be compared to the analyze threshold the same as before. I think the
current definition dates from before we had accurate
insert/delete/update tracking, but now that we have that, we should
use it.

Now, having done that, what I would suggest doing is having autovacuum
propagate the changes_since_analyze count that it sees up to the parent
table(s) whenever it does an autoanalyze. (This would require adding a
new message type that allows reporting a changes_since_analyze increment
independently of inserted/updated/deleted, or else adding
changes_since_analyze as an independent field in regular tabstat
messages.)

In most cases, with the parent table probably smaller than the child
tables, this would immediately make the parent a candidate for analyze.
That might be overkill, in which case we could try multiplying the count
by some sort of derating factor, but getting hold of a good derating
factor might be more expensive than it's worth --- I think you'd have to
look at all the other children of the same parent to see how big the
current one is compared to the rest.

Thoughts?

regards, tom lane

#6Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: Stats for inheritance trees

On Tue, Dec 29, 2009 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Following up on the discussion here
http://archives.postgresql.org/message-id/4B3875C6020000250002D97D@gw.wicourts.gov
I'd like to propose making the following changes that would allow saner
planning for queries involving inheritance:

I've committed the easy aspects of this (still need to work on autovacuum).
I ran into one unexpected loose end: what shall we do with ALTER TABLE
SET STATISTICS DISTINCT?

As committed, any manually set value of attdistinct will be applied to
both the relation-local and inherited stats for the column.  From a
logical standpoint this is clearly the Wrong Thing, because it's quite
possible that different values would be needed.  On the other hand,
I'm not sure how much it matters in practice.  In the typical cases
where you need to force a value, you're probably going to force a
fractional value, and those would be likely be OK for both.

The only "real" fix I could see would be to create an additional
pg_attribute column and a separate command to set it.  But it really
doesn't seem worth that much trouble.

Or we could think about some heuristics, like applying the manual
value to inherited stats only if it's fractional (negative).
But I'm afraid any rule like that would get in the way as often
as it would help.

Having separate properties for regular attdistinct and inherited
attdistinct seems fairly worthwhile, but I share your lack of
enthusiasm for solving the problem by adding more columns to
pg_attribute. One possibility would be to create a new system catalog
to hold "non-critical" information on pg_attribute properties - that
is, anything that isn't likely to be needed to plan and execute
ordinary queries. attstattarget and attdistinct would certainly
qualify, and there may be others. This would avoid bloating
pg_attribute with things that frequently won't be needed, and as a
side benefit non-bootstrap catalogs are less of a PITA to modify.

A second possibility would be to generalize the concept of reloptions
to apply to columns. Per previous discussion, my per-tablespace
random_page_cost/seq_page_cost patch will generalize reloptions to
apply to tablespaces as well, under the name spcoptions. We could add
attoptions as well, reusing most of the same code and potentially
allowing us to support future options with less recoding. I rather
like this option; it seems like a good fit for any sort of knob that
we want to make available, but don't expect to be used frequently. (I
think, however, that if we're going to do this, I should go ahead and
commit my tablespace patch first, to avoid needless rebase hell.)

These two options aren't completely mutually exclusive; we could
decide that it makes sense to do both, though off the top of my head
it doesn't seem worth the trouble.

...Robert

#7Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: Stats for inheritance trees

Robert Haas <robertmhaas@gmail.com> writes:

Having separate properties for regular attdistinct and inherited
attdistinct seems fairly worthwhile, but I share your lack of
enthusiasm for solving the problem by adding more columns to
pg_attribute. One possibility would be to create a new system catalog
to hold "non-critical" information on pg_attribute properties - that
is, anything that isn't likely to be needed to plan and execute
ordinary queries. attstattarget and attdistinct would certainly
qualify, and there may be others.

Hmm... offhand it seems like all of these columns would be potential
candidates for pushing out to a secondary catalog:

attstattarget | integer | not null
attdistinct | real | not null
attndims | integer | not null
attislocal | boolean | not null
attinhcount | integer | not null
attacl | aclitem[] |

But even with another ndistinct column, that barely amounts to 2 dozen
bytes saved. Maybe it's worth the trouble in order to shave space from
tuple descriptors, but it seems pretty marginal.

(Actually, it looks to me like we could lose attndims altogether with
little pain ...)

A second possibility would be to generalize the concept of reloptions
to apply to columns.

Hm ... the base assumption in the reloptions code is that the user is
free to twiddle all the values. attdistinct and attstattarget might fit
that bill but none of the other stuff would, so we couldn't go very far
in terms of pushing things out of the core attributes. Still there's
some attraction in not having to alter pg_attribute the next time we
think of something like these.

regards, tom lane

#8Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: Stats for inheritance trees

On Tue, Dec 29, 2009 at 8:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Having separate properties for regular attdistinct and inherited
attdistinct seems fairly worthwhile, but I share your lack of
enthusiasm for solving the problem by adding more columns to
pg_attribute.  One possibility would be to create a new system catalog
to hold "non-critical" information on pg_attribute properties - that
is, anything that isn't likely to be needed to plan and execute
ordinary queries.  attstattarget and attdistinct would certainly
qualify, and there may be others.

Hmm... offhand it seems like all of these columns would be potential
candidates for pushing out to a secondary catalog:

 attstattarget | integer   | not null
 attdistinct   | real      | not null
 attndims      | integer   | not null
 attislocal    | boolean   | not null
 attinhcount   | integer   | not null
 attacl        | aclitem[] |

But even with another ndistinct column, that barely amounts to 2 dozen
bytes saved.  Maybe it's worth the trouble in order to shave space from
tuple descriptors, but it seems pretty marginal.

Maybe. I would think that attacl[] would need to be consulted
frequently enough that you'd want to have it around, but maybe not.

(Actually, it looks to me like we could lose attndims altogether with
little pain ...)

A second possibility would be to generalize the concept of reloptions
to apply to columns.

Hm ... the base assumption in the reloptions code is that the user is
free to twiddle all the values.  attdistinct and attstattarget might fit
that bill but none of the other stuff would, so we couldn't go very far
in terms of pushing things out of the core attributes.  Still there's
some attraction in not having to alter pg_attribute the next time we
think of something like these.

Yep. It would also lower the barrier to future innovations of that
type, which would be a good thing, IMO. Unfortunately we'd likely
need to continue to support the existing syntax at least for
attstattarget, which is kind of a bummer, but seems managable. I
think we could throw over the syntax for ALTER TABLE ... ADD
STATISTICS DISTINCT since it is an 8.5-ism.

...Robert

#9Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: Stats for inheritance trees

Robert Haas <robertmhaas@gmail.com> writes:

Yep. It would also lower the barrier to future innovations of that
type, which would be a good thing, IMO. Unfortunately we'd likely
need to continue to support the existing syntax at least for
attstattarget, which is kind of a bummer, but seems managable. I
think we could throw over the syntax for ALTER TABLE ... ADD
STATISTICS DISTINCT since it is an 8.5-ism.

Yeah --- if we think we might want to do this, now is the time,
before we're stuck with supporting that syntax. (I was thinking
earlier today that attdistinct was already in 8.4, but it's not.)

regards, tom lane

#10Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
Re: Stats for inheritance trees

On Tue, Dec 29, 2009 at 9:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Yep.  It would also lower the barrier to future innovations of that
type, which would be a good thing, IMO.  Unfortunately we'd likely
need to continue to support the existing syntax at least for
attstattarget, which is kind of a bummer, but seems managable.  I
think we could throw over the syntax for ALTER TABLE ... ADD
STATISTICS DISTINCT since it is an 8.5-ism.

Yeah --- if we think we might want to do this, now is the time,
before we're stuck with supporting that syntax.  (I was thinking
earlier today that attdistinct was already in 8.4, but it's not.)

If you'd be willing to look over the latest version of my
per-tablespace random_page_cost/seq_page_cost patch, which I posted to
-hackers some time in the last few days, I can get that committed and
then start working on this, if you'd like.

...Robert

#11Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: Stats for inheritance trees

Robert Haas <robertmhaas@gmail.com> writes:

If you'd be willing to look over the latest version of my
per-tablespace random_page_cost/seq_page_cost patch, which I posted to
-hackers some time in the last few days, I can get that committed and
then start working on this, if you'd like.

I think Alvaro would actually be the right person to review that,
since the reloptions code is almost entirely his work.

regards, tom lane

#12Alvaro Herrera
Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#11)
Re: Stats for inheritance trees

Tom Lane escribió:

Robert Haas <robertmhaas@gmail.com> writes:

If you'd be willing to look over the latest version of my
per-tablespace random_page_cost/seq_page_cost patch, which I posted to
-hackers some time in the last few days, I can get that committed and
then start working on this, if you'd like.

I think Alvaro would actually be the right person to review that,
since the reloptions code is almost entirely his work.

I can't promise anything right now though, as my wife could get with
labour very soon ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#13Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#12)
Re: Stats for inheritance trees

On Wed, Dec 30, 2009 at 10:24 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Tom Lane escribió:

Robert Haas <robertmhaas@gmail.com> writes:

If you'd be willing to look over the latest version of my
per-tablespace random_page_cost/seq_page_cost patch, which I posted to
-hackers some time in the last few days, I can get that committed and
then start working on this, if you'd like.

I think Alvaro would actually be the right person to review that,
since the reloptions code is almost entirely his work.

I can't promise anything right now though, as my wife could get with
labour very soon ...

In terms of reasons for not being able to guarantee anything, I'd have
to say that's one of the best I've heard.

In all honesty, I'm not very worried about the reloptions stuff
proper. I have copied the existing coding pattern so closely that
it's a little hard to imagine that I've broken anything too badly. My
main concerns are:

1. Am I leaking memory anywhere?
and
2. Can anything bad happen as a result of invalidation events and/or
concurrent updates to pg_tablespace?

If anyone feels qualified to check my work on those two points, that
would be great. In reality, even if I've done something relatively
stupid, it isn't likely to have much practical impact since
pg_tablespace updates figure to be infrequent and many people won't
use this feature at all. But I'd still rather not do something
stupid.

Thanks,

...Robert

#14Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
Re: Stats for inheritance trees

On Tue, Dec 29, 2009 at 9:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Yep.  It would also lower the barrier to future innovations of that
type, which would be a good thing, IMO.  Unfortunately we'd likely
need to continue to support the existing syntax at least for
attstattarget, which is kind of a bummer, but seems managable.  I
think we could throw over the syntax for ALTER TABLE ... ADD
STATISTICS DISTINCT since it is an 8.5-ism.

Yeah --- if we think we might want to do this, now is the time,
before we're stuck with supporting that syntax.  (I was thinking
earlier today that attdistinct was already in 8.4, but it's not.)

I am just starting to look at this now. One of the questions I have
is what we should call the options. We could call the regular options
something like "ndistinct" or "distinct", but I'm not too sure what to
call the "for-inheritance-trees" version of that. I suppose we could
just use the familiar "inh" prefix and call it "inhndistinct", but
that looks suspiciously like gobbledygook. Someone's understanding of
just what that is supposed to mean might be a little... indistinct (ba
dum).

Another option would be to call it "inherits_ndistinct", or something
like that, which seems a little more readable, but not great: I don't
think there's going to be any getting around the need to RTFM before
setting these parameters.

In terms of syntax, I'm thinking something like:

ALTER TABLE name ALTER COLUMN column SET ( column_parameter = value [, ...] )

I am also very tempted before beginning this work to rename
reloptions.c to options.c or genoptions.c or somesuch. If we're going
to use it for relations, attributes, and tablespaces, chances are good
we're going to use it for other things, too. The FDW stuff is already
borrowing from it as well.

...Robert

#15Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#14)
Re: Stats for inheritance trees

Robert Haas <robertmhaas@gmail.com> writes:

Another option would be to call it "inherits_ndistinct", or something
like that, which seems a little more readable, but not great: I don't
think there's going to be any getting around the need to RTFM before
setting these parameters.

Well, the previously agreed-to syntax was SET STATISTICS DISTINCT.
I don't see a problem with using "distinct" and "inherited_distinct"
or something like that. "ndistinct" is probably not a good name to
expose to non-programmers.

The must-RTFM argument is fairly weak, though, since these are knobs
that only advanced users would twiddle anyway.

regards, tom lane

#16Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: Stats for inheritance trees

On Tue, Jan 5, 2010 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Another option would be to call it "inherits_ndistinct", or something
like that, which seems a little more readable, but not great: I don't
think there's going to be any getting around the need to RTFM before
setting these parameters.

Well, the previously agreed-to syntax was SET STATISTICS DISTINCT.
I don't see a problem with using "distinct" and "inherited_distinct"
or something like that.  "ndistinct" is probably not a good name to
expose to non-programmers.

I like ndistinct because it makes the whole thing sound related to
statistics, which it is. But I'll do it your way unless there are
other votes.

...Robert

#17Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#16)
Re: Stats for inheritance trees

On Tue, Jan 5, 2010 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 5, 2010 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Another option would be to call it "inherits_ndistinct", or something
like that, which seems a little more readable, but not great: I don't
think there's going to be any getting around the need to RTFM before
setting these parameters.

Well, the previously agreed-to syntax was SET STATISTICS DISTINCT.
I don't see a problem with using "distinct" and "inherited_distinct"
or something like that.  "ndistinct" is probably not a good name to
expose to non-programmers.

I like ndistinct because it makes the whole thing sound related to
statistics, which it is.  But I'll do it your way unless there are
other votes.

It's probably also worth noting that the reason I used DISTINCT
originally is because it's already a keyword. That's a moot point
here. But as I say I'll stick with your names unless there are
contravening votes.

...Robert

#18Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#17)
Re: Stats for inheritance trees

Robert Haas <robertmhaas@gmail.com> writes:

It's probably also worth noting that the reason I used DISTINCT
originally is because it's already a keyword.

True.

It occurs to me that the pg_stats view already exposes "n_distinct"
as a column name. I wouldn't object to using "n_distinct" and
"n_distinct_inherited" or some such.

regards, tom lane

#19Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#18)
Re: Stats for inheritance trees

On Tue, Jan 5, 2010 at 1:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

It's probably also worth noting that the reason I used DISTINCT
originally is because it's already a keyword.

True.

It occurs to me that the pg_stats view already exposes "n_distinct"
as a column name.  I wouldn't object to using "n_distinct" and
"n_distinct_inherited" or some such.

OK. So we have:

1. distinct and inherited_distinct, or
2. n_distinct and n_distinct_inherited

Any other votes/thoughts/opinions/color commentary?

...Robert

#20Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
1 attachment(s)
Re: Stats for inheritance trees

On Tue, Dec 29, 2009 at 9:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Yep.  It would also lower the barrier to future innovations of that
type, which would be a good thing, IMO.  Unfortunately we'd likely
need to continue to support the existing syntax at least for
attstattarget, which is kind of a bummer, but seems managable.  I
think we could throw over the syntax for ALTER TABLE ... ADD
STATISTICS DISTINCT since it is an 8.5-ism.

Yeah --- if we think we might want to do this, now is the time,
before we're stuck with supporting that syntax.  (I was thinking
earlier today that attdistinct was already in 8.4, but it's not.)

[ Hack, hack, hack. ]

I'm not quite sure what the correct approach is to making attoptions
available to examine_attribute(), which can't get at them in any very
obvious way because the relation descriptor it gets passed as an
argument only includes the fixed-size portion of each pg_attribute
tuple. The obvious approaches are:

1. Extract attrelid and attnum from the tuple and issue a syscache
lookup to get the full tuple.
2. Make RelationBuildTupleDesc() parse the attoptions and store them
into a new RelationData member.

I'm leaning toward the latter method. The upside of that method is
that making attoptions part of the Relation descriptor means that
they'll be conveniently available to all clients of the relcache. The
downside is that right now, only ANALYZE actually needs to care at the
moment, and yet we're incurring the overhead across the board. My
thought is that that's OK, but I wonder if anyone thinks it might
cause a measurable performance hit?

WIP patch attached. Right now this just adds the ability to set and
store attoptions, but doesn't actually do anything useful with them.
No doc changes, no pg_dump support, no psql support, doesn't remove
the SET STATISTICS DISTINCT code. All these warts will be fixed in a
future version once I decide what to do about the problem mentioned
above.

...Robert

Attachments:

attoptions-v1.patchtext/x-patch; charset=US-ASCII; name=attoptions-v1.patch
#21decibel
decibel
decibel@decibel.org
In reply to: Tom Lane (#5)
Re: Stats for inheritance trees

On Dec 29, 2009, at 6:29 PM, Tom Lane wrote:

* when a tabstat message comes in, increment changes_since_analyze by
the sum of t_tuples_inserted + t_tuples_updated + t_tuples_deleted;

* when an analyze report message comes in, reset changes_since_analyze
to zero.

If that's being added, could we extend the concept to also keep a reltuples_delta column (name suggestions welcome!) that is = reltuples_delta + t_tuples_inserted - t_tuples_deleted, and then set reltuples_delta back to 0 after an analyze (or anything else that would reset reltuples)? That means you could use reltuples + reltuples_delta as a fairly accurate row count.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#22Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: decibel (#21)
Re: Stats for inheritance trees

decibel <decibel@decibel.org> writes:

On Dec 29, 2009, at 6:29 PM, Tom Lane wrote:

* when a tabstat message comes in, increment changes_since_analyze by
the sum of t_tuples_inserted + t_tuples_updated + t_tuples_deleted;

* when an analyze report message comes in, reset changes_since_analyze
to zero.

If that's being added, could we extend the concept to also keep a reltuples_delta column (name suggestions welcome!) that is = reltuples_delta + t_tuples_inserted - t_tuples_deleted, and then set reltuples_delta back to 0 after an analyze (or anything else that would reset reltuples)? That means you could use reltuples + reltuples_delta as a fairly accurate row count.

We already have a fairly accurate row count.

regards, tom lane

#23Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#20)
Re: Stats for inheritance trees

On Tue, Jan 5, 2010 at 4:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 29, 2009 at 9:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Yep.  It would also lower the barrier to future innovations of that
type, which would be a good thing, IMO.  Unfortunately we'd likely
need to continue to support the existing syntax at least for
attstattarget, which is kind of a bummer, but seems managable.  I
think we could throw over the syntax for ALTER TABLE ... ADD
STATISTICS DISTINCT since it is an 8.5-ism.

Yeah --- if we think we might want to do this, now is the time,
before we're stuck with supporting that syntax.  (I was thinking
earlier today that attdistinct was already in 8.4, but it's not.)

[ Hack, hack, hack. ]

I'm not quite sure what the correct approach is to making attoptions
available to examine_attribute(), which can't get at them in any very
obvious way because the relation descriptor it gets passed as an
argument only includes the fixed-size portion of each pg_attribute
tuple.  The obvious approaches are:

1. Extract attrelid and attnum from the tuple and issue a syscache
lookup to get the full tuple.
2. Make RelationBuildTupleDesc() parse the attoptions and store them
into a new RelationData member.

I'm leaning toward the latter method.

Upon further study, this does not look easy at all. There are complex
assumptions embedded in this code about what aspects of an index can
change on the fly and how those changes must be handled. Trying to
modify this seems likely to lead to (1) a further increase in code
complexity and (2) breakage.

I'm thinking maybe the best approach here is to store this information
in a separate cache indexed by <attrelid, attnum>.

...Robert