Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

Started by Zeugswetter Andreas IZ5 · over 26 years ago · 20 messages
#1Zeugswetter Andreas IZ5
Andreas.Zeugswetter@telecom.at

So, the selectivity that a search for the most common value would
have is a reasonable estimate for the selectivity of a search for any
value. That's a bogus assumption in this case --- but it's hard to
justify making any other assumption in general.

Other db's usually use the value count(*) / nunique for the lightweight
statistics. This makes the assumption that the distinct index values are
evenly distributed. That is on average a correct assumption, whereas our
assumption on average overestimates the number of rows returned. I am not
sure we have nunique info, though.
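
For illustration, a minimal standalone C sketch of that even-distribution
estimate (the function and names are mine, not backend code):

    /* Even-distribution estimate: with ntuples rows and nunique distinct
     * values, "= anything" is assumed to match ntuples/nunique rows,
     * i.e. a selectivity of 1/nunique. */
    static double
    eqsel_even(double ntuples, double nunique)
    {
        if (nunique < 1.0 || ntuples < 1.0)
            return 0.0;             /* no data to match */
        return 1.0 / nunique;       /* fraction of the table returned */
    }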

Andreas

#2Zeugswetter Andreas IZ5
Andreas.Zeugswetter@telecom.at
In reply to: Zeugswetter Andreas IZ5 (#1)

Yes, I think we index nulls. What are partial indexes?

A create index statement that accepts a where condition. All
rows that satisfy the where condition are indexed, others not.
This needs intelligence in the optimizer.

This was in postgresql code some time ago, but was removed
for some reason I don't remember.

Example: create index ax0 on a (id) where id is not null;

Andreas

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas IZ5 (#2)

Zeugswetter Andreas IZ5 <Andreas.Zeugswetter@telecom.at> writes:

Yes, I think we index nulls. What are partial indexes?

A create index statement that accepts a where condition. All
rows that satisfy the where condition are indexed, others not.
This needs intelligence in the optimizer.

This was in postgresql code some time ago, but was removed
for some reason I don't remember.

It was? There's still a ton of code in the optimizer to support it
(a big chunk of indxqual.c is for testing index predicates).

regards, tom lane

#4Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Tom Lane (#3)

This was in postgresql code some time ago, but was removed
for some reason I don't remember.

It was? There's still a ton of code in the optimizer to support it
(a big chunk of indxqual.c is for testing index predicates).

There was talk of removing it, but it seemed to be a Bad Idea to do
so. The discussion even provoked a negative response from the Gods
themselves (in the voice of Paul Aoki) and led to the short
description of them in the docs.

They have definitely been neglected, but are a Good Thing and should
be rehabbed...

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Lockhart (#4)

Zeugswetter Andreas IZ5 <Andreas.Zeugswetter@telecom.at> writes:

Other db's usually use the value count(*) / nunique for the lightweight
statistics. This makes the assumption that the distinct index values are
evenly distributed. That is on average a correct assumption, whereas our
assumption on average overestimates the number of rows returned. I am not
sure we have nunique info, though.

We don't, and AFAICS it would be an expensive statistic to compute.

I have thought about this a little more overnight, and I have come up
with what I think is a better idea. Suppose that VACUUM ANALYZE stores
in pg_statistic not only the disbursion, but also the most frequently
occurring value of each column. It already computes (or I should say
estimates) the most frequently occurring value (MFOV) in order to arrive
at the disbursion, so storing the value costs nothing except a little
more space in pg_statistic. Now, the logic that eqsel() should use is

    if constant-being-compared-against == MFOV then
        return disbursion;
    else
        return MIN(disbursion, 1.0 - disbursion);

which works like this: if we are indeed looking for the MFOV then the
selectivity is just the disbursion, no question. If we are looking for
a value *other* than the MFOV, then the selectivity must be less than
the disbursion, since surely this value occurs less often than the MFOV.
But the total fraction of non-MFOV values in the table is
1.0-disbursion, so the fraction that are the specific value we want
can't exceed that either.

The MIN() above is therefore a hard upper bound for the selectivity
of the non-MFOV case. In practice we might want to multiply the MIN
by a fudge-factor somewhat less than one, to arrive at what we hope
is a reasonable estimate rather than a worst-case estimate.
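
A minimal C sketch of that rule (hedged: the names and the 0.8 fudge
factor are illustrative, not actual backend code):

    #include <stdbool.h>

    static double
    eqsel_mfov(bool const_known, bool const_is_mfov, double disbursion)
    {
        double upper;

        if (const_known && const_is_mfov)
            return disbursion;      /* searching for the MFOV itself */

        /* Any other value occurs no more often than the MFOV, and all
         * non-MFOV values together make up 1 - disbursion of the table,
         * so both are hard upper bounds on the selectivity. */
        upper = (disbursion < 1.0 - disbursion) ? disbursion
                                                : 1.0 - disbursion;

        return 0.8 * upper;         /* fudge toward a typical estimate */
    }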

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

If we use this logic, the stat we really want is exactly the frequency
of the MFOV, not the disbursion which is just closely related to it.
I have not looked at the other uses of disbursion, but if they all can
work like this we might want to forget the statistical niceties and just
store the frequency of the MFOV.

A final comment is that NULL would be treated just like any regular
value in determining what is the MFOV...

regards, tom lane

#6Bernard Frankpitt
frankpit@pop.dn.net
In reply to: Zeugswetter Andreas IZ5 (#2)
Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select)

The code for partial indices is still intact in RTREES, and there is some
information about them in one of the Stonebraker papers. If anyone is
interested I will dig up my file and look for an exact reference.

Bernie

#7Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Zeugswetter Andreas IZ5 (#1)

So, the selectivity that a search for the most common value would
have is a reasonable estimate for the selectivity of a search for any
value. That's a bogus assumption in this case --- but it's hard to
justify making any other assumption in general.

Other db's usually use the value count(*) / nunique for the lightweight
statistics. This makes the assumption that the distinct index values are
evenly distributed. That is on average a correct assumption, whereas our
assumption on average overestimates the number of rows returned. I am not
sure we have nunique info, though.

Yes, that's the problem. Figuring out the number of uniques is hard,
especially with no index.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#8Zeugswetter Andreas IZ5
Andreas.Zeugswetter@telecom.at
In reply to: Bruce Momjian (#7)

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

Yes, this is imho an easy and efficient fix. I would even use a lower
value, like 0.3. Good database design would not create an index for such
bad selectivity anyway. So if you have a performance problem because of
such bad selectivity, the advice is to drop the index.

If you plan to store explicit key values, I would do this in an extra
statistic that stores bunches of equally sized buckets, plus distinct
values for badly skewed values.

Example assuming an int index column:

    from     to        nrow_estimate
    1        100       10005
    101      20000     9997
    20001    100000    10014

badly skewed values (excluded in the above table):

    val      nrow_estimate
    1        100000
    5        1000000
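
A minimal C sketch of how such statistics might drive a row estimate (all
names hypothetical):

    typedef struct { int lo, hi; double nrows; } Bucket;
    typedef struct { int val; double nrows; } SkewedValue;

    static double
    estimate_rows(int key,
                  const Bucket *buckets, int nbuckets,
                  const SkewedValue *skewed, int nskewed)
    {
        int i;

        /* Badly skewed values are tracked individually... */
        for (i = 0; i < nskewed; i++)
            if (skewed[i].val == key)
                return skewed[i].nrows;

        /* ...everything else is assumed evenly spread in its bucket. */
        for (i = 0; i < nbuckets; i++)
            if (key >= buckets[i].lo && key <= buckets[i].hi)
                return buckets[i].nrows /
                       (double) (buckets[i].hi - buckets[i].lo + 1);

        return 0.0;                 /* outside all buckets: no match */
    }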

But imho this is overkill, and seldom useful.

Andreas

#9Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#5)

Zeugswetter Andreas IZ5 <Andreas.Zeugswetter@telecom.at> writes:

Other db's usually use the value count(*) / nunique for the lightweight
statistics. This makes the assumption that the distinct index values are
evenly distributed. That is on average a correct assumption, whereas our
assumption on average overestimates the number of rows returned. I am not
sure we have nunique info, though.

We don't, and AFAICS it would be an expensive statistic to compute.

I have thought about this a little more overnight, and I have come up
with what I think is a better idea. Suppose that VACUUM ANALYZE stores
in pg_statistic not only the disbursion, but also the most frequently
occurring value of each column. It already computes (or I should say
estimates) the most frequently occurring value (MFOV) in order to arrive
at the disbursion, so storing the value costs nothing except a little
more space in pg_statistic. Now, the logic that eqsel() should use is

    if constant-being-compared-against == MFOV then
        return disbursion;
    else
        return MIN(disbursion, 1.0 - disbursion);

Yes, I like this.

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

I don't follow this. If the most frequent value occurs 95% of the time,
wouldn't the selectivity be 0.95? And why use 1 - disbursion? You may
find the existing code does a better job than the MIN() computation.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#10Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Zeugswetter Andreas IZ5 (#8)

Example assuming an int index column:

    from     to        nrow_estimate
    1        100       10005
    101      20000     9997
    20001    100000    10014

badly skewed values (excluded in the above table):

    val      nrow_estimate
    1        100000
    5        1000000

But imho this is overkill, and seldom useful.

Yes, some commercial databases do this, though it is of questionable value.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#10)

Bruce Momjian <maillist@candle.pha.pa.us> writes:

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

I don't follow this. If the most frequent value occurs 95% of the time,
wouldn't the selectivity be 0.95?

If you are searching for the most frequent value, then the selectivity
estimate should indeed be 0.95. If you are searching for anything else,
the selectivity estimate ought to be 0.05 or less. If you don't know
what value you will be searching for, which number should you use?

The unsupported assumption here is that if the table contains 95%
occurrence of a particular value, then the odds are also 95% (or at
least high) that that's the value you are searching for in any given
query that has an "= something" WHERE qual.

That assumption is pretty reasonable in some cases (such as your
example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
database), but it falls down badly in others, such as where the
most common value is NULL or an empty string or some other indication
that there's no useful data. In that sort of situation it's actually
pretty unlikely that the user will be searching for field =
most-common-value ... but the system probably has no way to know that.

I wonder whether it would help to add even more data to pg_statistic.
For example, suppose we store the fraction of the columns that are NULL,
plus the most frequently occurring *non null* value, plus the fraction
of the columns that are that value. This would allow us to be very
smart about columns in which "no data" is represented by NULL (as a good
DB designer would do):

selectivity of "IS NULL": NULLfraction

selectivity of "IS NOT NULL": 1 - NULLfraction

selectivity of "= X" for a known non-null constant X:
if X == MFOV: MFOVfraction
else: MIN(MFOVfraction, 1-MFOVfraction-NULLfraction)

selectivity of "= X" when X is not known a priori, but presumably is not
null:
MIN(MFOVfraction, 1-NULLfraction)

Both of the MIN()s are upper bounds, so multiplying them by a
fudge-factor < 1 would be reasonable.
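
A minimal C sketch of these rules (hedged: names and the fudge factor are
illustrative only, not actual backend code):

    #define FUDGE 0.8   /* scale upper bounds toward a typical estimate */

    static double sel_is_null(double nullfrac)     { return nullfrac; }
    static double sel_is_not_null(double nullfrac) { return 1.0 - nullfrac; }

    /* "= X" for a known non-null constant X */
    static double
    sel_eq_known(int x_is_mfov, double mfovfrac, double nullfrac)
    {
        double bound = 1.0 - mfovfrac - nullfrac;   /* other non-null values */

        if (x_is_mfov)
            return mfovfrac;
        if (mfovfrac < bound)
            bound = mfovfrac;   /* nothing occurs more often than the MFOV */
        return FUDGE * bound;
    }

    /* "= X" when X is unknown, but presumably not null */
    static double
    sel_eq_unknown(double mfovfrac, double nullfrac)
    {
        double bound = 1.0 - nullfrac;

        if (mfovfrac < bound)
            bound = mfovfrac;
        return FUDGE * bound;
    }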

These rules would guarantee small selectivity values when either
MFOVfraction or 1-NULLfraction is small. It still wouldn't cost
much, since I believe VACUUM ANALYZE is counting nulls already...

regards, tom lane

#12Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#11)

Bruce Momjian <maillist@candle.pha.pa.us> writes:

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

I don't follow this. If the most frequent value occurs 95% of the time,
wouldn't the selectivity be 0.95?

If you are searching for the most frequent value, then the selectivity
estimate should indeed be 0.95. If you are searching for anything else,
the selectivity estimate ought to be 0.05 or less. If you don't know
what value you will be searching for, which number should you use?

You are going to love this:

    0.95 * 0.95 + 0.05 * 0.05 = 0.905

This is because, with 95% of one value, you would think they ask for that
value 95% of the time, and for another value 5% of the time. The last
0.05 is not really accurate: it assumes there are only two unique values
in the table, which may be wrong, but it is close enough.

The unsupported assumption here is that if the table contains 95%
occurrence of a particular value, then the odds are also 95% (or at
least high) that that's the value you are searching for in any given
query that has an "= something" WHERE qual.

Yes.

That assumption is pretty reasonable in some cases (such as your
example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
database), but it falls down badly in others, such as where the
most common value is NULL or an empty string or some other indication
that there's no useful data. In that sort of situation it's actually
pretty unlikely that the user will be searching for field =
most-common-value ... but the system probably has no way to know that.

Well, if null is most common, it is very probable they would be looking
for col IS NULL.

I wonder whether it would help to add even more data to pg_statistic.
For example, suppose we store the fraction of the columns that are NULL,
plus the most frequently occurring *non null* value, plus the fraction
of the columns that are that value. This would allow us to be very
smart about columns in which "no data" is represented by NULL (as a good
DB designer would do):

That would be nice.

selectivity of "IS NULL": NULLfraction

selectivity of "IS NOT NULL": 1 - NULLfraction

selectivity of "= X" for a known non-null constant X:
if X == MFOV: MFOVfraction
else: MIN(MFOVfraction, 1-MFOVfraction-NULLfraction)

selectivity of "= X" when X is not known a priori, but presumably is not
null:
MIN(MFOVfraction, 1-NULLfraction)

Both of the MIN()s are upper bounds, so multiplying them by a
fudge-factor < 1 would be reasonable.

Yes, I am with you here.

These rules would guarantee small selectivity values when either
MFOVfraction or 1-NULLfraction is small. It still wouldn't cost
much, since I believe VACUUM ANALYZE is counting nulls already...

Yes, it is. Sounds nice.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#13Philip Warner
pjw@rhyme.com.au
In reply to: Bruce Momjian (#12)
Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

At 20:21 28/07/99 -0400, you wrote:

I wonder whether it would help to add even more data to pg_statistic.
For example, suppose we store the fraction of the columns that are NULL,
plus the most frequently occurring *non null* value, plus the fraction
of the columns that are that value. This would allow us to be very
smart about columns in which "no data" is represented by NULL (as a good
DB designer would do):

That would be nice.

I know I've mentioned this before, but can't the designer of the query be
given some influence over optimizer index choices? We can circle around the
problem of understanding the demographics of a table, but without
row-by-row analysis, you'll *never* get the complete and accurate view that
is needed to cater for all cases.

OTOH, a query designer often knows that a particular query will only be run
to find 'exceptions' (ie. non-nulls when 95% are nulls), or to find 'small'
ranges. IMO, when a DBA is in a position to help the optimizer, they
*should* be allowed to. PG *already* has something like this in the form of
partial indexes: you can view the query that is associated with the index
as a 'hint' as to when that index should be used. All I'm asking is for
queries, not indexes, to specify when an index is used.

This will not in any way replace the optimizer, but it will give users the
ability to deal with pathological cases.

In terms of the statistics collected, it *may* also be worth doing some
rudimentary analysis on the data to see if it conforms to any common
distribution (or sum of distributions), and if it does, save that
information. eg. the optimizer will do pretty well if it *knows* the data
is in a normal distribution, with a mean of 972 and a stdev of 70! Of
course, you must be sure that it *is* a normal distribution to start with.
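
To make that concrete, a hedged C sketch (illustrative only) of how a
known normal distribution would answer range queries, using the mean and
stdev example above:

    #include <math.h>

    /* P(col < x) for a column distributed as Normal(mu, sigma) */
    static double
    normal_cdf(double x, double mu, double sigma)
    {
        return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))));
    }

    /* Estimated fraction of rows with lo < col < hi,
     * e.g. range_selectivity(900.0, 1050.0, 972.0, 70.0) */
    static double
    range_selectivity(double lo, double hi, double mu, double sigma)
    {
        return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma);
    }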

FWIW, statisticians often seem worried about three values: the mean, median
and mode. I don't really know which is which, but they are:

o The average of all values
o The average of the min and max value
o The most common value.

Someone who knows a lot more about this stuff than me can probably tell us
how these values will affect the trust we place in the index statistics.
Someone on this list must be able to give us some insight???

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: +61-03-5367 7422 | _________ \
Fax: +61-03-5367 7430 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#14Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Philip Warner (#13)

FWIW, statisticians often seem worried about three values: the mean, median
and mode. I don't really know which is which, but they are:

o The average of all values
o The average of the min and max value

We have this value. Good for > and < comparisons.

o The most common value.

We know the number of times the most common value occurs, but not the
actual value.

Someone who knows a lot more about this stuff than me can probably tell us
how these values will affect the trust we place in the index statistics.
Someone on this list must be able to give us some insight???

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#15Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Tom Lane (#11)

Tom Lane wrote:

Bruce Momjian <maillist@candle.pha.pa.us> writes:

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

I don't follow this. If the most frequent value occurs 95% of the time,
wouldn't the selectivity be 0.95?

If you are searching for the most frequent value, then the selectivity
estimate should indeed be 0.95. If you are searching for anything else,
the selectivity estimate ought to be 0.05 or less. If you don't know
what value you will be searching for, which number should you use?

The unsupported assumption here is that if the table contains 95%
occurrence of a particular value, then the odds are also 95% (or at
least high) that that's the value you are searching for in any given
query that has an "= something" WHERE qual.

That assumption is pretty reasonable in some cases (such as your
example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
database), but it falls down badly in others, such as where the
most common value is NULL or an empty string or some other indication
that there's no useful data. In that sort of situation it's actually
pretty unlikely that the user will be searching for field =
most-common-value ... but the system probably has no way to know that.

This is exactly what a partial index is supposed to do. And then the
system knows it...

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Lockhart (#15)

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

Tom Lane wrote:

... it falls down badly in others, such as where the
most common value is NULL or an empty string or some other indication
that there's no useful data. In that sort of situation it's actually
pretty unlikely that the user will be searching for field =
most-common-value ... but the system probably has no way to know that.

This is exactly what a partial index is supposed to do. And then the
system knows it...

I've heard a couple of people assert in this thread that partial indexes
are the answer, but I don't believe it. Two reasons:

(1) The system won't use a partial index *at all* unless it can prove
that the index's predicate (condition for including tuples) is implied
by the query's WHERE condition. So the predicate doesn't add a thing
to the system's knowledge about the query.

(2) The statistics that we have available are stats about a column.
Not stats about a column given the predicate of some index. So there's
no gain in our statistical knowledge either.

Partial indexes might be a component of a solution, but they are
very far from being a solution all by themselves.

regards, tom lane

PS: a quick glance at gram.y shows that we don't actually accept
partial-index predicates in CREATE INDEX, so Andreas was right that
the feature got ripped out at some point. I have no idea how much
work might be required to re-enable it... but I'll bet it's not
trivial.

#17Hannu Krosing
hannu@trust.ee
In reply to: Tom Lane (#16)

Tom Lane wrote:

(2) The statistics that we have available are stats about a column.
Not stats about a column given the predicate of some index. So there's
no gain in our statistical knowledge either.

If we added just a count of NULLs, we would cover the NOT NULL case.

Partial indexes might be a component of a solution, but they are
very far from being a solution all by themselves.

regards, tom lane

PS: a quick glance at gram.y shows that we don't actually accept
partial-index predicates in CREATE INDEX, so Andreas was right that
the feature got ripped out at some point. I have no idea how much
work might be required to re-enable it... but I'll bet it's not
trivial.

That's why I suggested getting just the simplest case (NOT NULL)
working first.

The more general approach would of course be to gather stats by index.

--------------
Hannu

#18Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#16)

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

Tom Lane wrote:

... it falls down badly in others, such as where the
most common value is NULL or an empty string or some other indication
that there's no useful data. In that sort of situation it's actually
pretty unlikely that the user will be searching for field =
most-common-value ... but the system probably has no way to know that.

This is exactly what a partial index is supposed to do. And then the
system knows it...

I've heard a couple of people assert in this thread that partial indexes
are the answer, but I don't believe it. Two reasons:

(1) The system won't use a partial index *at all* unless it can prove
that the index's predicate (condition for including tuples) is implied
by the query's WHERE condition. So the predicate doesn't add a thing
to the system's knowledge about the query.

(2) The statistics that we have available are stats about a column.
Not stats about a column given the predicate of some index. So there's
no gain in our statistical knowledge either.

Partial indexes might be a component of a solution, but they are
very far from being a solution all by themselves.

Agreed.
-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#18)

OK, I haven't heard any objections to my last proposal for improving
selectivity estimates for "=", so I'm going to bull ahead and implement it.

This will require adding columns to a system table, which I've never
done before. There are some attribute statistics in pg_attribute and
some in pg_statistic, but it looks like changing pg_attribute is a
pretty dangerous business, so I'm inclined to leave pg_attribute alone
and just add columns to pg_statistic.

Do I need to do anything beyond making the obvious additions to
catalog/pg_statistic.h, rebuild, and initdb? I see that pg_attribute.h
doesn't contain any handmade entries for pg_statistic, so that at least
is no problem...

regards, tom lane

#20Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#19)

OK, I haven't heard any objections to my last proposal for improving
selectivity estimates for "=", so I'm going to bull ahead and implement it.

This will require adding columns to a system table, which I've never
done before. There are some attribute statistics in pg_attribute and
some in pg_statistic, but it looks like changing pg_attribute is a
pretty dangerous business, so I'm inclined to leave pg_attribute alone
and just add columns to pg_statistic.

Do I need to do anything beyond making the obvious additions to
catalog/pg_statistic.h, rebuild, and initdb? I see that pg_attribute.h
doesn't contain any handmade entries for pg_statistic, so that at least
is no problem...

No, that's pretty much it. You have to make sure you get it all
correct, or it will not work, but you know that. The only other thing
is that you have to make sure that every insert into pg_statistic in the
backend knows about the new fields. I do that by looking for all
references to pg_statistic in the backend, and making sure I have the
new fields covered. If you add a varlena field, and there is an index
on the table, you may have to add something to the cache.
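
For readers unfamiliar with catalog headers, a hedged sketch of the
general shape such a change takes (the field names and layout here are
guesses for illustration, not the actual pg_statistic.h or the committed
change):

    CATALOG(pg_statistic)
    {
        Oid     starelid;        /* relation the stats are for */
        int2    staattnum;       /* column number within that relation */
        Oid     staop;           /* operator the stats were computed with */
        float4  stanullfrac;     /* new, hypothetical: fraction of NULLs */
        float4  stacommonfrac;   /* new, hypothetical: fraction = MFOV */
        text    stacommonval;    /* new, hypothetical: the MFOV itself */
    } FormData_pg_statistic;

    /* the attribute-count and attribute-number macros must stay in sync */
    #define Natts_pg_statistic              6
    #define Anum_pg_statistic_starelid      1
    #define Anum_pg_statistic_staattnum     2
    #define Anum_pg_statistic_staop         3
    #define Anum_pg_statistic_stanullfrac   4
    #define Anum_pg_statistic_stacommonfrac 5
    #define Anum_pg_statistic_stacommonval  6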

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026