WIP patch: distinguish selectivity of < from <= and > from >=

Started by Tom Lanealmost 9 years ago19 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

I was reminded by bug #14729 that we are not very good on corner cases
involving inequality operators, ie < <= > >=. Historically, the
selectivity functions have simply not distinguished < from <= or >
from >=, arguing that the fraction of the population that satisfies
the "=" aspect can be considered to be vanishingly small, if the
comparison value isn't any of the most-common-values for the variable.
(If it is, the code path that executes the operator against each MCV
will take care of things properly.) But that isn't really true unless
we're dealing with a continuum of variable values, and in practice
we seldom are. If "x = const" would estimate a nonzero number of
rows for a given const value, then it follows that we ought to estimate
different numbers of rows for "x < const" and "x <= const", whether or
not the const is one of the MCVs.

Hence, the attached patch, which splits the scalar inequality selectivity
functions into four sets instead of two. This demonstrably produces
better estimates, for example using the "tenk1" table in the regression
database:

regression=# explain analyze select * from tenk1 where thousand < 10;
before:
Bitmap Heap Scan on tenk1 (cost=5.14..241.38 rows=110 width=244) (actual time=0.121..0.623 rows=100 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=5.06..227.42 rows=100 width=244) (actual time=0.054..0.300 rows=100 loops=1)

regression=# explain analyze select * from tenk1 where thousand <= 10;
before:
Bitmap Heap Scan on tenk1 (cost=5.14..241.38 rows=110 width=244) (actual time=0.120..0.383 rows=110 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=5.14..241.38 rows=110 width=244) (actual time=0.062..0.288 rows=110 loops=1)

regression=# explain analyze select * from tenk1 where thousand > 10;
before:
Seq Scan on tenk1 (cost=0.00..483.00 rows=9890 width=244) (actual time=0.030..6.276 rows=9890 loops=1)
with patch:
Seq Scan on tenk1 (cost=0.00..483.00 rows=9890 width=244) (actual time=0.019..4.881 rows=9890 loops=1)

regression=# explain analyze select * from tenk1 where thousand >= 10;
before:
Seq Scan on tenk1 (cost=0.00..483.00 rows=9890 width=244) (actual time=0.022..5.371 rows=9900 loops=1)
with patch:
Seq Scan on tenk1 (cost=0.00..483.00 rows=9900 width=244) (actual time=0.014..3.783 rows=9900 loops=1)

regression=# explain analyze select * from tenk1 where thousand between 10 and 11;
before:
Bitmap Heap Scan on tenk1 (cost=4.39..39.52 rows=10 width=244) (actual time=0.082..0.215 rows=20 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=4.49..70.61 rows=20 width=244) (actual time=0.080..0.207 rows=20 loops=1)
Recheck Cond: ((thousand >= 10) AND (thousand <= 11))

regression=# explain analyze select * from tenk1 where thousand between 10 and 10;
before:
Index Scan using tenk1_thous_tenthous on tenk1 (cost=0.29..8.30 rows=1 width=244) (actual time=0.041..0.112 rows=10 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=4.39..39.52 rows=10 width=244) (actual time=0.074..0.142 rows=10 loops=1)

As these examples show, it's cases with very tight range constraints where
this really makes enough difference to be worth doing. I believe the
complaint in bug #14729 basically corresponds to the last example, where
the erroneous rowcount estimate is driving a bad choice of join plan.

Aside from the mind-bendingly-tedious changes in pg_operator.h, the meat
of the patch is in selfuncs.c's ineq_histogram_selectivity(), which now
applies a correction for equal values in the cases where we were getting
it wrong before. While this logic seems experimentally correct (see
above), I have to admit that I've failed to wrap my brain around exactly
why it's correct. The arguments that I've constructed so far seem to
point in the direction of applying the opposite correction, which is
demonstrably wrong. Perhaps someone whose college statistics class
wasn't quite so long ago can explain this satisfactorily?

Aside from the missing/inadequate comment about why to apply this
correction, there remains work to update several contrib modules that
reference scalarltsel/scalargtsel. That could be done separately though.
An extension that doesn't change its <= or >= operators to reference
scalarlesel/scalargesel isn't worse off than before, it's just failing
to benefit from this improvement.

Obviously this is too late for v10; I'll stick it into the next
commitfest.

regards, tom lane

Attachments:

improve-scalar-inequality-estimates-1.patchtext/x-diff; charset=us-ascii; name=improve-scalar-inequality-estimates-1.patchDownload+296-252
#2Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tom Lane (#1)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

regression=# explain analyze select * from tenk1 where thousand < 10;
before:
Bitmap Heap Scan on tenk1 (cost=5.14..241.38 rows=110 width=244) (actual time=0.121..0.623 rows=100 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=5.06..227.42 rows=100 width=244) (actual time=0.054..0.300 rows=100 loops=1)

It's expected that the estimates will change with this patch. But I am
wondering why should actual times vary so much. May be that's just
accidental. Butthe actual timings are consistently lower with the
patch except the last one

regression=# explain analyze select * from tenk1 where thousand between 10 and 10;
before:
Index Scan using tenk1_thous_tenthous on tenk1 (cost=0.29..8.30 rows=1 width=244) (actual time=0.041..0.112 rows=10 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=4.39..39.52 rows=10 width=244) (actual time=0.074..0.142 rows=10 loops=1)

The actual time has increased even though estimation is correct.

Those differences may just vanish if we take average of multiple runs
and are not that important to ponder about.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ashutosh Bapat (#2)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:

On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

regression=# explain analyze select * from tenk1 where thousand < 10;
before:
Bitmap Heap Scan on tenk1 (cost=5.14..241.38 rows=110 width=244) (actual time=0.121..0.623 rows=100 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=5.06..227.42 rows=100 width=244) (actual time=0.054..0.300 rows=100 loops=1)

It's expected that the estimates will change with this patch. But I am
wondering why should actual times vary so much. May be that's just
accidental. Butthe actual timings are consistently lower with the
patch except the last one

I didn't intend those examples to illustrate anything except row counts,
so I didn't worry about making the timings comparable. Some of the
examples were probably the first query in a session and so would've faced
cache-populating costs the others didn't. Also, the "before" examples
were all done on 9.6 not HEAD.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Tom Lane (#1)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aside from the mind-bendingly-tedious changes in pg_operator.h, the meat
of the patch is in selfuncs.c's ineq_histogram_selectivity(), which now
applies a correction for equal values in the cases where we were getting
it wrong before. While this logic seems experimentally correct (see
above), I have to admit that I've failed to wrap my brain around exactly
why it's correct. The arguments that I've constructed so far seem to
point in the direction of applying the opposite correction, which is
demonstrably wrong. Perhaps someone whose college statistics class
wasn't quite so long ago can explain this satisfactorily?

I guess that you're referring the last case, i.e.
explain analyze select * from tenk1 where thousand between 10 and 10;

IMHO, following are the things that I've understood.
The predicate internally got translated to predicate A (p >= 10) and
predicate B (p <=10);
In ineq_histogram_selectivity,
For predicate A, hist_selec = p
For predicate B, hist_selec = 1-p
In clauselist_selectivity,
we calculate the selectivity as = ((p) + (1 - p)) - 1= 0, rounded of to 1.0e-10.

After your changes,
In ineq_histogram_selectivity,
For predicate A, hist_selec = p + correction (since, isgt=iseq)
For predicate B, hist_selec = 1-p
In clauselist_selectivity,
we calculate the selectivity as = ((p + correction) + (1 - p)) - 1= correction,

The correction is calculated as = 1 / num_distinct_values = .001.

Since, the thousand column of tenk1 is uniformly distributed, this
turns out to be the exact selectivity. (rows = .001 * 1000 = 10)

Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kuntal Ghosh (#4)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:

On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

... I have to admit that I've failed to wrap my brain around exactly
why it's correct. The arguments that I've constructed so far seem to
point in the direction of applying the opposite correction, which is
demonstrably wrong. Perhaps someone whose college statistics class
wasn't quite so long ago can explain this satisfactorily?

I guess that you're referring the last case, i.e.
explain analyze select * from tenk1 where thousand between 10 and 10;

No, the thing that is bothering me is why it seems to be correct to
apply a positive correction for ">=", a negative correction for "<",
and no correction for "<=" or ">". That seems weird and I can't
construct a plausible explanation for it. I think it might be a
result of the fact that, given a discrete distribution rather than
a continuous one, the histogram boundary values should be understood
as having some "width" rather than being zero-width points on the
distribution axis. But the arguments I tried to fashion on that
basis led to other rules that didn't actually work.

It's also possible that this logic is in fact wrong and it just happens
to give the right answer anyway for uniformly-distributed cases.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

I wrote:

No, the thing that is bothering me is why it seems to be correct to
apply a positive correction for ">=", a negative correction for "<",
and no correction for "<=" or ">". That seems weird and I can't
construct a plausible explanation for it.

After further thought, I can put a little more clarity to this, but
it's still not really resolved. It's easily shown by experiment that
the existing code correctly computes the probability that "x <= p"
where p is the given probe value. It uses that value as-is for the <
and <= cases, and 1 minus that value for > and >=. From this statement,
it's clear why the above is the right way to correct matters. What I
find remarkable is that this is what the loop computes regardless of
which of the four operators is used to probe, and regardless of whether
the probe value p is exactly equal to some histogram boundary value.
That doesn't seem intuitive at all --- when p does match a histogram
entry, you'd think it would matter which operator you probe with.

(Pokes at it some more...) Oh, interesting: it behaves that way except
when p is exactly the lowest histogram entry. The code is probably
blowing off that edge case without enough thought.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Tom Lane (#5)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

On Tue, Jul 4, 2017 at 9:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:

On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

... I have to admit that I've failed to wrap my brain around exactly
why it's correct. The arguments that I've constructed so far seem to
point in the direction of applying the opposite correction, which is
demonstrably wrong. Perhaps someone whose college statistics class
wasn't quite so long ago can explain this satisfactorily?

I guess that you're referring the last case, i.e.
explain analyze select * from tenk1 where thousand between 10 and 10;

No, the thing that is bothering me is why it seems to be correct to
apply a positive correction for ">=", a negative correction for "<",
and no correction for "<=" or ">". That seems weird and I can't
construct a plausible explanation for it. I think it might be a
result of the fact that, given a discrete distribution rather than
a continuous one, the histogram boundary values should be understood
as having some "width" rather than being zero-width points on the
distribution axis. But the arguments I tried to fashion on that
basis led to other rules that didn't actually work.

It's also possible that this logic is in fact wrong and it just happens
to give the right answer anyway for uniformly-distributed cases.

So, here are two points I think:
1. When should we apply(add/subtract) the correction?
2. What should be the correction?

The first point:
there can be further two cases,
a) histfrac - actual_selectivity(p<=0) = 0.
For this case, I've an explanation why applying the correction in
above way works correctly. (This is the case with tenk1)
Since, histfrac correctly calculates selectivity for (p<=0),
hist_selec will either be <= or > (isgt is true). Hence, there is no
need to apply the correction. A negative correction is needed for less
than operator (sel(<=10) - sel(=10)). Similarly, a positive correction
is needed for greater than and equals to operator (sel(>10) +
sel(=10)).

b) histfrac - actual_selectivity(p<=0) != 0.
This is possible when we've high variance in the histogram buckets. I
guess here things may go wrong. Please consider the following example,
UPDATE tenk1 SET thousand=11 where thousand=10;
VACUUM ANALYZE tenk1;
explain analyze select * from tenk1 where thousand between 10 and 10;
Bitmap Heap Scan on tenk1 (cost=4.39..39.52 rows=10 width=244)
(actual time=0.018..0.018 rows=0 loops=1)

The second point:
In this case, the correction is calculated correctly as selectivity of
(p=0) because of uniform distribution. Hence, it works. When we don't
have uniform distribution, the current calculation of the correction
may prove to be over-estimation for most of the time.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Kuntal Ghosh (#7)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

On Tue, Jul 4, 2017 at 10:56 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

On Tue, Jul 4, 2017 at 9:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:

On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

... I have to admit that I've failed to wrap my brain around exactly
why it's correct. The arguments that I've constructed so far seem to
point in the direction of applying the opposite correction, which is
demonstrably wrong. Perhaps someone whose college statistics class
wasn't quite so long ago can explain this satisfactorily?

I guess that you're referring the last case, i.e.
explain analyze select * from tenk1 where thousand between 10 and 10;

No, the thing that is bothering me is why it seems to be correct to
apply a positive correction for ">=", a negative correction for "<",
and no correction for "<=" or ">". That seems weird and I can't
construct a plausible explanation for it. I think it might be a
result of the fact that, given a discrete distribution rather than
a continuous one, the histogram boundary values should be understood
as having some "width" rather than being zero-width points on the
distribution axis. But the arguments I tried to fashion on that
basis led to other rules that didn't actually work.

It's also possible that this logic is in fact wrong and it just happens
to give the right answer anyway for uniformly-distributed cases.

So, here are two points I think:
1. When should we apply(add/subtract) the correction?
2. What should be the correction?

The first point:
there can be further two cases,
a) histfrac - actual_selectivity(p<=0) = 0.

Sorry for the typo. I meant (p<=10) for all the cases.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

I wrote:

(Pokes at it some more...) Oh, interesting: it behaves that way except
when p is exactly the lowest histogram entry.

Here's a revised version that addresses that point and cleans up some
other minor details about treatment of cases near the histogram endpoints.
I'm still pretty unclear on exactly why it works (see XXX comments in
patch), but testing shows that it does work very well indeed given a flat
data distribution. On less-flat distributions, you get some estimation
errors, but that's neither new nor surprising.

regards, tom lane

Attachments:

improve-scalar-inequality-estimates-2.patchtext/x-diff; charset=us-ascii; name=improve-scalar-inequality-estimates-2.patchDownload+326-264
#10Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Tom Lane (#9)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

(Pokes at it some more...) Oh, interesting: it behaves that way except
when p is exactly the lowest histogram entry.

+ /*
+ * In the first bin (i==1), add a fudge factor that ensures
+ * that histfrac is at least eq_selec.  We do this because we
+ * know that the first histogram entry does satisfy the
+ * inequality (if !isgt) or not satisfy it (if isgt), so our
+ * estimate here should certainly not be zero even if binfrac
+ * is zero.  (XXX experimentally this is the correct way to do
+ * it, but why isn't it a linear adjustment across the whole
+ * histogram rather than just the first bin?)
+ */
Given that the values are distinct, (I've some doubts for real number case)

if histogram_bounds are assigned as,
{0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..........}

I think the buckets are defined as,
0 < bucket1 <= 9
9 < bucket2 <=19
19 < bucket3 <= 29 and so on.

Because, the histfrac is calculated as follows:

histfrac = (double) (bucket_current - 1) + (val - low) / (high - low);
(where bucket_current is obtained by doing a binary search on
histogram_bounds.)
histfrac /= (double) (nvalues - 1);

So, if val=low, then hisfrac = (bucket_current - 1)/num_of_buckets
which means it assumes val is included in the previous bucket.

This means that it always fails to calculate the selectivity for
lowest histogram boundary. Hence, we need adjustment only for the
first bucket.

Do you think my reasoning justifies your concern?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kuntal Ghosh (#10)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:

On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
+ * In the first bin (i==1), add a fudge factor that ensures
+ * that histfrac is at least eq_selec.  We do this because we
+ * know that the first histogram entry does satisfy the
+ * inequality (if !isgt) or not satisfy it (if isgt), so our
+ * estimate here should certainly not be zero even if binfrac
+ * is zero.  (XXX experimentally this is the correct way to do
+ * it, but why isn't it a linear adjustment across the whole
+ * histogram rather than just the first bin?)

Given that the values are distinct, (I've some doubts for real number case)

Well, really, we are *always* dealing with discrete distributions here.

if histogram_bounds are assigned as,
{0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..........}
...
So, if val=low, then hisfrac = (bucket_current - 1)/num_of_buckets
which means it assumes val is included in the previous bucket.

I looked at that again and realized that one of the answers I was missing
lies in the behavior of analyze.c's compute_scalar_stats(). When it has
collected "nvals" values and wants to make a histogram containing
"num_hist" entries, it does this:

* The object of this loop is to copy the first and last values[]
* entries along with evenly-spaced values in between. So the
* i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].

(where i runs from 0 to num_hist - 1). Because the "/" denotes integer
division, this effectively means that for all but the first entry,
we are taking the last value out of the corresponding bucket of samples.
That's obviously true for the last histogram entry, which will be the very
last sample value, and it's also true for earlier ones --- except for the
*first* histogram entry, which is necessarily the first one in its bucket.
So the first histogram bucket is actually a tad narrower than the others,
which is visible in the typical data you showed above: 0 to 9 is only 9
counts wide, but all the remaining buckets are 10 counts wide. That
explains why we need a scale adjustment just in the first bucket.

I think I'm also beginning to grasp why scalarineqsel's basic estimation
process produces an estimate for "x <= p" rather than some other condition
such as "x < p". For clarity, imagine that we're given p equal to the
last histogram entry. If the test operator is in fact "<=", then it will
say "true" for every histogram entry, and we'll fall off the end of the
histogram and return 1.0, which is correct. If the test operator is "<",
it will return "true" for all but the last entry, so that we end up with
"i" equal to sslot.nvalues - 1 --- but we will compute binfrac = 1.0,
if convert_to_scalar() produces sane answers, again resulting in
histfrac = 1.0. Similar reasoning applies for ">" and ">=", so that in
all cases we arrive at histfrac = 1.0 if p equals the last histogram
entry. More generally, if p is equal to some interior histogram entry,
say the k'th one out of n total, we end up with histfrac = (k-1)/(n-1)
no matter which operator we probe with, assuming that convert_to_scalar()
correctly gives us a binfrac of 0.0 or 1.0 depending on which of the
adjacent bins we end up examining. So, remembering that the histogram
entries occupy the right ends of their buckets, the proper interpretation
of that is that it's the probability of "x <= p". (This is wrong for the
first histogram entry, but that's why we need a correction there.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#11)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

I wrote:

I looked at that again and realized that one of the answers I was missing
lies in the behavior of analyze.c's compute_scalar_stats().

I updated the patch's comments based on this. I'll just attach the
selfuncs.c part of the patch, since nothing else changed.

regards, tom lane

Attachments:

improve-scalar-inequality-estimates-3-selfuncs-only.patchtext/x-diff; charset=us-ascii; name=improve-scalar-inequality-estimates-3-selfuncs-only.patchDownload+203-120
#13Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Tom Lane (#11)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

On Fri, Jul 7, 2017 at 2:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:

Wow. Thank you for the wonderful explanation.

On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

if histogram_bounds are assigned as,
{0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..........}
...
So, if val=low, then hisfrac = (bucket_current - 1)/num_of_buckets
which means it assumes val is included in the previous bucket.

I looked at that again and realized that one of the answers I was missing
lies in the behavior of analyze.c's compute_scalar_stats(). When it has
collected "nvals" values and wants to make a histogram containing
"num_hist" entries, it does this:

* The object of this loop is to copy the first and last values[]
* entries along with evenly-spaced values in between. So the
* i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].

(where i runs from 0 to num_hist - 1). Because the "/" denotes integer
division, this effectively means that for all but the first entry,
we are taking the last value out of the corresponding bucket of samples.
That's obviously true for the last histogram entry, which will be the very
last sample value, and it's also true for earlier ones --- except for the
*first* histogram entry, which is necessarily the first one in its bucket.
So the first histogram bucket is actually a tad narrower than the others,
which is visible in the typical data you showed above: 0 to 9 is only 9
counts wide, but all the remaining buckets are 10 counts wide. That
explains why we need a scale adjustment just in the first bucket.

Agreed. But, I've some doubt regarding the amount of adjustment needed.
+  if (i == 1)
+      histfrac += eq_selec * (1.0 - binfrac);
For predicate x<=p, when p lies in the first bucket, following is the amount
of binfrac that we're off from the actual one.
(Assume the same histogram boundaries i.e., 0,9,19,...)

For p=0, (1/10)-(0/9) = 1/10 * (1 - 0)
For p=1, (2/10)-(1/9) = 1/10 * (1 - 1/9)
For p=2, (3/10)-(2/9) = 1/10 * (1 - 2/9)
For p=3, (4/10)-(3/9) = 1/10 * (1 - 3/9)

In general, 1/(high + 1 - low) * (1.0 - binfrac) and the difference in
histfrac is
(1.0 - binfrac) / (high + 1 - low) / num_of_hist_buckets.

So, the above adjustment for the first bucket looks correct when,
otherdistinct ~ (high + 1 - low) * num_of_hist_buckets

Is that always true or I'm missing something?

I think I'm also beginning to grasp why scalarineqsel's basic estimation
process produces an estimate for "x <= p" rather than some other condition
such as "x < p". For clarity, imagine that we're given p equal to the
last histogram entry. If the test operator is in fact "<=", then it will
say "true" for every histogram entry, and we'll fall off the end of the
histogram and return 1.0, which is correct. If the test operator is "<",
it will return "true" for all but the last entry, so that we end up with
"i" equal to sslot.nvalues - 1 --- but we will compute binfrac = 1.0,
if convert_to_scalar() produces sane answers, again resulting in
histfrac = 1.0. Similar reasoning applies for ">" and ">=", so that in
all cases we arrive at histfrac = 1.0 if p equals the last histogram
entry. More generally, if p is equal to some interior histogram entry,
say the k'th one out of n total, we end up with histfrac = (k-1)/(n-1)
no matter which operator we probe with, assuming that convert_to_scalar()
correctly gives us a binfrac of 0.0 or 1.0 depending on which of the
adjacent bins we end up examining. So, remembering that the histogram
entries occupy the right ends of their buckets, the proper interpretation
of that is that it's the probability of "x <= p". (This is wrong for the
first histogram entry, but that's why we need a correction there.)

True. In general, histfrac = (k-1)/(n-1) + binfrac where binfrac
depends on the linear interpolation.

For p=some histogram boundary, it'll always be the probability of
"x<=p". When p lies inside the bucket and we've a flat distribution,
then also it'll be the probability of "x<=p". But, when we've high
variance inside a bucket and p lies inside the bucket, linear
interpolation fails to capture the actual probability. But, as you've
mentioned earlier, I guess that's not a new problem.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kuntal Ghosh (#13)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

Here's an updated version of this patch that is rebased against HEAD
(no changes there except line numbers) and adds the necessary updates
for contrib.

I'd like to get this committed fairly soon before it bit-rots.
Anyone want to review it?

regards, tom lane

Attachments:

improve-scalar-inequality-estimates-4.patchtext/x-diff; charset=us-ascii; name=improve-scalar-inequality-estimates-4.patchDownload+632-272
#15Aleksander Alekseev
aleksander@timescale.com
In reply to: Tom Lane (#14)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Also I didn't manage to find any typos or obvious mistakes in the code.

The new status of this patch is: Ready for Committer

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Aleksander Alekseev (#15)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

Aleksander Alekseev <a.alekseev@postgrespro.ru> writes:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Also I didn't manage to find any typos or obvious mistakes in the code.

Thanks for reviewing! I assume this is against the prior version of the
patch, though, not the one I just posted with updates for contrib.
Do you want to look at those?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Aleksander Alekseev
aleksander@timescale.com
In reply to: Tom Lane (#16)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

Hi Tom,

Thanks for reviewing! I assume this is against the prior version of the
patch, though, not the one I just posted with updates for contrib.
Do you want to look at those?

regards, tom lane

No, I reviewed the latest v4 patch right after you submitted it.

--
Best regards,
Aleksander Alekseev

#18Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Aleksander Alekseev (#17)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

On Tue, Sep 12, 2017 at 9:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aleksander Alekseev <a.alekseev@postgrespro.ru> writes:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Also I didn't manage to find any typos or obvious mistakes in the code.

Thanks for reviewing! I assume this is against the prior version of the
patch, though, not the one I just posted with updates for contrib.
Do you want to look at those?

I've performed the regression tests for the same. It passed all the
test cases. Also, I've verified the feature implementation using the
queries posted by you earlier and some of my own test cases. It is
working as expected.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kuntal Ghosh (#18)
Re: WIP patch: distinguish selectivity of < from <= and > from >=

Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:

On Tue, Sep 12, 2017 at 9:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aleksander Alekseev <a.alekseev@postgrespro.ru> writes:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
Also I didn't manage to find any typos or obvious mistakes in the code.

I've performed the regression tests for the same. It passed all the
test cases. Also, I've verified the feature implementation using the
queries posted by you earlier and some of my own test cases. It is
working as expected.

Pushed, thanks for reviewing!

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers