Improve selectivity estimate for range queries

Started by Yuzuko Hosoya over 7 years ago · 12 messages · pgsql-hackers
#1 Yuzuko Hosoya
hosoya.yuzuko@lab.ntt.co.jp

Hi,

I found a problem with the selectivity estimate for range queries
on master, as follows. This is my test case:

postgres=# CREATE TABLE test(id int, val text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,29999)i;
INSERT 0 30000
postgres=# VACUUM analyze test;
VACUUM
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 10000;
QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows=9999 loops=1)
Filter: ((id > 0) AND (id < 10000))
Rows Removed by Filter: 20001
Planning Time: 0.099 ms
Execution Time: 37.142 ms
(5 rows)

Although the actual number of rows was 9999, the estimated number
of rows was 150.

So I dug into what happened and concluded that the following part
of clauselist_selectivity causes this problem.

------------
/*
 * Now scan the rangequery pair list.
 */
while (rqlist != NULL)
{
    RangeQueryClause *rqnext;

    if (rqlist->have_lobound && rqlist->have_hibound)
    {
        /* Successfully matched a pair of range clauses */
        Selectivity s2;

        /*
         * Exact equality to the default value probably means the
         * selectivity function punted.  This is not airtight but should
         * be good enough.
         */
        if (rqlist->hibound == DEFAULT_INEQ_SEL ||
            rqlist->lobound == DEFAULT_INEQ_SEL)
        {
            s2 = DEFAULT_RANGE_INEQ_SEL;
        }
        else
        {
            s2 = rqlist->hibound + rqlist->lobound - 1.0;
------------

In my environment, the selectivity for id > 0 was 0.99990000000000001,
and the selectivity for id < 10000 was 0.33333333333333331; the values
of rqlist->hibound and rqlist->lobound were set accordingly. On the
other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result, the
final selectivity was computed as DEFAULT_RANGE_INEQ_SEL, since the
condition of the second if statement was satisfied. However, these
selectivities were computed from the statistics, so the final
selectivity should have been calculated as
rqlist->hibound + rqlist->lobound - 1.0.

My test case might be uncommon, but I think it could cause some
problems. To handle such cases, I've thought of an idea based on a
previous commit [1], which solved a similar problem related to
DEFAULT_NUM_DISTINCT. Just like that modification, I added to
clauselist_selectivity a flag which shows whether the selectivity was
calculated from the statistics or not, and used it as the condition of
the if statement instead of rqlist->hibound == DEFAULT_INEQ_SEL and
rqlist->lobound == DEFAULT_INEQ_SEL. I am afraid the changes are
bigger than I had expected, but with them we can estimate the
selectivity correctly.

Patch attached. Do you have any thoughts?

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

improve_selectivity_estimate_for_range_queries.patch (application/octet-stream, +74/-38)
#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yuzuko Hosoya (#1)
Re: Improve selectivity estimate for range queries

Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>

In my environment, the selectivity for id > 0 was 0.99990000000000001,
and the selectivity for id < 10000 was 0.33333333333333331; the values
of rqlist->hibound and rqlist->lobound were set accordingly. On the
other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result, the
final selectivity was computed as DEFAULT_RANGE_INEQ_SEL, since the
condition of the second if statement was satisfied. However, these
selectivities were computed from the statistics, so the final
selectivity should have been calculated as
rqlist->hibound + rqlist->lobound - 1.0.

My test case might be uncommon, but I think it could cause some problems.

Yeah, it's known behavior, as the comment just above says. If, in
your example query, the table were very large and had an index, the
planner could wrongly run an index scan over a fair number of tuples.
But I'm not sure how much it matters, and your example doesn't show
that.

The behavior discussed here was introduced by this commit:

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane <tgl@sss.pgh.pa.us>
| Date: Tue Nov 9 00:34:46 2004 +0000
|
| Use a hopefully-more-reliable method of detecting default selectivity
| estimates when combining the estimates for a range query. As pointed out
| by Miquel van Smoorenburg, the existing check for an impossible combined
| result would quite possibly fail to detect one default and one non-default
| input. It seems better to use the default range query estimate in such
| cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
| This is a bit ugly because it introduces additional coupling between
| clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
| there wasn't plenty already...

To handle such cases, I've thought of an idea based on a previous
commit [1], which solved a similar problem related to
DEFAULT_NUM_DISTINCT. Just like that modification, I added a flag
which shows whether the selectivity

The commit [1] added almost no complexity, but this one does: it
touches many functions and introduces tighter coupling between
operator-selectivity functions and clause-selectivity functions.
isdefault travels alone too far just to link distant functions. We
might need a more principled way to handle that.

was calculated from the statistics or not to clauselist_selectivity,
and used it as the condition of the if statement instead of
rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
I am afraid the changes are bigger than I had expected, but with them
we can estimate the selectivity correctly.

Patch attached. Do you have any thoughts?

I've just looked over the patch, and I have some comments.

+ if (isdefault)
+ rqelem->lobound_isdefault = true;

This takes the disjunction of all lobounds ever added; I suppose it
should be the last lobound's isdefault.

As you see, four other instances of "selec ==/!= DEFAULT_*" exist
in mergejoinsel. Don't they lead to a similar kind of problem?

     Selectivity lobound;        /* Selectivity of a var > something clause */
     Selectivity hibound;        /* Selectivity of a var < something clause */
+    bool        lobound_isdefault;    /* lobound is a default selectivity? */
+    bool        hibound_isdefault;    /* hibound is a default selectivity? */
 } RangeQueryClause;

Around [1], there was a discussion about introducing the notion of
uncertainty to selectivity. isdefault is a kind of uncertainty value,
indicating '0% or 100% uncertain'. So my feeling is that it would be
better for Selectivity to have a certainty component and then to build
selectivity arithmetic involving uncertainty, though I don't have a
clear picture.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Yuzuko Hosoya
hosoya.yuzuko@lab.ntt.co.jp
In reply to: Kyotaro Horiguchi (#2)
RE: Improve selectivity estimate for range queries

Hi,

Thanks for the comments.
I attach the v2 patch.

From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Friday, December 21, 2018 12:25 PM

Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>

In my environment, the selectivity for id > 0 was 0.99990000000000001,
and the selectivity for id < 10000 was 0.33333333333333331; the values
of rqlist->hibound and rqlist->lobound were set accordingly. On the
other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result, the
final selectivity was computed as DEFAULT_RANGE_INEQ_SEL, since the
condition of the second if statement was satisfied. However, these
selectivities were computed from the statistics, so the final
selectivity should have been calculated as
rqlist->hibound + rqlist->lobound - 1.0.

My test case might be uncommon, but I think it could cause some problems.

Yeah, it's known behavior, as the comment just above says. If, in
your example query, the table were very large and had an index, the
planner could wrongly run an index scan over a fair number of tuples.
But I'm not sure how much it matters, and your example doesn't show
that.

The behavior discussed here was introduced by this commit:

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane <tgl@sss.pgh.pa.us>
| Date: Tue Nov 9 00:34:46 2004 +0000
|
| Use a hopefully-more-reliable method of detecting default selectivity
| estimates when combining the estimates for a range query. As pointed out
| by Miquel van Smoorenburg, the existing check for an impossible combined
| result would quite possibly fail to detect one default and one non-default
| input. It seems better to use the default range query estimate in such
| cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
| This is a bit ugly because it introduces additional coupling between
| clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
| there wasn't plenty already...

To handle such cases, I've thought of an idea based on a previous
commit [1], which solved a similar problem related to
DEFAULT_NUM_DISTINCT. Just like that modification, I added a flag
which shows whether the selectivity

The commit [1] added almost no complexity, but this one does: it
touches many functions and introduces tighter coupling between
operator-selectivity functions and clause-selectivity functions.
isdefault travels alone too far just to link distant functions. We
might need a more principled way to handle that.

Yes, indeed. But I have not found a better way than this approach yet.

was calculated from the statistics or not to clauselist_selectivity,
and used it as the condition of the if statement instead of
rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
I am afraid the changes are bigger than I had expected, but with them
we can estimate the selectivity correctly.

Patch attached. Do you have any thoughts?

I've just run over the patch, but some have comments.

+ if (isdefault)
+ rqelem->lobound_isdefault = true;

This takes the disjunction of all lobounds ever added; I suppose it
should be the last lobound's isdefault.

Indeed. I fixed it.

As you see, four other instances of "selec ==/!= DEFAULT_*" exist in
mergejoinsel. Don't they lead to a similar kind of problem?

I didn't encounter similar problems, but as you said, such problems
could occur due to the condition selec != DEFAULT_INEQ_SEL. I changed
these conditions into "!isdefault".

Selectivity lobound;        /* Selectivity of a var > something clause */
Selectivity hibound;        /* Selectivity of a var < something clause */
+    bool        lobound_isdefault;    /* lobound is a default selectivity? */
+    bool        hibound_isdefault;    /* hibound is a default selectivity? */
} RangeQueryClause;

Around [1], there was a discussion about introducing the notion of
uncertainty to selectivity. isdefault is a kind of uncertainty value,
indicating '0% or 100% uncertain'. So my feeling is that it would be
better for Selectivity to have a certainty component and then to build
selectivity arithmetic involving uncertainty, though I don't have a
clear picture.

I see. Indeed, if we changed Selectivity so that it includes
information about default/non-default, the problems I mentioned would
be solved. But I'm afraid that this modification would have a big
impact, since there are lots of functions that calculate selectivity.
I don't have a concrete idea either, so I will think about it.
Besides that, I'd like to ask the community whether such a
modification would be acceptable or not.

Best regards,

Yuzuko Hosoya
NTT Open Source Software Center


[1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2_improve_selectivity_estimate_for_range_queries.patch (application/octet-stream, +90/-42)
#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yuzuko Hosoya (#3)
Re: Improve selectivity estimate for range queries

"Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:

From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>

To handle such cases, I've thought of an idea based on a previous
commit [1], which solved a similar problem related to
DEFAULT_NUM_DISTINCT. Just like that modification, I added a flag
which shows whether the selectivity

The commit [1] added almost no complexity, but this one does: it
touches many functions and introduces tighter coupling between
operator-selectivity functions and clause-selectivity functions.
isdefault travels alone too far just to link distant functions. We
might need a more principled way to handle that.

Yeah, I don't think that this approach is a reasonable tradeoff to
eliminate a narrow case. In particular, what's been done here to
clause_selectivity is totally unprincipled and poorly thought out:
you can't add an output parameter that's set in only some cases.
Yet, there's no good way to decide how to set it in many cases.
For instance, what would you do for an AND or OR where some of
the branches are default estimates and some not?

Around [1], there was a discussion about introducing the notion of
uncertainty to selectivity. isdefault is a kind of uncertainty value,
indicating '0% or 100% uncertain'. So my feeling is that it would be
better for Selectivity to have a certainty component and then to build
selectivity arithmetic involving uncertainty, though I don't have a
clear picture.

I see. Indeed, if we changed Selectivity so that it includes
information about default/non-default, the problems I mentioned would
be solved. But I'm afraid that this modification would have a big
impact, since there are lots of functions that calculate selectivity.
I don't have a concrete idea either, so I will think about it.
Besides that, I'd like to ask the community whether such a
modification would be acceptable or not.

The planner definitely has a lot of problems that could be addressed
with some sort of uncertainty framework ... but it'd be a huge
undertaking, which is why nobody's tried yet.

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such. It might
seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

regards, tom lane

#5 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#4)
Re: Improve selectivity estimate for range queries

Hello.

At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>

"Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:

From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>

To handle such cases, I've thought of an idea based on a previous
commit [1], which solved a similar problem related to
DEFAULT_NUM_DISTINCT. Just like that modification, I added a flag
which shows whether the selectivity

The commit [1] added almost no complexity, but this one does: it
touches many functions and introduces tighter coupling between
operator-selectivity functions and clause-selectivity functions.
isdefault travels alone too far just to link distant functions. We
might need a more principled way to handle that.

Yeah, I don't think that this approach is a reasonable tradeoff to
eliminate a narrow case. In particular, what's been done here to
clause_selectivity is totally unprincipled and poorly thought out:
you can't add an output parameter that's set in only some cases.
Yet, there's no good way to decide how to set it in many cases.
For instance, what would you do for an AND or OR where some of
the branches are default estimates and some not?

Around [1], there was a discussion about introducing the notion of
uncertainty to selectivity. isdefault is a kind of uncertainty value,
indicating '0% or 100% uncertain'. So my feeling is that it would be
better for Selectivity to have a certainty component and then to build
selectivity arithmetic involving uncertainty, though I don't have a
clear picture.

I see. Indeed, if we changed Selectivity so that it includes
information about default/non-default, the problems I mentioned would
be solved. But I'm afraid that this modification would have a big
impact, since there are lots of functions that calculate selectivity.
I don't have a concrete idea either, so I will think about it.
Besides that, I'd like to ask the community whether such a
modification would be acceptable or not.

The planner definitely has a lot of problems that could be addressed
with some sort of uncertainty framework ... but it'd be a huge
undertaking, which is why nobody's tried yet.

Sure.

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such. It might

Sounds reasonable.

seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

FWIW, I got the following result in my environment. The values seem
different enough, if this holds on all supported platforms, though
there could still be a case where a sequence of arithmetic operations
produces a false match.

x = 0.33333333333333331
d = 1.000000e+30 match
d = 1.000000e+31: 0.000000 match
x = 0.33333333333334197
d = 0.000000e+00 match
d = 1.000000e+00: 0.000000 match

==== t.c
#include <stdio.h>
#include <math.h>

static int test(double x)
{
    double d = 1.0;
    double d0 = 0;

    fprintf(stderr, "x = %19.17f\n", x);
    while (d != d0)
    {
        double z = floor(d * 3);
        double z1 = z + 1.0;
        double y = d / z;
        double y1 = d / z1;

        /* Check that neither side of d * 3 makes a match */
        if (y != x && y1 != x)
        {
            fprintf(stderr, "  d = %e match\n", d0);
            fprintf(stderr, "  d = %e: match\n", d);
            return 0;
        }

        d0 = d;
        d = d * 10;
    }
    return 1;
}

int main(void)
{
    test(0.3333333333333333);
    test(0.333333333333342);

    return 0;
}
====

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#5)
Re: Improve selectivity estimate for range queries

Mmm.

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>

FWIW, I got the following result in my environment. The values seem
different enough, if this holds on all supported platforms, though
there could still be a case where a sequence of arithmetic operations
produces a false match.

x = 0.33333333333333331
d = 1.000000e+30 match
d = 1.000000e+31: 0.000000 match

Of course, the "match" in the last line above is a mistake for "NOT
match".

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7 Kyotaro HORIGUCHI
horiguchi.kyotaro@oss.ntt.co.jp
In reply to: Kyotaro Horiguchi (#6)
Re: Improve selectivity estimate for range queries

Sigh... In the first place, the loop should not stop until the maximum
exponent. Sorry, but I don't have time now...

2019年1月8日(火) 16:43 Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>:

Mmm.

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>

FWIW, I got the following result in my environment. The values seem
different enough, if this holds on all supported platforms, though
there could still be a case where a sequence of arithmetic operations
produces a false match.

x = 0.33333333333333331
d = 1.000000e+30 match
d = 1.000000e+31: 0.000000 match

Of course the "match" in the last line above is a mistake of "NOT
match".

--


Kyotaro Horiguchi
NTT Open Source Software Center

#8 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#5)
Re: Improve selectivity estimate for range queries

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>

Hello.

At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>

seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

I think we don't need a perfect proof for that. The fact that
exactly 1/3 is quite natural and common but 1/3 + ε is not would
be enough.

FWIW, I got the following result in my environment. The values seem
different enough, if this holds on all supported platforms, though
there could still be a case where a sequence of arithmetic operations
produces a false match.

A simple selectivity over a relation theoretically cannot match the
epsilon-adjusted value. (Of course, in *my* environment.)

(0.333..)
binary format: 3f d5 55 55 55 55 55 55
x = 0.333333333333333315
231 matches, 79 no_matches

(0.3{13}42..)
binary format: 3f d5 55 55 55 55 55 f1
x = 0.333333333333341975
0 matches, 310 no_matches

(0.3{15}42..)
binary format: 3f d5 55 55 55 55 55 57
x = 0.333333333333333426
0 matches, 310 no_matches

It seems that 0.3{13}42 is more properly 0.3{15}42, which differs from
1/3 by just two LSBs. I believe C's conversion of such literals is
well standardized. The other DEFAULT_*_SELs are not compared in this
way.

The attached small patch fixes the case mentioned in this thread, but
I'm not sure where to put a test. A static assertion is not usable,
and an assertion somewhere, perhaps in clauselist_selectivity, seems
somewhat overdone... I can't find a suitable place in the regression
tests...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

t.c (text/plain; charset=us-ascii)
avoid_false_match_with_default_ineq_sel.patch (text/x-patch; charset=us-ascii, +7/-3)
#9 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: Improve selectivity estimate for range queries

On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such. It might
seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

That's not a dumb idea, but it seems pretty unprincipled to me, and to
be honest I'm kind of surprised that you're not proposing something
cleaner. If you admit the need to distinguish between real estimates
and last-ditch ones, why shouldn't we have an explicit representation
of that rather than trying to pick a last-ditch value that is unlikely
to be confused with a real one?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: Improve selectivity estimate for range queries

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such.

That's not a dumb idea, but it seems pretty unprincipled to me, and to
be honest I'm kind of surprised that you're not proposing something
cleaner.

The problem is the invasiveness of such a change (large) vs the benefit
(not so large). The upthread patch attempted to add a separate signaling
path, but it was very incomplete --- and yet both I and Horiguchi-san
thought it was already too messy.

Maybe at some point we'll go over to something reasonably principled,
like adding confidence intervals to all selectivity estimates. That
would be *really* invasive but perhaps would bring enough benefit to
justify itself. But the current patch is just attempting to fix one
extremely narrow pain point, and if that is all it's doing it should
have a commensurately small footprint. So I don't think the submitted
patch looks good from a cost/benefit standpoint.

regards, tom lane

#11 Yuzuko Hosoya
hosoya.yuzuko@lab.ntt.co.jp
In reply to: Tom Lane (#10)
RE: Improve selectivity estimate for range queries

Hi,

Thanks for the comments, and I'm sorry for the late reply.

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, January 11, 2019 7:04 AM

Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such.

That's not a dumb idea, but it seems pretty unprincipled to me, and to
be honest I'm kind of surprised that you're not proposing something
cleaner.

The problem is the invasiveness of such a change (large) vs the
benefit (not so large). The upthread patch attempted to add a
separate signaling path, but it was very incomplete --- and yet both
I and Horiguchi-san thought it was already too messy.

Maybe at some point we'll go over to something reasonably principled,
like adding confidence intervals to all selectivity estimates. That
would be *really* invasive but perhaps would bring enough benefit to
justify itself. But the current patch is just attempting to fix one
extremely narrow pain point, and if that is all it's doing it should
have a commensurately small footprint. So I don't think the submitted
patch looks good from a cost/benefit standpoint.

Yes, I agree with you. Indeed, the patch I attached is insufficient
in terms of cost-effectiveness. However, I want to solve the problem
of real estimates happening to equal the default values, as in this
case, even though it's a narrow pain point. So I tried distinguishing
explicitly between real estimates and default ones, as Robert said.

The idea Tom proposed and Horiguchi-san tried seems reasonable, but
I'm concerned whether a range query really cannot match
0.333333333333342 (or some such) by accident in any environment. Is
the test which Horiguchi-san did enough to prove that?

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#12 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yuzuko Hosoya (#11)
Re: Improve selectivity estimate for range queries

Hi.

At Fri, 11 Jan 2019 11:36:47 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in <000001d4a956$806a2ab0$813e8010$@lab.ntt.co.jp>

Hi,

Thanks for the comments, and I'm sorry for the late reply.

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, January 11, 2019 7:04 AM

Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such.

That's not a dumb idea, but it seems pretty unprincipled to me, and to
be honest I'm kind of surprised that you're not proposing something
cleaner.

The problem is the invasiveness of such a change (large) vs the
benefit (not so large). The upthread patch attempted to add a
separate signaling path, but it was very incomplete --- and yet both
I and Horiguchi-san thought it was already too messy.

Maybe at some point we'll go over to something reasonably principled,
like adding confidence intervals to all selectivity estimates. That
would be *really* invasive but perhaps would bring enough benefit to
justify itself. But the current patch is just attempting to fix one
extremely narrow pain point, and if that is all it's doing it should
have a commensurately small footprint. So I don't think the submitted
patch looks good from a cost/benefit standpoint.

Yes, I agree with you. Indeed, the patch I attached is insufficient
in terms of cost-effectiveness. However, I want to solve the problem
of real estimates happening to equal the default values, as in this
case, even though it's a narrow pain point. So I tried distinguishing
explicitly between real estimates and default ones, as Robert said.

The idea Tom proposed and Horiguchi-san tried seems reasonable, but
I'm concerned whether a range query really cannot match
0.333333333333342 (or some such) by accident in any environment. Is
the test which Horiguchi-san did enough to prove that?

It cannot be perfect in principle, but I think it is reliable enough.
This is not principled at all, but it is very effective relative to
its complexity, I think.

Actually, the values of i/(i*3+1) for some 10^n are:

1/ 4:binary format: 3f d0 00 00 00 00 00 00
10/ 31:binary format: 3f d4 a5 29 4a 52 94 a5
100/ 301:binary format: 3f d5 43 30 7a 78 c5 51
1000/ 3001:binary format: 3f d5 53 83 74 70 f1 95
10000/ 30001:binary format: 3f d5 55 26 bb 44 2b a8
100000/ 300001:binary format: 3f d5 55 50 ac 4a 74 6d
1000000/ 3000001:binary format: 3f d5 55 54 de 07 5a 96
10000000/ 30000001:binary format: 3f d5 55 55 49 67 22 6d
100000000/ 300000001:binary format: 3f d5 55 55 54 23 e9 d7
1000000000/ 3000000001:binary format: 3f d5 55 55 55 36 ca 95
10000000000/ 30000000001:binary format: 3f d5 55 55 55 52 47 75
100000000000/ 300000000001:binary format: 3f d5 55 55 55 55 07 25
1000000000000/ 3000000000001:binary format: 3f d5 55 55 55 55 4d 84
10000000000000/ 30000000000001:binary format: 3f d5 55 55 55 55 54 8d
100000000000000/ 300000000000001:binary format: 3f d5 55 55 55 55 55 41
1000000000000000/ 3000000000000001:binary format: 3f d5 55 55 55 55 55 53

So, (as Tom said upthread) intuitively we can expect that we are
safe up to roughly 10^10? for the simple cases.

# Of course involving histogram makes things complex, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center