Less-silly selectivity for JSONB matching operators
While looking at a recent complaint about bad planning, I was
reminded that jsonb's @> and related operators use "contsel"
as their selectivity estimator. This is really bad, because
(a) contsel is only a stub, yielding a fixed default estimate,
and (b) that default is 0.001, meaning we estimate these operators
as five times more selective than equality, which is surely pretty
silly.
There's a good model for improving this in ltree's ltreeparentsel():
for any "var OP constant" query, we can try applying the operator
to all of the column's MCV and histogram values, taking the latter
as being a random sample of the non-MCV values. That code is
actually 100% generic except for the question of exactly what
default selectivity ought to be plugged in when we don't have stats.
Hence, the attached draft patch moves that logic into a generic
function in selfuncs.c, and then invents "matchsel" and "matchjoinsel"
generic estimators that have a default estimate of twice DEFAULT_EQ_SEL.
(I'm not especially wedded to that number, but it seemed like a
reasonable starting point.)
There were a couple of other operators that seemed to be inappropriately
using contsel, so I changed all of these to use matchsel:
@>(tsquery,tsquery) | tsq_mcontains
<@(tsquery,tsquery) | tsq_mcontained
@@(text,text) | ts_match_tt
@@(text,tsquery) | ts_match_tq
-|-(anyrange,anyrange) | range_adjacent
@>(jsonb,jsonb) | jsonb_contains
?(jsonb,text) | jsonb_exists
?|(jsonb,text[]) | jsonb_exists_any
?&(jsonb,text[]) | jsonb_exists_all
<@(jsonb,jsonb) | jsonb_contained
@?(jsonb,jsonpath) | jsonb_path_exists_opr
@@(jsonb,jsonpath) | jsonb_path_match_opr
Note: you might think that we should just shove this generic logic
into contsel itself, and maybe areasel and patternsel while at it.
However, that would be pretty useless for these functions' intended
usage with the geometric operators, because we collect neither MCV
nor histogram stats for the geometric data types, making the extra
complexity worthless. Pending somebody putting some effort into
estimation for the geometric data types, I think we should just get
out of the business of having non-geometric types relying on these
estimators.
This patch is not complete, because I didn't look at changing
the contrib modules, and grep says at least some of them are using
contsel for non-geometric data types. But I thought I'd put it up
for discussion at this stage.
regards, tom lane
Attachments:
better-jsonb-selectivity-1.patchtext/x-diff; charset=us-ascii; name=better-jsonb-selectivity-1.patchDownload+204-110
I wrote:
This patch is not complete, because I didn't look at changing
the contrib modules, and grep says at least some of them are using
contsel for non-geometric data types. But I thought I'd put it up
for discussion at this stage.
Hearing nothing, I went ahead and hacked on the contrib code.
The attached 0002 updates hstore, ltree, and pg_trgm to get them
out of using contsel/contjoinsel for anything. (0001 is the same
patch I posted before.)
In ltree, I noted that < <= >= > were using contsel even though
those are part of a btree opclass, meaning they could perfectly
well use scalarltsel and friends. So now they do. Everything
else now uses matchsel/matchjoinsel, leaving ltreeparentsel as
an unused backward-compatibility feature. I didn't think that
the default selectivity in ltreeparentsel was particularly sane,
so having those operators use their own selectivity logic
instead of using matchsel like everything else seemed pointless
(and certainly pairing a custom ltreeparentsel with contjoinsel
isn't something to encourage).
In pg_trgm, the change of default selectivity estimate causes one
plan to change, but I think that's fine; looking at the data hidden
by COSTS OFF shows the new estimate is closer to reality anyway.
(That test is meant to exercise some gist consistent-function logic,
which it still does, so no worries there.)
The cube and seg extensions still make significant use of contsel and
the other geometric estimator stubs. Although we could in principle
change those operators to use matchsel, I'm hesitant to do so without
closer analysis. The sort orderings imposed by their default btree
opclasses correlate strongly with cube/seg size, which is related to
overlap/containment outcomes, so I'm not sure that the histogram
entries would provide a plausibly random sample for this purpose.
So those modules are not touched here.
There are a few other random uses of geometric join estimators
paired with non-geometric restriction estimators, including
these in the core core:
@>(anyrange,anyelement) | range_contains_elem | rangesel | contjoinsel
@>(anyrange,anyrange) | range_contains | rangesel | contjoinsel
<@(anyelement,anyrange) | elem_contained_by_range | rangesel | contjoinsel
<@(anyrange,anyrange) | range_contained_by | rangesel | contjoinsel
&&(anyrange,anyrange) | range_overlaps | rangesel | areajoinsel
plus the @@ and ~~ operators in intarray. While this is ugly,
it's probably not worth changing until somebody creates non-stub
join selectivity code that will work for these cases.
regards, tom lane
Hi Tom,
The patches look entirely reasonable to me.
The second one needs to be rebased.
I like the idea of stubbing matchjoinsel for now,
as well as being careful with operators that may correlate with sort
orderings.
The only little thing I can think of is hardcoding it as 2 * DEFAULT_EQ_SEL.
While I don't have any arguments against the value itself I think it
should be configurable independently.
Sadly DEFAULT_MATCH_SEL name is already taken for text patterns.
Not sure if it's a reason to rename all the stuff.
Best, Alex
Quickly tested like this:
create table t(a jsonb);
insert into t select jsonb_object( array[(random() * 10)::int::text],
'{" "}') from generate_series(1, 100000);
insert into t select jsonb_object( array[(random() * 10)::int::text],
array[(random() * 1000)::int::text]) from generate_series(1, 100000);
explain analyze select * from t where a ? '1';
analyze t;
explain analyze select * from t where a ? '1';
Best, Alex
Alexey Bashtanov <bashtanov@imap.cc> writes:
The only little thing I can think of is hardcoding it as 2 * DEFAULT_EQ_SEL.
While I don't have any arguments against the value itself I think it
should be configurable independently.
Sadly DEFAULT_MATCH_SEL name is already taken for text patterns.
Not sure if it's a reason to rename all the stuff.
Yeah, I was going to invent a symbol till I noticed that DEFAULT_MATCH_SEL
was already taken :-(.
There are only about half a dozen uses of that in-core, so maybe we could
get away with renaming that one, but on the whole I'd rather leave it
alone in case some extension is using it. So that leaves us with needing
to find a better name for this new one. Any ideas?
regards, tom lane
So that leaves us with needing
to find a better name for this new one. Any ideas?
I'm thinking of something wide like
opersel, operjoinsel, DEFAULT_OPER_SEL
or maybe even
genericsel, genericjoinsel, DEFAULT_GENERIC_SEL
Best, Alex
Alexey Bashtanov <bashtanov@imap.cc> writes:
So that leaves us with needing
to find a better name for this new one. Any ideas?
I'm thinking of something wide like
opersel, operjoinsel, DEFAULT_OPER_SEL
or maybe even
genericsel, genericjoinsel, DEFAULT_GENERIC_SEL
Seems a little *too* generic :-(
I was wondering about DEFAULT_MATCHING_SEL. The difference in purpose
from DEFAULT_MATCH_SEL wouldn't be too obvious, but then it probably
wouldn't be anyway.
regards, tom lane
I was wondering about DEFAULT_MATCHING_SEL. The difference in purpose
from DEFAULT_MATCH_SEL wouldn't be too obvious, but then it probably
wouldn't be anyway.
Fine with me, especially if both new functions are renamed accordingly.
Best, Alex
Alexey Bashtanov <bashtanov@imap.cc> writes:
I was wondering about DEFAULT_MATCHING_SEL. The difference in purpose
from DEFAULT_MATCH_SEL wouldn't be too obvious, but then it probably
wouldn't be anyway.
Fine with me, especially if both new functions are renamed accordingly.
Yup, that would make sense, will do it like that.
regards, tom lane
Renamed "matchsel" to "matchingsel" etc, added DEFAULT_MATCHING_SEL,
rebased over commit 911e70207. Since that commit already created
new versions of the relevant contrib modules, I think we can just
redefine what those versions contain, rather than making yet-newer
versions. (Of course, that assumes we're going to include this in
v13.)
regards, tom lane
On 31/03/2020 18:53, Tom Lane wrote:
Renamed "matchsel" to "matchingsel" etc, added DEFAULT_MATCHING_SEL,
rebased over commit 911e70207. Since that commit already created
new versions of the relevant contrib modules, I think we can just
redefine what those versions contain, rather than making yet-newer
versions. (Of course, that assumes we're going to include this in
v13.)
Looks good to me.
Best, Alex
Alexey Bashtanov <bashtanov@imap.cc> writes:
On 31/03/2020 18:53, Tom Lane wrote:
Renamed "matchsel" to "matchingsel" etc, added DEFAULT_MATCHING_SEL,
rebased over commit 911e70207. Since that commit already created
new versions of the relevant contrib modules, I think we can just
redefine what those versions contain, rather than making yet-newer
versions. (Of course, that assumes we're going to include this in
v13.)
Looks good to me.
Pushed, thanks for reviewing!
regards, tom lane