Use zero for nullness estimates of system attributes

Started by Edmund Horner almost 7 years ago; 6 messages
#1 Edmund Horner
ejrh00@gmail.com
1 attachment(s)

I added some code to selfuncs.c to estimate the selectivity of CTID,
including nullness, in my ongoing attempt to add TID range scans [1]. And
as Tom pointed out [2], no system attribute can be null, so we might as
well handle them all.

That's what the attached patch does.
I observed a few interesting things with outer join selectivity:

While system attributes aren't NULL in the table, they can be in queries
such as:

SELECT *
FROM a LEFT JOIN b ON a.id = b.id
WHERE b.ctid IS NULL;

And the patch does affect the estimates for such plans. But it's just
replacing one hardcoded nullness estimate (0.005) with another (0.0), which
seems no worse than the original.

I was a bit concerned that with, for example,

CREATE TABLE a (id INTEGER);
INSERT INTO a SELECT * FROM generate_series(1,1000);
ANALYZE a;
CREATE TABLE b (id INTEGER, id2 INTEGER);
INSERT INTO b SELECT *, * FROM generate_series(1,10);
ANALYZE b;

EXPLAIN ANALYZE
SELECT * FROM a LEFT JOIN b ON a.id = b.id
WHERE b.ctid IS NULL;

you get a row estimate of 1 (vs. the actual 990). This isn't specific to
system attributes: the plain left-join selectivity calculation doesn't seem
to take the join's selectivity into account, while the anti-join
calculation does.
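
For example (same tables as above; I'm assuming the usual NOT
EXISTS-to-anti-join transformation here, and haven't dug into the numbers),
writing the same filter as an explicit anti-join goes through the anti-join
selectivity path instead:

EXPLAIN ANALYZE
SELECT * FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id);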

I do not think this affects the usefulness of the present patch, but maybe
it's something we could improve.

Finally: I thought about introducing a macro to attnum.h:

/*
* AttrNumberIsForSystemAttr
* True iff the attribute number corresponds to a system attribute.
*/
#define AttrNumberIsForSystemAttr(attributeNumber) \
((bool) ((attributeNumber) < 0))

But there are a zillion places that could be changed to use it, so I
haven't done so in this version of the patch.
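
For illustration (a hypothetical rewrite, not included in the attached
patch), the new test in nulltestsel would then read:

else if (vardata.var && IsA(vardata.var, Var) &&
         AttrNumberIsForSystemAttr(((Var *) vardata.var)->varattno))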

Edmund

[1]: /messages/by-id/31682.1545415852@sss.pgh.pa.us
[2]: /messages/by-id/31682.1545415852@sss.pgh.pa.us

Attachments:

v1-nullness-selectivity-for-system-cols.patch (application/octet-stream)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index dcb35d8..4af8c80 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -1797,6 +1797,15 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
 				return (Selectivity) 0; /* keep compiler quiet */
 		}
 	}
+	else if (vardata.var && IsA(vardata.var, Var) &&
+			 ((Var *) vardata.var)->varattno < 0)
+	{
+		/*
+		 * There are no stats for system columns, but we know they are never
+		 * NULL.
+		 */
+		selec = (nulltesttype == IS_NULL) ? 0.0 : 1.0;
+	}
 	else
 	{
 		/*
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Edmund Horner (#1)
Re: Use zero for nullness estimates of system attributes

Edmund Horner <ejrh00@gmail.com> writes:

I added some code to selfuncs.c to estimate the selectivity of CTID,
including nullness, in my ongoing attempt to add TID range scans [1]. And
as Tom pointed out [2], no system attribute can be null, so we might as
well handle them all.
That's what the attached patch does.

Seems pretty uncontroversial, so pushed.

I observed a few interesting things with outer join selectivity:
While system attributes aren't NULL in the table, they can be in queries
such as:

Yeah, none of our selectivity calculations account for the possibility
that we're above a join that has affected the distribution of a Var's
values. Going to NULL in an outer join is just part of that issue.
I don't feel this patch needs to solve it, and anyway it'd be a rather
massive rethink.

Finally: I thought about introducing a macro to attnum.h:
#define AttrNumberIsForSystemAttr(attributeNumber) \
((bool) ((attributeNumber) < 0))
But there's a zillion places that could be changed to use it, so I haven't
in this version of the patch.

I can't get too excited about that. Even if the reader is unfamiliar with
the negative-attno convention, most of these places are commented in a way
that makes it clear what's going on.

regards, tom lane

#3 Jim Finnerty
jfinnert@amazon.com
In reply to: Tom Lane (#2)
Re: Use zero for nullness estimates of system attributes

re: Yeah, none of our selectivity calculations account for the possibility
that we're above a join that has affected the distribution of a Var's
values.

Tom, there's an analogous issue of adjusting distinct values on a per-column
basis based on the selectivity of other local predicates. Several
commercial RDBMS's make such adjustments in an effort to get better
selectivity estimates when there are multiple local predicates. Is this
something that the PG community has considered and decided not to do because
of the additional planning-time overhead that may be required, or just
something that hasn't been discussed or tackled yet?

-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Finnerty (#3)
Re: Use zero for nullness estimates of system attributes

Jim Finnerty <jfinnert@amazon.com> writes:

Tom, there's an analogous issue of adjusting distinct values on a per-column
basis based on the selectivity of other local predicates. Several
commercial RDBMS's make such adjustments in an effort to get better
selectivity estimates when there are multiple local predicates. Is this
something that the PG community has considered and decided not to do because
of the additional planning-time overhead that may be required, or just
something that hasn't been discussed or tackled yet?

I think what you're talking about is the correlated-variables problem,
which we have made a start on with the "extended statistics" mechanism,
though certainly a lot remains to be done.
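
For concreteness, at the SQL level that mechanism looks like this (table
and statistics names invented for illustration):

CREATE STATISTICS t_x_y_stats (dependencies, ndistinct) ON x, y FROM t;
ANALYZE t;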

regards, tom lane

#5 Jim Finnerty
jfinnert@amazon.com
In reply to: Tom Lane (#4)
Re: Use zero for nullness estimates of system attributes

It's related, but what I was referring to applies even to the uncorrelated
case: suppose you have something like:

select x, sum(z)
from t
where
x > 5 and y in ('a', 'b', 'c')
group by x;

let's say that 'a', 'b', and 'c' are not frequent values of y, so the
estimated selectivity is based on the n_distinct of y and the 3 values. Now
imagine that x > 5 is applied first. That reduces the number of qualifying
rows by the selectivity of (x > 5), but it may also reduce the number of
distinct values of y. If it reduces the n_distinct of y, then the IN
predicate selectivity should be adjusted also.

The negative n_distinct representation (used when n_distinct is a large
fraction of numrows) already provides automatic scaling of n_distinct when
the table grows or shrinks. Adjusting the n_distinct of y after applying
the predicate (x > 5) is conceptually the same thing, except that the
scaling is not linear: it would be based on a (fairly cheap) probabilistic
formula for how many distinct y's can be expected to survive after applying
the predicate on x, rather than assuming that all of them do.
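
As a sketch of the kind of formula I mean (the standard Yao/Cardenas-style
approximation, assuming the predicate keeps a uniform random subset of the
rows; this is not code from any engine):

#include <math.h>

/*
 * Rough estimate of how many of "d" distinct values of y survive after a
 * predicate on x keeps fraction "s" of the table's "N" rows.  Each
 * distinct value of y is assumed to be spread over about N/d rows,
 * independently of the predicate on x.
 */
static double
surviving_ndistinct(double N, double d, double s)
{
	if (N <= 0 || d <= 0)
		return 0.0;

	/* chance that none of one value's ~N/d rows survive: (1 - s)^(N/d) */
	return d * (1.0 - pow(1.0 - s, N / d));
}

With s = 0.5 and N/d = 2, for instance, this predicts that about 75% of
the distinct values survive, rather than all of them.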

If you have correlation information for columns (x, y) you'd want to use
that information instead, but in the absence of that information you could
make this adjustment.

It might make a good intern project if somebody wants to investigate it.

-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Finnerty (#5)
Re: Use zero for nullness estimates of system attributes

Jim Finnerty <jfinnert@amazon.com> writes:

It's related, but what I was referring to applies even to the uncorrelated
case: suppose you have something like:

select x, sum(z)
from t
where
x > 5 and y in ('a', 'b', 'c')
group by x;

let's say that 'a', 'b', and 'c' are not frequent values of y, so the
estimated selectivity is based on the n_distinct of y and the 3 values. Now
imagine that x > 5 is applied first. That reduces the number of qualifying
rows by the selectivity of (x > 5), but it may also reduce the number of
distinct values of y. If it reduces the n_distinct of y, then the IN
predicate selectivity should be adjusted also.

I don't actually think that's a foregone conclusion. If the two WHERE
clauses are in fact independent, then simply multiplying their
selectivities together is the right thing.

regards, tom lane