Fixing some ancient errors in hash join costing

Started by Tom Lane · 4 months ago · 7 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

I spent some time digging into bug #19363 [1], and figured out the
reason why the planner is failing to reject a horribly bad plan.
Even without any stats, it should be able to figure out that building
a hash table estimated at 10 billion rows is less good than building
one estimated at 1000 rows ... but it didn't. The cause is

(1) estimate_hash_bucket_stats is defined to return zero to *mcv_freq
if it cannot obtain a value for the frequency of the most common
value.

(2) final_cost_hashjoin neglected this specification, and would
blindly adopt zero for the innermcvfreq.

(3) This results in calculating zero for the largest hash bucket size,
and thus the code that intends to disable hashjoin when that bucket
size exceeds get_hash_memory_limit() is turned into a no-op.

This is the exact opposite of what we want, I think. The intent in
the planner has always been to avoid hashing unless we are pretty
confident that the inner relation's hash key is well-distributed.
Turning off the disable-for-giant-hash-table check when we have
no stats is the polar opposite of sane.

So I said to myself "this is a one-liner fix, we just have to
disregard any mcv_freq reported as zero". And that did fix the
case shown in the bug report, but it also broke a bunch of
regression test cases. Upon closer investigation, there is also
an old oversight within estimate_hash_bucket_stats itself. It
returns zero for mcv_freq if there's no STATISTIC_KIND_MCV entry,
but that neglects the fact that ANALYZE does not make an MCV entry
if it doesn't find any data values that are noticeably more common
than any others. So the correct behavior really should be to
assume the column is unique and set the mcv_freq to 1 / rows.
In the attached draft patch I made it do this if there's no MCV
stats entry but there is a STATISTIC_KIND_HISTOGRAM entry.
If there's neither, we are probably dealing with a weird datatype
that doesn't have meaningful scalar stats, so I'm hesitant to
just apply the 1 / rows rule blindly.

Even after that, there were more changes in regression test outputs
than I'd expected. Poking into it further, the first diff in join.out
is precisely a case like the bug report's, where we have no stats
about a potentially large self-join. The repaired code decides that a
hash join is too risky, so it disables it and we select a merge join
instead, as desired. However, none of the other places that visibly
change plans are triggering that disable logic. Instead what is
happening is that the improved mcv_freq estimate is getting used
within estimate_hash_bucket_stats to refine its bucket-size result:

	/*
	 * Adjust estimated bucketsize upward to account for skewed distribution.
	 */
	if (avgfreq > 0.0 && *mcv_freq > avgfreq)
		estfract *= *mcv_freq / avgfreq;

This code does nothing if *mcv_freq is zero, but if we have a
more-than-zero estimate it can increase the bucket size fraction,
and that is indeed happening in the other places in the core
regression tests where we see plans change. AFAICT these changes
are all perfectly sane and not anything to worry about.

I also had to stick an additional ANALYZE step into join_hash.sql
to keep plans from changing there.

I remain a bit confused by the change in postgres_fdw.out though.
It's deciding to push an ORDER BY down to the remote side when
it didn't before, which seems like an improvement; but I fail to
see how a marginal change in hash join costing would lead to that.
Perhaps that is worth looking into more closely.

regards, tom lane

[1]: /messages/by-id/19363-8dd32fc7600a1153@postgresql.org

Attachments:

v1-fix-bogus-hashjoin-costing.patch (text/x-diff, +253 −231)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Fixing some ancient errors in hash join costing

I wrote:

I remain a bit confused by the change in postgres_fdw.out though.
It's deciding to push an ORDER BY down to the remote side when
it didn't before, which seems like an improvement; but I fail to
see how a marginal change in hash join costing would lead to that.
Perhaps that is worth looking into more closely.

I dug into that bit today, and concluded that it's a "nothing to
see here" case. We are comparing the costs of doing a sort step
locally vs remotely --- but if the remote server is identically
configured, which it surely is in this test, then cost_sort()
will produce the same answers on both sides, and we are comparing
path costs that are the same to within roundoff error and
cost-quantization effects. My patch does move the underlying
semijoin's cost just a hair, and that results in changes in what
add_path does with the locally-sorted versus remotely-sorted paths,
but really there's no reason to prefer one over the other.

I was amused to notice that the postgres_fdw.out change made in my
patch reverts one made in aa86129e1 (which also affected semijoin
costing). So we've had trouble before with that test case being
fundamentally unstable. I wonder if we shouldn't do something to try
to stabilize it? I see that the test immediately before this one
forces the matter by turning off enable_sort (which'd affect only
the local side not the remote). That's a hack all right but maybe
we should extend it to this test.

regards, tom lane

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: Fixing some ancient errors in hash join costing

I wrote:

I was amused to notice that the postgres_fdw.out change made in my
patch reverts one made in aa86129e1 (which also affected semijoin
costing). So we've had trouble before with that test case being
fundamentally unstable. I wonder if we shouldn't do something to try
to stabilize it? I see that the test immediately before this one
forces the matter by turning off enable_sort (which'd affect only
the local side not the remote). That's a hack all right but maybe
we should extend it to this test.

Here's a v2 patchset that splits that out as a separate change for
clarity's sake. I also spent a bit of effort on commit log messages,
including researching the git history.

regards, tom lane

Attachments:

v2-0001-Further-stabilize-a-postgres_fdw-test-case.patch (text/x-diff, +21 −13)
v2-0002-Ensure-sanity-of-hash-join-costing-when-there-are.patch (text/x-diff, +247 −224)
#4 Chao Li
li.evan.chao@gmail.com
In reply to: Tom Lane (#3)
Re: Fixing some ancient errors in hash join costing

On Dec 29, 2025, at 06:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I was amused to notice that the postgres_fdw.out change made in my
patch reverts one made in aa86129e1 (which also affected semijoin
costing). So we've had trouble before with that test case being
fundamentally unstable. I wonder if we shouldn't do something to try
to stabilize it? I see that the test immediately before this one
forces the matter by turning off enable_sort (which'd affect only
the local side not the remote). That's a hack all right but maybe
we should extend it to this test.

Here's a v2 patchset that splits that out as a separate change for
clarity's sake. I also spent a bit of effort on commit log messages,
including researching the git history.

regards, tom lane

Just a few small comments on v2:

1 - 0001
```
-SET enable_hashjoin TO off;
+-- These next tests require choosing between remote and local sort, which is
+-- a coin flip so long as cost_sort() gives the same results on both sides.
+-- To stabilize the expected plans, disable sorting locally.
SET enable_sort TO off;
-- subquery using stable function (can't be sent to remote)
+SET enable_hashjoin TO off;  -- this one needs even more help to be stable
```

This is not a concern; I'm just curious why the order of setting enable_hashjoin and enable_sort was switched?

2 - 0002
```
+		else if (get_attstatsslot(&sslot, vardata.statsTuple,
+								  STATISTIC_KIND_HISTOGRAM, InvalidOid,
+								  0))
+		{
+			/*
+			 * If there are no recorded MCVs, but we do have a histogram, then
+			 * assume that ANALYZE determined that the column is unique.
+			 */
+			if (vardata.rel && vardata.rel->rows > 0)
+				*mcv_freq = 1.0 / vardata.rel->rows;
+			free_attstatsslot(&sslot);
+		}
```

As flag 0 is passed to get_attstatsslot(), free_attstatsslot() is not needed. The function comment of get_attstatsslot states about that:
```
* Passing flags=0 can be useful to quickly check if the requested slot type
* exists. In this case no arrays are extracted, so free_attstatsslot need
* not be called.
```

3 - In function estimate_hash_bucket_stats(), where mcv_freq is initialized:
```
/* Look up the frequency of the most common value, if available */
*mcv_freq = 0.0;
```

Maybe worth adding a short comment like “0.0 doesn’t mean zero frequency, instead 0.0 means no data or unknown frequency”, which might help code readers to quickly understand the logic.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chao Li (#4)
Re: Fixing some ancient errors in hash join costing

Chao Li <li.evan.chao@gmail.com> writes:

Just a few small comments on v2:

This is not a concern; I'm just curious why the order of setting enable_hashjoin and enable_sort was switched?

The need to disable hashjoin only applies to one of the two tests, so
it seemed to me that this way is more nicely nested. Judgment call
of course.

As flag 0 is passed to get_attstatsslot(), free_attstatsslot() is not needed.

True. I wrote it like that so people wouldn't wonder if I'd forgotten
free_attstatsslot(), but if other call sites passing flags == 0 don't
use it then it'd be better to be consistent. (I didn't check that.)

Maybe worth adding a short comment like “0.0 doesn’t mean zero frequency, instead 0.0 means no data or unknown frequency”, which might help code readers to quickly understand the logic.

Doesn't the function's header comment cover that adequately?

regards, tom lane

#6 Chao Li
li.evan.chao@gmail.com
In reply to: Tom Lane (#5)
Re: Fixing some ancient errors in hash join costing

On Dec 29, 2025, at 11:29, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

Just a few small comments on v2:

This is not a concern; I'm just curious why the order of setting enable_hashjoin and enable_sort was switched?

The need to disable hashjoin only applies to one of the two tests, so
it seemed to me that this way is more nicely nested. Judgment call
of course.

Makes sense. Moving enable_hashjoin closer to the test that depends on it, and resetting it immediately after that test, is clearer.

As flag 0 is passed to get_attstatsslot(), free_attstatsslot() is not needed.

True. I wrote it like that so people wouldn't wonder if I'd forgotten
free_attstatsslot(), but if other call sites passing flags == 0 don't
use it then it'd be better to be consistent. (I didn't check that.)

I searched the source tree, and there doesn't seem to be a direct precedent. The only other usage of flags == 0 is in eqjoinsel(); the code snippet is:
```
	/*
	 * There is no use in fetching one side's MCVs if we lack MCVs for the
	 * other side, so do a quick check to verify that both stats exist.
	 */
	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
					 HeapTupleIsValid(vardata2.statsTuple) &&
					 get_attstatsslot(&sslot1, vardata1.statsTuple,
									  STATISTIC_KIND_MCV, InvalidOid,
									  0) &&
					 get_attstatsslot(&sslot2, vardata2.statsTuple,
									  STATISTIC_KIND_MCV, InvalidOid,
									  0));

	if (HeapTupleIsValid(vardata1.statsTuple))
	{
		/* note we allow use of nullfrac regardless of security check */
		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
		if (get_mcv_stats &&
			statistic_proc_security_check(&vardata1, opfuncoid))
			have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
										  STATISTIC_KIND_MCV, InvalidOid,
										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
	}

	if (HeapTupleIsValid(vardata2.statsTuple))
	{
		/* note we allow use of nullfrac regardless of security check */
		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
		if (get_mcv_stats &&
			statistic_proc_security_check(&vardata2, opfuncoid))
			have_mcvs2 = get_attstatsslot(&sslot2, vardata2.statsTuple,
										  STATISTIC_KIND_MCV, InvalidOid,
										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
	}
```

Here get_attstatsslot() is first called on sslot1 and sslot2 with flags == 0; when they are later passed again with non-zero flags, they are not freed first, which implies that free_attstatsslot() need not be called after a flags == 0 call.

If a reader thought free_attstatsslot() were necessary in this patch, they might then suspect memory leaks in eqjoinsel().

Maybe worth adding a short comment like “0.0 doesn’t mean zero frequency, instead 0.0 means no data or unknown frequency”, which might help code readers to quickly understand the logic.

Doesn't the function's header comment cover that adequately?

Yep, fair point. The function comment does explain that.

My thought was just that at the point of initialization, the nearby comment reads as if we’re about to fetch a value, whereas the assignment is really initializing with “unknown so far”. A short tweak like “Initialize MCV frequency to unknown” might help make that intent obvious locally, but I’m fine either way.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chao Li (#6)
Re: Fixing some ancient errors in hash join costing

Chao Li <li.evan.chao@gmail.com> writes:

On Dec 29, 2025, at 11:29, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

As flag 0 is passed to get_attstatsslot(), free_attstatsslot() is not needed.

True. I wrote it like that so people wouldn't wonder if I'd forgotten
free_attstatsslot(), but if other call sites passing flags == 0 don't
use it then it'd be better to be consistent. (I didn't check that.)

I searched the source tree, and there doesn't seem to be a direct precedent. The only other usage of flags == 0 is in eqjoinsel(), the code snippet is:

Yeah, that's not a direct precedent since there is a
free_attstatsslot() further down; nonetheless it is relying on the
call with flags==0 to not do anything that'd need freeing, since
there's later calls that may overwrite the sslot structs. But
get_attstatsslot's API spec is clear enough that you don't need
free_attstatsslot() in this usage, so I removed it.

My thought was just that at the point of initialization, the nearby comment reads as if we’re about to fetch a value, whereas the assignment is really initializing with “unknown so far”. A short tweak like “Initialize MCV frequency to unknown” might help make that intent obvious locally, but I’m fine either way.

Fair enough, I modified it along those lines.

Pushed, thanks for reviewing.

regards, tom lane