[PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

Started by Chengpeng Yan4 months ago16 messageshackers

chengpeng_yan@outlook.com

4 months ago

Hi,

`compute_distinct_stats()` is used for data types that have an equality
operator but no ordering (so we can't use the sort-based path). It
keeps a `track[]` array of candidate MCVs and, for each sampled row,
searches that array for a match. With high `statistics_target`, that
becomes O(n) per sample and can dominate ANALYZE time.

This patch keeps the existing behavior but adds a fast lookup path:

- When `attstattarget >= 100`, and the type's default hash support
matches the equality operator, build a small `simplehash` table mapping
a tracked Datum to its current `track[]` slot (average O(1) match
lookup).

- Otherwise, fall back to the existing linear scan.

While here, the singleton-eviction logic changes from repeatedly
shifting the count=1 region to a simple round-robin cursor over that
region. This keeps replacement O(1) and (when hashing is enabled)
avoids having to update many hash->index mappings.

Performance

I used a small harness focusing on this code path (xid and aclitem, with
match-heavy / unique-heavy / Zipf-ish distributions).

Setup:
- MacBook Pro (Apple M4 Max, 36GB RAM), macOS 26.1 (Darwin 25.1.0)
- Reported numbers are median of 5 runs
- For the "after" numbers below, I set the hash threshold to 0 to show
the potential benefit; the patch as attached enables hashing only for
`attstattarget >= 100`.

Results (ms; before -> after):

obj | statistics_target | before_ms | after_ms | speedup_x
----+------------------+----------+---------+---------
bench_aclitem.x_match | 25 | 174.154 | 182.346 | 0.96
bench_aclitem.x_match | 50 | 55.708 | 55.693 | 1.00
bench_aclitem.x_match | 100 | 74.311 | 63.978 | 1.16
bench_aclitem.x_match | 200 | 143.621 | 86.790 | 1.65
bench_aclitem.x_match | 500 | 462.616 | 125.083 | 3.70
bench_aclitem.x_match | 1000 | 1590.918 | 202.849 | 7.84
bench_aclitem.x_match | 2000 | 5857.253 | 406.840 | 14.40
bench_aclitem.x_match | 5000 | 30844.177 | 1678.826 | 18.37
bench_aclitem.x_match | 10000 | 73134.141 | 9207.071 | 7.94
bench_aclitem.x_unique | 25 | 18.214 | 17.962 | 1.01
bench_aclitem.x_unique | 50 | 39.305 | 37.045 | 1.06
bench_aclitem.x_unique | 100 | 76.555 | 64.800 | 1.18
bench_aclitem.x_unique | 200 | 151.416 | 90.619 | 1.67
bench_aclitem.x_unique | 500 | 497.243 | 134.351 | 3.70
bench_aclitem.x_unique | 1000 | 1788.755 | 209.299 | 8.55
bench_aclitem.x_unique | 2000 | 7249.638 | 332.865 | 21.78
bench_aclitem.x_unique | 5000 | 50785.046 | 604.944 | 83.95
bench_aclitem.x_unique | 10000 | 195506.022 | 792.658 | 246.65
bench_aclitem.x_zipf | 25 | 17.451 | 19.770 | 0.88
bench_aclitem.x_zipf | 50 | 37.131 | 35.560 | 1.04
bench_aclitem.x_zipf | 100 | 76.053 | 67.494 | 1.13
bench_aclitem.x_zipf | 200 | 145.435 | 100.484 | 1.45
bench_aclitem.x_zipf | 500 | 429.574 | 131.647 | 3.26
bench_aclitem.x_zipf | 1000 | 1313.187 | 223.918 | 5.86
bench_aclitem.x_zipf | 2000 | 4117.637 | 445.521 | 9.24
bench_aclitem.x_zipf | 5000 | 14863.325 | 1189.886 | 12.49
bench_aclitem.x_zipf | 10000 | 31706.434 | 2269.572 | 13.97
bench_xid.x_match | 25 | 22.136 | 56.767 | 0.39
bench_xid.x_match | 50 | 42.182 | 40.947 | 1.03
bench_xid.x_match | 100 | 81.089 | 67.848 | 1.20
bench_xid.x_match | 200 | 118.315 | 80.993 | 1.46
bench_xid.x_match | 500 | 403.674 | 120.421 | 3.35
bench_xid.x_match | 1000 | 1246.068 | 171.385 | 7.27
bench_xid.x_match | 2000 | 4374.122 | 406.823 | 10.75
bench_xid.x_match | 5000 | 21961.532 | 1903.715 | 11.54
bench_xid.x_match | 10000 | 55453.808 | 11035.402 | 5.03
bench_xid.x_unique | 25 | 21.062 | 20.011 | 1.05
bench_xid.x_unique | 50 | 40.473 | 41.414 | 0.98
bench_xid.x_unique | 100 | 79.711 | 67.382 | 1.18
bench_xid.x_unique | 200 | 125.255 | 72.853 | 1.72
bench_xid.x_unique | 500 | 412.332 | 96.030 | 4.29
bench_xid.x_unique | 1000 | 1373.882 | 144.837 | 9.49
bench_xid.x_unique | 2000 | 5084.898 | 277.216 | 18.34
bench_xid.x_unique | 5000 | 31412.454 | 574.693 | 54.66
bench_xid.x_unique | 10000 | 126639.495 | 804.947 | 157.33
bench_xid.x_zipf | 25 | 21.915 | 20.758 | 1.06
bench_xid.x_zipf | 50 | 41.458 | 39.498 | 1.05
bench_xid.x_zipf | 100 | 78.128 | 70.466 | 1.11
bench_xid.x_zipf | 200 | 118.361 | 79.816 | 1.48
bench_xid.x_zipf | 500 | 351.440 | 114.009 | 3.08
bench_xid.x_zipf | 1000 | 1049.475 | 173.981 | 6.03
bench_xid.x_zipf | 2000 | 3162.409 | 404.565 | 7.82
bench_xid.x_zipf | 5000 | 11170.524 | 1346.067 | 8.30
bench_xid.x_zipf | 10000 | 23928.836 | 2549.381 | 9.39

The bench harness (bench/) is attached for reproduction.

Patch attached.

--
Best regards,
Chengpeng Yan

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

4 months ago

In reply to: Chengpeng Yan (#1)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

Hi Chengpeng,

On 21.01.2026 17:13, Chengpeng Yan wrote:

`compute_distinct_stats()` is used for data types that have an equality
operator but no ordering (so we can't use the sort-based path). It
keeps a `track[]` array of candidate MCVs and, for each sampled row,
searches that array for a match. With high `statistics_target`, that
becomes O(n) per sample and can dominate ANALYZE time.

Nice catch - this is indeed not first time we run into an O(N^2)
bottleneck in MCV array and address it with hash-based lookup. Thanks
for working on this!

I have a couple of follow-up questions after a quick look at the patch:

1. The hash table is created but I do not see a corresponding destroy call.
2. In the original code, when a value was not found using the
nested-loop search, the singleton (count = 1) region was shifted to make
room for the new entry. After the patch, I no longer see this shifting
logic. I might be missing something, but it looks like the non-hash path
now follows the same replacement behavior as the hash-based one.

I’ll need a bit more time for a deeper review and some local benchmarks,
but overall the approach looks interesting.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

Chengpeng Yan

chengpeng_yan@outlook.com

4 months ago

In reply to: Ilia Evdokimov (#2)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On Jan 22, 2026, at 04:44, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

Nice catch - this is indeed not first time we run into an O(N^2) bottleneck in MCV array and address it with hash-based lookup. Thanks for working on this!

Thanks for the review. This change was actually inspired by your earlier
patch on a similar issue, so thanks a lot for that as well.

I have a couple of follow-up questions after a quick look at the patch:
1. The hash table is created but I do not see a corresponding destroy call.

The hash table is allocated in the per-column temporary memory context
(the “Analyze Column” context that is active while compute_stats()
runs). That context is MemoryContextReset() after each column and
eventually deleted at the end of ANALYZE, so the table is reclaimed
automatically.

This matches the existing convention noted at the end of
compute_distinct_stats() (“We don’t need to bother cleaning up any of
our temporary palloc’s”).

That said, I can add an explicit DistinctHash_destroy() before returning
for clarity, although it shouldn’t be required for correctness or leak
prevention.

2. In the original code, when a value was not found using the nested-loop search, the singleton (count = 1) region was shifted to make room for the new entry. After the patch, I no longer see this shifting logic. I might be missing something, but it looks like the non-hash path now follows the same replacement behavior as the hash-based one.

In the original code, inserting a new distinct value into the singleton
(count = 1) region effectively evicted the oldest singleton (FIFO) by
inserting at the head of that region and shifting the tail, which is
O(n) per new distinct value.

The patch replaces this with a FIFO eviction cursor (effectively a
circular / round-robin cursor) over the count = 1 region, avoiding
repeated shifts. When hashing is enabled, this also avoids having to
update many hash→index mappings due to element shifts.

I kept the non-hash path using the same mechanism so that the hash and
fallback paths follow the same replacement behavior, with the intention
of preserving the original FIFO semantics while avoiding the O(n) cost.

I’ve also posted a v2 update. It adjusts the eviction cursor when a
count = 1 value is promoted (via the existing bubble-up logic), so that
the cursor-based eviction is now exactly equivalent to the original
shift-based FIFO behavior.

I’ve also updated the relevant comments to clarify this behavior. The v2
patch is attached.

--
Best regards,
Chengpeng Yan

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

4 months ago

In reply to: Chengpeng Yan (#3)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On 22.01.2026 12:40, Chengpeng Yan wrote:

The v2 patch is attached.

I took a deeper look at the v2 patch. The hash-based lookup itself looks
correct, and when testing ANALYZE runtime with large
default_statistics_target values, the patch indeed provides a noticeable
speedup.

I have one small suggestion regarding the handling of firstcount1 and
c1_cursor in the 'match' path.

When we find a match in track[], we increment count and perform
bubble-up swaps to keep the array ordered by frequency. If the value
previously has count = 1, it is effectively leaving the singleton (count
= 1) region and becoming part of the count>1. Conceptually, this means
that the boundary between these two regions (firstcount1) should move
left by one.

Given that, it seems sufficient to update the boundary and then ensure
that c1_cursor still point inside the singleton region:

if (was_count1 && j < firstcount1)
firstcount1--;
if (c1_cursor < firstcount1)
c1_cursor = firstcount1;

This avoids reasoning about specific shifted subranges
(firstcount1..match_index). FIFO behavior is still preserved because
c1_cursor is only advanced when an eviction actually happens.

Let me know if I'm overlooking any corner cases.

--
Best regards.
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

Tatsuya Kawata

kawatatatsuya0913@gmail.com

4 months ago

In reply to: Ilia Evdokimov (#4)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

Hi,

As a newcomer who is still learning the codebase, I attempted to review the
v2 patch.
The behavior looks correct to me.

I considered some edge cases and they all seem to be handled properly:

- Edge case 1: c1_cursor is behind match_index
This can occur when firstcount1 advances past c1_cursor due to bubble-up.
It appears to be handled by the condition at L2318.

- Edge case 2: All singletons are promoted, leaving no singletons to evict
The condition `else if (firstcount1 < track_cnt)` at L2348 ensures that
when firstcount1 == track_cnt (empty singleton region), the new value
is simply skipped. This looks correct.

- Edge case 3: c1_cursor exceeds track array bounds
While Ilia's proposal would simplify this, it's currently handled by
L2321.

Regarding Ilia's simplification proposal:

if (was_count1 && j < firstcount1)
firstcount1--;
if (c1_cursor < firstcount1)
c1_cursor = firstcount1;

The simplification looks appealing. However, I may be misunderstanding
something -
should the logic perhaps be:

if (was_count1 && j <= firstcount1)
firstcount1++;
if (c1_cursor < firstcount1)
c1_cursor = firstcount1;

My reasoning:
- Without `<=`, the case where j == firstcount1 would leave firstcount1
pointing
to a value with count=2 (no longer a singleton).
- I believe `firstcount1--` should be `firstcount1++`, since when a
singleton
is promoted to multiply-seen, the singleton region shrinks from the left,
meaning the boundary moves right (increases).

Please correct me if I'm missing something.

Regards,
Tatsuya Kawata

Chengpeng Yan

chengpeng_yan@outlook.com

4 months ago

In reply to: Tatsuya Kawata (#5)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On Feb 1, 2026, at 17:09, Tatsuya Kawata <kawatatatsuya0913@gmail.com> wrote:

Regarding Ilia's simplification proposal:

if (was_count1 && j < firstcount1)
firstcount1--;
if (c1_cursor < firstcount1)
c1_cursor = firstcount1;

The simplification looks appealing. However, I may be misunderstanding something -
should the logic perhaps be:

if (was_count1 && j <= firstcount1)
firstcount1++;
if (c1_cursor < firstcount1)
c1_cursor = firstcount1;

My reasoning:
- Without `<=`, the case where j == firstcount1 would leave firstcount1 pointing
to a value with count=2 (no longer a singleton).
- I believe `firstcount1--` should be `firstcount1++`, since when a singleton
is promoted to multiply-seen, the singleton region shrinks from the left,
meaning the boundary moves right (increases).

Please correct me if I'm missing something.

Regards,
Tatsuya Kawata

Hi,

Thanks for the careful review and the suggestions.

For v3 I did not change the algorithm/behavior; I only adjusted the
comment in the match path to make the bubble-up / cursor interaction
more explicit.

On the firstcount1 / c1_cursor handling: c1_cursor is an index-based
“clock hand” over the count==1 region used to get FIFO eviction without
physically shifting the singleton subarray. The extra cursor adjustment
in the match path is there to preserve the same FIFO victim sequence as
the original shift-based code.

The key case is when a singleton is matched again (was_count1) and then
bubbles up into the count>1 prefix. That bubble-up effectively shifts
track[firstcount1..match_index-1] right by one. If c1_cursor points
anywhere in that shifted range (or at match_index itself), it must
advance by one as well; otherwise the cursor would start referring to a
different element than before the shift and we’d evict a newer singleton
next, breaking FIFO equivalence.

Concrete example: singleton region is [A,B,C,D] and c1_cursor points to
B (next to evict). If D is seen again, it becomes count=2 and bubbles
up, shifting [A,B,C] one slot to the right. B’s index changes, so
c1_cursor must move with it; otherwise the next eviction would
incorrectly target A (or whatever now sits at the old index), not B as
in the shift-based implementation. This is also why the condition uses
<= match_index: the cursor may point at the promoted singleton slot as
well.

Regarding the boundary direction: since firstcount1 is “index of first
singleton”, promotions shrink the singleton region and move the boundary
to the right. In hash mode that’s handled by advancing firstcount1 while
track[firstcount1].count > 1, and then clamping c1_cursor back into the
singleton region if needed.

If anything above still sounds off, happy to walk through a specific
trace.

--
Best regards,
Chengpeng Yan

Tatsuya Kawata

kawatatatsuya0913@gmail.com

4 months ago

In reply to: Chengpeng Yan (#6)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

Hi,

Thank you for the detailed explanation! Your explanation helped me
understand the design much better.
I hope my understanding is now on the right track.

I tested v3 both approaches:

1. Ilia's proposal with corrected increment and <= condition:
if (was_count1 && j <= firstcount1)
firstcount1++;

2. The original patch with while loop:
while (use_hash && firstcount1 < track_cnt &&
track[firstcount1].count > 1)
firstcount1++;

I verified the following cases and both approaches produced correct
track array values after the loop completed:

Case 1: c1_cursor == match_index
c1_cursor points to a singleton, that singleton is matched again,
bubble-up occurs, then a new value arrives triggering eviction.

Case 2: c1_cursor < match_index
c1_cursor is in the earlier part of the singleton region,
and a singleton further back is matched.

Case 3: c1_cursor > match_index
c1_cursor has advanced past match_index due to previous evictions,
and an earlier singleton is matched.

Both approaches seem to work correctly. The code reduction from 1 is
minimal, so either approach should be fine.
I believe the while loop exists to handle potential edge cases,
though in typical scenarios firstcount1 would only increment once per match
(since one singleton is promoted at a time).

Overall, the patch looks good to me.

Regards,
Tatsuya Kawata

Chengpeng Yan

chengpeng_yan@outlook.com

4 months ago

In reply to: Tatsuya Kawata (#7)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On Feb 2, 2026, at 21:09, Tatsuya Kawata <kawatatatsuya0913@gmail.com> wrote:

Hi,

Thank you for the detailed explanation! Your explanation helped me understand the design much better.
I hope my understanding is now on the right track.

I tested v3 both approaches:

1. Ilia's proposal with corrected increment and <= condition:
if (was_count1 && j <= firstcount1)
firstcount1++;

2. The original patch with while loop:
while (use_hash && firstcount1 < track_cnt &&
track[firstcount1].count > 1)
firstcount1++;

I verified the following cases and both approaches produced correct
track array values after the loop completed:

Case 1: c1_cursor == match_index
c1_cursor points to a singleton, that singleton is matched again,
bubble-up occurs, then a new value arrives triggering eviction.

Case 2: c1_cursor < match_index
c1_cursor is in the earlier part of the singleton region,
and a singleton further back is matched.

Case 3: c1_cursor > match_index
c1_cursor has advanced past match_index due to previous evictions,
and an earlier singleton is matched.

Both approaches seem to work correctly. The code reduction from 1 is minimal, so either approach should be fine.
I believe the while loop exists to handle potential edge cases,
though in typical scenarios firstcount1 would only increment once per match (since one singleton is promoted at a time).

Overall, the patch looks good to me.

Hi Tatsuya,

Thank you for the detailed testing and for validating those
c1_cursor/match_index cases. I agree with your conclusion that both
variants behave correctly, and that the code reduction from the
single-step approach is small.

On firstcount1: in the typical case it should advance by one when a
singleton is promoted. I kept the loop-style adjustment mainly as an
invariant repair step in hash mode — after bubble-up, it simply advances
firstcount1 until it again points to the first singleton (or track_cnt).
That makes the update less dependent on subtle index relationships and
is a bit more robust against potential corner cases (or future tweaks to
the reordering), while still being cheap since firstcount1 only moves
forward and is bounded by track_cnt/track_max.

That said, if other community members would prefer the simpler one-step
update for readability, I’m happy to switch.

--
Best regards,
Chengpeng Yan

John Naylor

john.naylor@enterprisedb.com

4 months ago

In reply to: Chengpeng Yan (#8)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On Tue, Feb 3, 2026 at 11:33 AM Chengpeng Yan <chengpeng_yan@outlook.com> wrote:

That said, if other community members would prefer the simpler one-step
update for readability, I’m happy to switch.

To evaluate that, it would be easier if the patch were split into two
commits, one with the simple step approach, and one that changes to
the other approach.

Also, we have some places that switch from e.g. an array to a hash
table once a collection gets above a certain threshold. In this case,
however, having less than 100 stat target is unheard of, in my
experience. It's only a tiny amount of code, but it doesn't seem worth
special-casing. Others may have a different opinion, and I'm not sure
we don't already have this behavior elsewhere.

--
John Naylor
Amazon Web Services

#10

Chengpeng Yan

chengpeng_yan@outlook.com

4 months ago

In reply to: John Naylor (#9)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On Feb 3, 2026, at 13:39, John Naylor <johncnaylorls@gmail.com> wrote:

To evaluate that, it would be easier if the patch were split into two
commits, one with the simple step approach, and one that changes to
the other approach.

Also, we have some places that switch from e.g. an array to a hash
table once a collection gets above a certain threshold. In this case,
however, having less than 100 stat target is unheard of, in my
experience. It's only a tiny amount of code, but it doesn't seem worth
special-casing. Others may have a different opinion, and I'm not sure
we don't already have this behavior elsewhere.

Thanks for the suggestion — I’ve applied it in v4 by splitting the
changes into two commits.

0001 contains the main optimization and uses the simpler one-step update
for firstcount1 in the hash path.

0002 is a small follow-up that switches that logic to the more
invariant-preserving form discussed earlier.

Regarding the hash threshold: my initial approach followed existing
postgres precedent for hash-based MCV matching. For example, commit
057012b205a (“Speed up eqjoinsel() with lots of MCV entries”) introduces
EQJOINSEL_MCV_HASH_THRESHOLD in selfuncs.c, only enabling hashing once
the MCV set is large enough to amortize the setup cost. The exact
threshold differs, but the pattern is similar.

Also, while 100 is the default statistics target, there are internal
cases that use smaller values — e.g. vacuumdb staged analyze temporarily
sets default_statistics_target to 1 and 10 in vacuuming.c — which was
part of the motivation for keeping the thresholded behavior.

That said, I’m happy to adjust or remove the threshold if the consensus
is to simplify this further.

--
Best regards,
Chengpeng Yan

#11

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

3 months ago

In reply to: Chengpeng Yan (#10)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On 07.02.2026 16:39, Chengpeng Yan wrote:

On Feb 3, 2026, at 13:39, John Naylor <johncnaylorls@gmail.com> wrote:

To evaluate that, it would be easier if the patch were split into two
commits, one with the simple step approach, and one that changes to
the other approach.

Also, we have some places that switch from e.g. an array to a hash
table once a collection gets above a certain threshold. In this case,
however, having less than 100 stat target is unheard of, in my
experience. It's only a tiny amount of code, but it doesn't seem worth
special-casing. Others may have a different opinion, and I'm not sure
we don't already have this behavior elsewhere.

Thanks for the suggestion — I’ve applied it in v4 by splitting the
changes into two commits.

I think it would be beneficial to split it into two logically
independent parts. Right now the patch introduces two different changes
at once:

1. Replacing the original shift-based singleton handling with the cursor
based eviction mechanism.
2. Adding the hash-based lookup.

Splitting the patch would make review easier because I can first focus
purely on the algorithmic transformation (shift -> cursor) and verify
semantic equivalence.

Regarding the hash threshold: my initial approach followed existing
postgres precedent for hash-based MCV matching. For example, commit
057012b205a (“Speed up eqjoinsel() with lots of MCV entries”) introduces
EQJOINSEL_MCV_HASH_THRESHOLD in selfuncs.c, only enabling hashing once
the MCV set is large enough to amortize the setup cost. The exact
threshold differs, but the pattern is similar.

Also, while 100 is the default statistics target, there are internal
cases that use smaller values — e.g. vacuumdb staged analyze temporarily
sets default_statistics_target to 1 and 10 in vacuuming.c — which was
part of the motivation for keeping the thresholded behavior.

That said, I’m happy to adjust or remove the threshold if the consensus
is to simplify this further.

I would probably not rely too heavily on EQJOINSEL_MCV_HASH_THRESHOLD as
a direct reference point here. That threshold is somewhat heuristic as
well, and the cost trade-offs in eqjoinsel() are not identical to what
we have in compute_distinct_stats(). In particular, here the break-even
point will strongly depend on the data type and value width. For
example, with wider datums (e.g. bytea), the comparison cost increases,
and the threshold at which hashing amortizes its setup cost may shift
noticeably. We don't need to compute the exact optimal threshold - just
get a reasonable order.

--
Best regards.
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

#12

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

3 months ago

In reply to: Ilia Evdokimov (#11)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

Have you benchmarked this change except in first message in this thread?

While reviewing the patch more closely, I noticed
that compute_distinct_stats() is only used for types where we have =, !=
but not <. In practice, most common scalar types go through
compute_scalar_stats() instead.

That makes me wonder how often this optimization would actually trigger
in real workloads. Since compute_scalar_stats() is the more common path,
there's chance that the hash-table based improvement in
compute_distinct_stats() may not provide a noticeable overall benefit.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

#13

Chengpeng Yan

chengpeng_yan@outlook.com

about 2 months ago

In reply to: Ilia Evdokimov (#11)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

Hi Ilia,

Thanks for the review.

On Feb 17, 2026, at 00:16, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

I think it would be beneficial to split it into two logically independent parts. Right now the patch introduces two different changes at once:
1. Replacing the original shift-based singleton handling with the cursor based eviction mechanism.
2. Adding the hash-based lookup.
Splitting the patch would make review easier because I can first focus purely on the algorithmic transformation (shift -> cursor) and verify semantic equivalence.

I split v5 accordingly. The first patch changes the singleton handling
from shifting to a cursor-based eviction scheme, and the second patch
adds the hash lookup.

On Feb 26, 2026, at 19:46, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

Have you benchmarked this change except in first message in this thread?

Yes.

I reran the same bench scripts from the first post, unchanged. For this
rerun I kept the same comparison setup as in that first message: one run
on the pre-change path with no hash lookup, and one run with hash lookup
forced on for all statistics targets by setting the threshold to 0.

Across the full bench run, the hash-enabled variant shows a 3.963x
geomean speedup. For statistics_target >= 100, all measured points
improved, with a 5.578x geomean speedup in that region.

A few representative rows are:

- bench_aclitem.x_unique, target 10000: 204740.298 ms -> 941.056 ms
(217.564x)
- bench_xid.x_unique, target 10000: 138083.532 ms -> 984.214 ms
(140.298x)

At lower targets, the effect is smaller and mixed. For example:

- bench_xid.x_match, target 50: 42.245 ms -> 43.690 ms (0.967x)

I would probably not rely too heavily on EQJOINSEL_MCV_HASH_THRESHOLD as a direct reference point here. That threshold is somewhat heuristic as well, and the cost trade-offs in eqjoinsel() are not identical to what we have in compute_distinct_stats(). In particular, here the break-even point will strongly depend on the data type and value width. For example, with wider datums (e.g. bytea), the comparison cost increases, and the threshold at which hashing amortizes its setup cost may shift noticeably. We don't need to compute the exact optimal threshold - just get a reasonable order.

I agree that EQJOINSEL_MCV_HASH_THRESHOLD is not a strong direct
reference point here, and I am not treating 100 as something to inherit
from eqjoinsel().

I reran the bench mainly to sanity-check the threshold for
compute_distinct_stats() itself. In this setup, forcing hashing at all
statistics targets gives mixed results below 100, while from 100 upward
the effect is consistently positive. So in the attached v5 I kept 100 as
a conservative heuristic for this relatively narrow path, rather than as
a claim that it is an exact or universally optimal cutoff. That seemed
enough to establish a reasonable order for the threshold, even if it is
not meant as an exact break-even point for every type.

While reviewing the patch more closely, I noticed that compute_distinct_stats() is only used for types where we have =, != but not <. In practice, most common scalar types go through compute_scalar_stats() instead.

That makes me wonder how often this optimization would actually trigger in real workloads. Since compute_scalar_stats() is the more common path, there's chance that the hash-table based improvement in compute_distinct_stats() may not provide a noticeable overall benefit.

I agree that this is a relatively narrow optimization.

In core, the main direct built-in cases on this path are xid, cid, and
aclitem, plus some derived cases over equality-only types. So the scope
here is limited, but when compute_distinct_stats() is exercised,
especially with higher statistics targets, the improvement is
substantial.

I am attaching v5 as a two-patch series, together with
bench-compare-result.out containing the full no-hash vs all-hash
comparison.

--
Best regards,
Chengpeng Yan

#14

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

27 days ago

In reply to: Chengpeng Yan (#13)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On 4/14/26 04:34, Chengpeng Yan wrote:

I split v5 accordingly. The first patch changes the singleton handling
from shifting to a cursor-based eviction scheme, and the second patch
adds the hash lookup.

I reviewed v5 of the patches. Instead of going through each issue one by
one, I made a pass to clean up and clarify the code and summarize the
main changes belows:

- Fixed a few typos in comments and added comments in places where the
logic was not immediately clear;

- Rewrote the bubble-up loop using `for` loop, which I find more
readable. Also removed some confusing uses of the `j` variable that mase
the flow harder to follow;

- Simplified parts of the code to improve overall readability.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

#15

John Naylor

john.naylor@enterprisedb.com

26 days ago

In reply to: Ilia Evdokimov (#12)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

On Thu, Feb 26, 2026 at 6:46 PM Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

While reviewing the patch more closely, I noticed
that compute_distinct_stats() is only used for types where we have =, !=
but not <. In practice, most common scalar types go through
compute_scalar_stats() instead.

That makes me wonder how often this optimization would actually trigger
in real workloads. Since compute_scalar_stats() is the more common path,
there's chance that the hash-table based improvement in
compute_distinct_stats() may not provide a noticeable overall benefit.

Coming back to this point, in any installation, the common path is
going to vastly outnumber the rare path, so this patch is optimizing
the wrong thing.

--
John Naylor
Amazon Web Services

#16

Chengpeng Yan

chengpeng_yan@outlook.com

24 days ago

In reply to: Ilia Evdokimov (#14)

Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

Hi,

On May 3, 2026, at 22:16, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

I reviewed v5 of the patches. Instead of going through each issue one by one, I made a pass to clean up and clarify the code and summarize the main changes belows:
- Fixed a few typos in comments and added comments in places where the logic was not immediately clear;
- Rewrote the bubble-up loop using `for` loop, which I find more readable. Also removed some confusing uses of the `j` variable that mase the flow harder to follow;
- Simplified parts of the code to improve overall readability.

Thanks for the review and for working on this.

After giving it more thought, I agree with John’s point that the set of
types benefiting from this optimization is currently too limited. While
this optimization can still provide noticeable improvements in some
specific cases, those cases do not seem common enough at this stage, so
the overall benefit also seems fairly small.

Because of that, I marked this patch as Withdrawn in the current
CommitFest and plan to pause further work on it for now.

That said, if postgres or some extensions introduce more commonly used
types that are a good fit for this optimization, I think it would make
sense to revisit and potentially revive this work.

Thanks again for the effort and contribution.

--
Best regards,
Chengpeng Yan

[PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: