Improve docs for n_distinct_inherited
In [1]/messages/by-id/18916-985c04e9b5bed12b@postgresql.org there's a bug report about ALTER TABLE ... ALTER COLUMN SET
(n_distinct = N) not working for partitioned tables. Of course, you
need to use n_distinct_inherited for partitioned tables, but the docs
don't say that.
I went through a few iterations of the wording to make this clearer
and landed on the attached.
David
Attachments:
doc_n_distinct_inherited.patchapplication/octet-stream; name=doc_n_distinct_inherited.patchDownload
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index d63f3a621ac..9e8e9388774 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -366,7 +366,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
<link linkend="sql-analyze"><command>ANALYZE</command></link>
operations. <literal>n_distinct</literal> affects the statistics for the table
itself, while <literal>n_distinct_inherited</literal> affects the statistics
- gathered for the table plus its inheritance children. When set to a
+ gathered which include the descendant tables of partitioned tables and inheritance parent tables. When set to a
positive value, <command>ANALYZE</command> will assume that the column contains
exactly the specified number of distinct nonnull values. When set to a
negative value, which must be greater
On Wed, May 7, 2025 at 4:15 PM David Rowley <dgrowleyml@gmail.com> wrote:
In [1] there's a bug report about ALTER TABLE ... ALTER COLUMN SET
(n_distinct = N) not working for partitioned tables. Of course, you
need to use n_distinct_inherited for partitioned tables, but the docs
don't say that.I went through a few iterations of the wording to make this clearer
and landed on the attached.
Not liking the proposal, not sure it is even correct. Somehow "children of
inheritance parent tables" are omitted. In both cases the only statistics
affected are those of the table upon which the option is being read. That
can go implied. Leaving "affects the statistics for the table itself,
while" is causing most of the harm.
In terms of wording - n_distinct must be used for all tables except
partitioned tables and inheritance parent tables. Those two acquire their
override value from n_distinct_inherited. So something like:
diff --git a/doc/src/sgml/ref/alter_table.sgml
b/doc/src/sgml/ref/alter_table.sgml
index 24316bb816a..657e4c835ac 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -339,9 +339,13 @@ WITH ( MODULUS <replaceable
class="parameter">numeric_literal</replaceable>, REM
<literal>n_distinct_inherited</literal>, which override the
number-of-distinct-values estimates made by subsequent
<link linkend="sql-analyze"><command>ANALYZE</command></link>
- operations. <literal>n_distinct</literal> affects the statistics
for the table
- itself, while <literal>n_distinct_inherited</literal> affects the
statistics
- gathered for the table plus its inheritance children. When set to a
+ operations.
+ For partitioned tables and inheritance parents
<command>ANALYZE</command>
+ consults the <literal>n_distinct_inherited</literal> option and, if
applicable,
+ includes all descendants when calculating the table size for use in
the
+ number-of-distinct-values estimate. For all other tables,
<literal>n_distinct</literal>
+ is used and the table size estimate is based on the table itself.
+ When set to a
positive value, <command>ANALYZE</command> will assume that the
column contains
exactly the specified number of distinct nonnull values. When set
to a
negative value, which must be greater
It has a bit of forward-referencing to the following paragraph, but nothing
surprising or novel. The existing wording is already doing the same by
alluding to descendants/children - but its just one data point that is
involved so lets call it out.
David J.
On Thu, 8 May 2025 at 15:23, David G. Johnston
<david.g.johnston@gmail.com> wrote:
Not liking the proposal, not sure it is even correct. Somehow "children of inheritance parent tables" are omitted.
I don't see the quoted text anywhere in this area, so I'm not sure I
follow what you mean with the omission.
In both cases the only statistics affected are those of the table upon which the option is being read. That can go implied. Leaving "affects the statistics for the table itself, while" is causing most of the harm.
So the text, when it says "affects the statistics for the table
itself" is talking about stainherit==false statistics and the "the
statistics gathered which include the descendant tables of partitioned
tables and inheritance parent tables" which I added is trying to
describe the stainherits==true statistics. You may have
misinterpreted that to think it was talking which table the option was
being applied to rather than where the sample rows were being gathered
from.
In terms of wording - n_distinct must be used for all tables except partitioned tables and inheritance parent tables. Those two acquire their override value from n_distinct_inherited. So something like:
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 24316bb816a..657e4c835ac 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -339,9 +339,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM <literal>n_distinct_inherited</literal>, which override the number-of-distinct-values estimates made by subsequent <link linkend="sql-analyze"><command>ANALYZE</command></link> - operations. <literal>n_distinct</literal> affects the statistics for the table - itself, while <literal>n_distinct_inherited</literal> affects the statistics - gathered for the table plus its inheritance children. When set to a + operations. + For partitioned tables and inheritance parents <command>ANALYZE</command> + consults the <literal>n_distinct_inherited</literal> option and, if applicable, + includes all descendants when calculating the table size for use in the + number-of-distinct-values estimate. For all other tables, <literal>n_distinct</literal> + is used and the table size estimate is based on the table itself. + When set to a positive value, <command>ANALYZE</command> will assume that the column contains exactly the specified number of distinct nonnull values. When set to a negative value, which must be greaterIt has a bit of forward-referencing to the following paragraph, but nothing surprising or novel. The existing wording is already doing the same by alluding to descendants/children - but its just one data point that is involved so lets call it out.
I can't follow your proposed text. It seems to indicate that ANALYZE
uses n_distinct_inherits to calculate the table size. ANALYZE does
nothing of the sort. I think you might have been confused by the
following text which talks about how *the planner* proportions
*negative* stadistinct values over the "table size", which really
should say estimated row count and not table size. table size has
nothing to do with this, but that seems like another issue.
As for my proposed patch and the text it changes, what this text is
meant to be talking about is which of ANALYZE's sample sets the
stadistinct estimate is overridden for. The two options for sample
sets are 1) stainherits==true and 2) stainherits==false. With
inheritance parent tables, ANALYZE gathers both sets and records both
sets in pg_statistic. Which one the planner uses depends on what it's
doing and whether you specify ONLY or not in your query. For
partitioned tables, since they're empty, we don't bother with
stainherits==false stats.
I've attached another patch which adjusts the text to detail that
there are two sample row sets gathered for some cases, and then
mentions the two options that can be used to selectively overwrite
each sample's n_distinct estimate. I'd rather find a shorter set of
words to better explain this than what I'm proposing in v2, but the
current text has presumably confused one person (in the bug report)
and the v1 patch still seems to be a source of confusion.
David
Attachments:
doc_n_distinct_inherited_v2.patchapplication/x-patch; name=doc_n_distinct_inherited_v2.patchDownload
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index d63f3a621ac..10aaf1b3d34 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -364,9 +364,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
<literal>n_distinct_inherited</literal>, which override the
number-of-distinct-values estimates made by subsequent
<link linkend="sql-analyze"><command>ANALYZE</command></link>
- operations. <literal>n_distinct</literal> affects the statistics for the table
- itself, while <literal>n_distinct_inherited</literal> affects the statistics
- gathered for the table plus its inheritance children. When set to a
+ operations. For inheritance parent tables, <command>ANALYZE</command>
+ operates on two row sample sets, one which includes sample rows from
+ only the inheritance parent itself, and the other set also contains a
+ proportionate set of sample rows from the table's inheritance children.
+ Adjusting <literal>n_distinct</literal> affects the statistics for the
+ sample without the inheritance children, while
+ <literal>n_distinct_inherited</literal> affects the statistics which
+ include the child sample rows. For partitioned tables,
+ <command>ANALYZE</command> does not gather statistics for the
+ partitioned table itself as it never contains rows, so
+ <literal>n_distinct_inherited</literal> must be used for partitioned
+ tables. For non-partitioned and non-inheritance parent tables, only
+ <literal>n_distinct</literal> is applicable. When set to a
positive value, <command>ANALYZE</command> will assume that the column contains
exactly the specified number of distinct nonnull values. When set to a
negative value, which must be greater
On Wed, May 7, 2025 at 11:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 8 May 2025 at 15:23, David G. Johnston
<david.g.johnston@gmail.com> wrote:Not liking the proposal, not sure it is even correct. Somehow "children
of inheritance parent tables" are omitted.
I don't see the quoted text anywhere in this area, so I'm not sure I
follow what you mean with the omission.
Omitted from your v1 patch.
The two options for sample
sets are 1) stainherits==true and 2) stainherits==false. With
inheritance parent tables, ANALYZE gathers both sets and records both
sets in pg_statistic.
I was missing this key piece of knowledge which invalidated my entire
attempt.
Here's an attempt at shortening this now that I understand the mechanics
better.
Separate options exist because an inheritance parent table has two
different sets of statistics: one considering only itself and one
which
also includes its children (<literal>n_distinct_inherited</literal>).
Partitioned tables, which only have rows in the children, likewise
uses
the inherited option while everyone else uses
<literal>n_distinct</literal>.
David J.
Just picking this one up again. I forgot to come back to this after PGConf.dev.
On Fri, 9 May 2025 at 02:50, David G. Johnston
<david.g.johnston@gmail.com> wrote:
I was missing this key piece of knowledge which invalidated my entire attempt.
Here's an attempt at shortening this now that I understand the mechanics better.
Separate options exist because an inheritance parent table has two
different sets of statistics: one considering only itself and one which
also includes its children (<literal>n_distinct_inherited</literal>).
Partitioned tables, which only have rows in the children, likewise uses
the inherited option while everyone else uses <literal>n_distinct</literal>.
I wasn't quite happy with that as the text indicates that
n_distinct_inherited is the statistics. But, it's not, it's just the
option that allows some modification of the gathered statistics.
I came up with:
Ordinarily <literal>n_distinct</literal> is used.
<literal>n_distinct_inherited</literal> exists to allow the distinct
estimate to be overwritten for the statistics gathered for inheritance
parent tables and for partitioned tables.
I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things. That's not true. It's the query
planner that uses this value. ANALYZE just stores whatever this is set
to into pg_statistic. I also adjusted the text that was talking about
"the size of the table", which, as I mentioned earlier isn't correct.
It's all related to the estimated number rows in the table, per
"ntuples = vardata->rel->tuples;" in get_variable_numdistinct().
Also fixed a typo; "twice on the average" shouldn't contain "the".
I wonder if ", since the multiplication by the number of rows in the
table is not performed until query planning time" should be deleted
since I modified the text earlier to talk about "the query planner".
David
Attachments:
doc_n_distinct_inherited_v3.patchapplication/octet-stream; name=doc_n_distinct_inherited_v3.patchDownload
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 8867da6c693..7d944f1ffe1 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -376,18 +376,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
<literal>n_distinct_inherited</literal>, which override the
number-of-distinct-values estimates made by subsequent
<link linkend="sql-analyze"><command>ANALYZE</command></link>
- operations. <literal>n_distinct</literal> affects the statistics for the table
- itself, while <literal>n_distinct_inherited</literal> affects the statistics
- gathered for the table plus its inheritance children. When set to a
- positive value, <command>ANALYZE</command> will assume that the column contains
- exactly the specified number of distinct nonnull values. When set to a
- negative value, which must be greater
- than or equal to -1, <command>ANALYZE</command> will assume that the number of
- distinct nonnull values in the column is linear in the size of the
- table; the exact count is to be computed by multiplying the estimated
- table size by the absolute value of the given number. For example,
+ operations. Ordinarily <literal>n_distinct</literal> is used.
+ <literal>n_distinct_inherited</literal> exists to allow the distinct
+ estimate to be overwritten for the statistics gathered for inheritance
+ parent tables and for partitioned tables. When these are set to a
+ positive value, the query planner will assume that
+ the column contains exactly the specified number of distinct nonnull
+ values. When set to a negative value, which must be greater
+ than or equal to -1, the query planner will assume that the number of
+ distinct nonnull values in the column is linear with the estimated number
+ of rows in the table; the exact count is to be computed by multiplying the estimated
+ rows in the table by the absolute value of the given number. For example,
a value of -1 implies that all values in the column are distinct, while
- a value of -0.5 implies that each value appears twice on the average.
+ a value of -0.5 implies that each value appears twice on average.
This can be useful when the size of the table changes over time, since
the multiplication by the number of rows in the table is not performed
until query planning time. Specify a value of 0 to revert to estimating
On Sun, Oct 12, 2025 at 7:42 PM David Rowley <dgrowleyml@gmail.com> wrote:
Just picking this one up again. I forgot to come back to this after
PGConf.dev.On Fri, 9 May 2025 at 02:50, David G. Johnston
<david.g.johnston@gmail.com> wrote:I was missing this key piece of knowledge which invalidated my entire
attempt.
Here's an attempt at shortening this now that I understand the mechanics
better.
Separate options exist because an inheritance parent table has two
different sets of statistics: one considering only itself and onewhich
also includes its children
(<literal>n_distinct_inherited</literal>).
Partitioned tables, which only have rows in the children, likewise
uses
the inherited option while everyone else uses
<literal>n_distinct</literal>.
I wasn't quite happy with that as the text indicates that
n_distinct_inherited is the statistics. But, it's not, it's just the
option that allows some modification of the gathered statistics.I came up with:
Ordinarily <literal>n_distinct</literal> is used.
<literal>n_distinct_inherited</literal> exists to allow the distinct
estimate to be overwritten for the statistics gathered for
inheritance
parent tables and for partitioned tables.
How about:
"n_disinct is used for normal tables while n_distinct_inherited is used for
partitioned tables. Both are usable (selected via the ONLY modifier) for
an inheritance parent table since it has both storage and children."
The use of both "Ordinarily" and "overwritten" is bothering me here. And
it implies that n_distinct doesn't work for inheritance parent tables or,
conversely, that n_distinct does work for partitioned tables.
I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things.
values in the column is linear with the estimated number
of rows in the table; the exact count is to be computed by
multiplying the estimated
rows in the table by the absolute value of the given number.
"...is proportional to the estimated number of rows in the table at
planning time."
(The final "at planning time" substitutes for the sentence you pondered
removing.)
(The rest, including the examples, seem a bit self-explanatory given the
definition, though I do get reader inexperience with the terminology. But
proportional implies a linear relationship, and the positive/negative
bifurcation seems straight-forward here.)
I'm thinking everything else below is better incorporated into 14.2 which
should be linking back to this section. That way the crux of the usage is
defined in syntax while the details about setting a specific value are
located in the section covering the overall topic.
the exact count is to be computed by multiplying the estimated
+ rows in the table by the absolute value of the given number. For
example,
a value of -1 implies that all values in the column are distinct,
while
- a value of -0.5 implies that each value appears twice on the average.
+ a value of -0.5 implies that each value appears twice on average.
This can be useful when the size of the table changes over time,
since
the multiplication by the number of rows in the table is not
performed
until query planning time.
(Leave: Specify a value of 0 to revert to estimating...)
That said, this rework would be OK as-is.
Also, looking at stadistinct, the multiplier stored there accounts for the
presence of null. This attribute-option does not. Is that difference
worth noting?
David J.
Hi David,
I think your revision is good and accurate.
On Oct 13, 2025, at 07:42, David Rowley <dgrowleyml@gmail.com> wrote:
Just picking this one up again. I forgot to come back to this after PGConf.dev.
I came up with:
Ordinarily <literal>n_distinct</literal> is used.
<literal>n_distinct_inherited</literal> exists to allow the distinct
estimate to be overwritten for the statistics gathered for inheritance
parent tables and for partitioned tables.
This clarifies that n_distinct_inherited applies to both inheritance parents and partitioned tabled, that’s accurate and better than the original wording.
I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things. That's not true. It's the query
planner that uses this value. ANALYZE just stores whatever this is set
to into pg_statistic. I also adjusted the text that was talking about
"the size of the table", which, as I mentioned earlier isn't correct.
It's all related to the estimated number rows in the table, per
"ntuples = vardata->rel->tuples;" in get_variable_numdistinct().
When I read the diff, I thought “table size” is usually used in PG docs, so maybe “estimated table size”. But as you explicitly explained why you made the change, I am fine with your change.
Also fixed a typo; "twice on the average" shouldn't contain "the".
Correct grammer fix.
I wonder if ", since the multiplication by the number of rows in the
table is not performed until query planning time" should be deleted
since I modified the text earlier to talk about "the query planner”.
Yeah, with your rewrite, that clause now feels a little redundant. I think it can be removed entirely.
The other thing that doesn’t belong to your change but as you are touching here:
“When set to a negative value, which must be greater than or equal to -1"
When I first time read the doc, I was confused. Because no easier sentence indicated “n_distinct” is of float type. I thought “greater than” was a typo. When I read through, the later example (0.5) resolved my confusion. To avoid the same confusion to other readers, maybe change to “when set to a negative value between -1 and 0 (inclusive of -1)” or “when set to a negative value, which must be in the range -1<= value < 0”.
But as I said, this doesn’t belong to your change. If you don’t want to enhance this place, or you don’t consider this is a problem, I am absolutely fine.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Mon, 13 Oct 2025 at 14:28, David G. Johnston
<david.g.johnston@gmail.com> wrote:
How about:
"n_disinct is used for normal tables while n_distinct_inherited is used for partitioned tables. Both are usable (selected via the ONLY modifier) for an inheritance parent table since it has both storage and children."
I specifically was trying my best to avoid using the word "normal"
here as it's very indescriptive. Rhetorical question: would you
expect a setting which applies to "normal" tables to be applied to a
materialized view? Instead, I described the cases where
n_distinct_inherited applies, and left it to the reader to assume that
n_distinct applies in all other cases, which IMO is correct. I don't
see the need to talk about ONLY. I'm only discussing stored statistics
in this part, i.e, what's in pg_statistic.
The use of both "Ordinarily" and "overwritten" is bothering me here. And it implies that n_distinct doesn't work for inheritance parent tables or, conversely, that n_distinct does work for partitioned tables.
I don't know why you think that implies that n_distinct works for
partitioned tables. I've specifically said that "n_distinct_inherited
allows the n_distinct estimate to be overwritten for partitioned
tables". Why would the reader then assume n_distinct is for doing
that?
As for "overwritten", this is exactly what changing this setting does
to the pg_statistic.stadistinct when ANALYZE runs for the table, so as
far as I'm concerned, it's a good and accurate way to describe what
the settings do.
I'm open to finding something better for "Ordinarily
<literal>n_distinct</literal> is used."
I also fixed what I thought was some misleading text about ANALYZE
using this value to calculate things.values in the column is linear with the estimated number
of rows in the table; the exact count is to be computed by multiplying the estimated
rows in the table by the absolute value of the given number."...is proportional to the estimated number of rows in the table at planning time."
(The final "at planning time" substitutes for the sentence you pondered removing.)
I've adjusted the "proportional" part, but the "add planning time" at
the end is surplus because we've already mentioned that this is
happening in "the query planner". I doubt anyone needs the additional
reassurance that things that the query planner does happen at planning
time.
(The rest, including the examples, seem a bit self-explanatory given the definition, though I do get reader inexperience with the terminology. But proportional implies a linear relationship, and the positive/negative bifurcation seems straight-forward here.)
I don't feel the need to remove the examples. I think they're good to
reinforce how the negative values are utilised during planning.
I'm thinking everything else below is better incorporated into 14.2 which should be linking back to this section. That way the crux of the usage is defined in syntax while the details about setting a specific value are located in the section covering the overall topic.
the exact count is to be computed by multiplying the estimated + rows in the table by the absolute value of the given number. For example, a value of -1 implies that all values in the column are distinct, while - a value of -0.5 implies that each value appears twice on the average. + a value of -0.5 implies that each value appears twice on average. This can be useful when the size of the table changes over time, since the multiplication by the number of rows in the table is not performed until query planning time.(Leave: Specify a value of 0 to revert to estimating...)
Do you mean remove that part? I did consider that as I didn't think
the reader needed to know that as they could use RESET (n_distinct);
That said, this rework would be OK as-is.
Also, looking at stadistinct, the multiplier stored there accounts for the presence of null. This attribute-option does not. Is that difference worth noting?
It's already becoming more painful to get this changed than I would
have anticipated. Feel free to suggest something on a new thread if
you think it's worthwhile to mention it.
David
On Mon, 13 Oct 2025 at 14:47, Chao Li <li.evan.chao@gmail.com> wrote:
Hi David,
I think your revision is good and accurate.
On Oct 13, 2025, at 07:42, David Rowley <dgrowleyml@gmail.com> wrote:
I wonder if ", since the multiplication by the number of rows in the
table is not performed until query planning time" should be deleted
since I modified the text earlier to talk about "the query planner”.Yeah, with your rewrite, that clause now feels a little redundant. I think it can be removed entirely.
I've now removed that part.
The other thing that doesn’t belong to your change but as you are touching here:
“When set to a negative value, which must be greater than or equal to -1"
When I first time read the doc, I was confused. Because no easier sentence indicated “n_distinct” is of float type. I thought “greater than” was a typo. When I read through, the later example (0.5) resolved my confusion. To avoid the same confusion to other readers, maybe change to “when set to a negative value between -1 and 0 (inclusive of -1)” or “when set to a negative value, which must be in the range -1<= value < 0”.
I agree that part is a bit clumsy. Starting at this cold again today I
also thought it must mean less than -1, so rewording that seems like a
good idea.
It now reads:
Fractional values may also be specified by using values below 0 and above
or equal to -1. This instructs the query planner to estimate the number
of distinct values by multiplying the absolute value of the specified
number by the estimated number of rows in the table.
Updated patch attached.
David
Attachments:
doc_n_distinct_inherited_v4.patchapplication/octet-stream; name=doc_n_distinct_inherited_v4.patchDownload
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 8867da6c693..e0808afc350 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -376,22 +376,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
<literal>n_distinct_inherited</literal>, which override the
number-of-distinct-values estimates made by subsequent
<link linkend="sql-analyze"><command>ANALYZE</command></link>
- operations. <literal>n_distinct</literal> affects the statistics for the table
- itself, while <literal>n_distinct_inherited</literal> affects the statistics
- gathered for the table plus its inheritance children. When set to a
- positive value, <command>ANALYZE</command> will assume that the column contains
- exactly the specified number of distinct nonnull values. When set to a
- negative value, which must be greater
- than or equal to -1, <command>ANALYZE</command> will assume that the number of
- distinct nonnull values in the column is linear in the size of the
- table; the exact count is to be computed by multiplying the estimated
- table size by the absolute value of the given number. For example,
+ operations. Ordinarily <literal>n_distinct</literal> is used.
+ <literal>n_distinct_inherited</literal> exists to allow the distinct
+ estimate to be overwritten for the statistics gathered for inheritance
+ parent tables and for partitioned tables. When the value specified is a
+ positive value, the query planner will assume that
+ the column contains exactly the specified number of distinct nonnull
+ values. Fractional values may also be specified by using values below 0 and above
+ or equal to -1. This instructs the query planner to estimate the number
+ of distinct values by multiplying the absolute value of the specified
+ number by the estimated number of rows in the table. For example,
a value of -1 implies that all values in the column are distinct, while
- a value of -0.5 implies that each value appears twice on the average.
- This can be useful when the size of the table changes over time, since
- the multiplication by the number of rows in the table is not performed
- until query planning time. Specify a value of 0 to revert to estimating
- the number of distinct values normally. For more information on the use
+ a value of -0.5 implies that each value appears twice on average.
+ This can be useful when the size of the table changes over time.
+ For more information on the use
of statistics by the <productname>PostgreSQL</productname> query
planner, refer to <xref linkend="planner-stats"/>.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 568696333c2..c9f7d4aa1f7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -490,6 +490,16 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_startblock = startBlk;
scan->rs_numblocks = numBlks;
+
+ /* set the limits in the ParallelBlockTableScanDesc, when present as leader */
+ if (scan->rs_base.rs_parallel != NULL && !IsParallelWorker())
+ {
+ ParallelBlockTableScanDesc bpscan;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
}
/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..5ad6963dbbe 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,42 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan,
+ ItemPointer mintid, ItemPointer maxtid)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ /* Set the scan limits based on the given min and max tids */
+ relation->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -398,6 +434,7 @@ table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
bpscan->phs_nblocks > NBuffers / 4;
SpinLockInit(&bpscan->phs_mutex);
bpscan->phs_startblock = InvalidBlockNumber;
+ bpscan->phs_numblock = InvalidBlockNumber;
pg_atomic_init_u64(&bpscan->phs_nallocated, 0);
return sizeof(ParallelBlockTableScanDescData);
@@ -577,8 +614,15 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
+ /*
+ * Check if we've allocated every block in the relation, or if we've
+ * reached the limit imposed by pbscan->phs_numblock (if set).
+ */
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
+ else if (pbscan->phs_numblock != InvalidBlockNumber &&
+ nallocated >= pbscan->phs_numblock)
+ page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..c8c41cac1ea 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -40,6 +40,7 @@
#include "executor/nodeSeqscan.h"
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
+#include "executor/nodeTidrangescan.h"
#include "executor/tqueue.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
@@ -266,6 +267,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
ExecForeignScanEstimate((ForeignScanState *) planstate,
e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendEstimate((AppendState *) planstate,
@@ -493,6 +499,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeDSM((AppendState *) planstate,
@@ -994,6 +1005,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecForeignScanReInitializeDSM((ForeignScanState *) planstate,
pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendReInitializeDSM((AppendState *) planstate, pcxt);
@@ -1362,6 +1378,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate,
+ pwcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeWorker((AppendState *) planstate, pwcxt);
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 1bce8d6cbfe..065bd37d9dd 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -128,14 +128,17 @@ TidExprListCreate(TidRangeScanState *tidrangestate)
* TidRangeEval
*
* Compute and set node's block and offset range to scan by evaluating
- * node->trss_tidexprs. Returns false if we detect the range cannot
- * contain any tuples. Returns true if it's possible for the range to
- * contain tuples. We don't bother validating that trss_mintid is less
- * than or equal to trss_maxtid, as the scan_set_tidrange() table AM
- * function will handle that.
+ * node->trss_tidexprs. Sets node's trss_rangeIsEmpty to true if the
+ * calculated range must be empty of any tuples, otherwise sets
+ * trss_rangeIsEmpty to false and sets trss_mintid and trss_maxtid to the
+ * calculated range.
+ *
+ * We don't bother validating that trss_mintid is less than or equal to
+ * trss_maxtid, as the scan_set_tidrange() table AM function will handle
+ * that.
* ----------------------------------------------------------------
*/
-static bool
+static void
TidRangeEval(TidRangeScanState *node)
{
ExprContext *econtext = node->ss.ps.ps_ExprContext;
@@ -165,7 +168,10 @@ TidRangeEval(TidRangeScanState *node)
/* If the bound is NULL, *nothing* matches the qual. */
if (isNull)
- return false;
+ {
+ node->trss_rangeIsEmpty = true;
+ return;
+ }
if (tidopexpr->exprtype == TIDEXPR_LOWER_BOUND)
{
@@ -207,7 +213,7 @@ TidRangeEval(TidRangeScanState *node)
ItemPointerCopy(&lowerBound, &node->trss_mintid);
ItemPointerCopy(&upperBound, &node->trss_maxtid);
- return true;
+ node->trss_rangeIsEmpty = false;
}
/* ----------------------------------------------------------------
@@ -234,12 +240,19 @@ TidRangeNext(TidRangeScanState *node)
slot = node->ss.ss_ScanTupleSlot;
direction = estate->es_direction;
- if (!node->trss_inScan)
+ /* First time through, compute TID range to scan */
+ if (!node->trss_rangeCalcDone)
{
- /* First time through, compute TID range to scan */
- if (!TidRangeEval(node))
- return NULL;
+ TidRangeEval(node);
+ node->trss_rangeCalcDone = true;
+ }
+ /* Check if the range was detected not to contain any tuples */
+ if (node->trss_rangeIsEmpty)
+ return NULL;
+
+ if (!node->trss_inScan)
+ {
if (scandesc == NULL)
{
scandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
@@ -250,9 +263,13 @@ TidRangeNext(TidRangeScanState *node)
}
else
{
- /* rescan with the updated TID range */
- table_rescan_tidrange(scandesc, &node->trss_mintid,
- &node->trss_maxtid);
+ /* rescan with the updated TID range only in non-parallel mode */
+ if (scandesc->rs_parallel == NULL)
+ {
+ /* rescan with the updated TID range */
+ table_rescan_tidrange(scandesc, &node->trss_mintid,
+ &node->trss_maxtid);
+ }
}
node->trss_inScan = true;
@@ -274,13 +291,18 @@ TidRangeNext(TidRangeScanState *node)
static bool
TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)
{
- if (!TidRangeEval(node))
- return false;
+ /* First call? Compute the TID Range */
+ if (!node->trss_rangeCalcDone)
+ {
+ TidRangeEval(node);
+ node->trss_rangeCalcDone = true;
+ }
Assert(ItemPointerIsValid(&slot->tts_tid));
/* Recheck the ctid is still within range */
- if (ItemPointerCompare(&slot->tts_tid, &node->trss_mintid) < 0 ||
+ if (node->trss_rangeIsEmpty ||
+ ItemPointerCompare(&slot->tts_tid, &node->trss_mintid) < 0 ||
ItemPointerCompare(&slot->tts_tid, &node->trss_maxtid) > 0)
return false;
@@ -322,6 +344,9 @@ ExecReScanTidRangeScan(TidRangeScanState *node)
/* mark scan as not in progress, and tid range list as not computed yet */
node->trss_inScan = false;
+ /* mark that the TID range needs to be recalculated */
+ node->trss_rangeCalcDone = false;
+
/*
* We must wait until TidRangeNext before calling table_rescan_tidrange.
*/
@@ -380,6 +405,10 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
* mark scan as not in progress, and TID range as not computed yet
*/
tidrangestate->trss_inScan = false;
+ tidrangestate->trss_rangeCalcDone = false;
+
+ /* This will be calculated correctly in TidRangeEval() */
+ tidrangestate->trss_rangeIsEmpty = true;
/*
* open the scan relation
@@ -415,3 +444,109 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen =
+ table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+
+ /* Determine the TID Range */
+ TidRangeEval(node);
+ node->trss_rangeCalcDone = true;
+
+ /* Setup the scandesc, unless TidRangeEval found the range is empty */
+ if (!node->trss_rangeIsEmpty)
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
+ pscan,
+ &node->trss_mintid,
+ &node->trss_maxtid);
+ else
+ node->ss.ss_currentScanDesc = NULL;
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+ /* Determine the TID Range */
+ TidRangeEval(node);
+ node->trss_rangeCalcDone = true;
+
+ if (!node->trss_rangeIsEmpty)
+ node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(node->ss.ss_currentScanDesc,
+ &node->trss_mintid,
+ &node->trss_maxtid);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
+ /* Determine the TID Range */
+ TidRangeEval(node);
+ node->trss_rangeCalcDone = true;
+
+ if (!node->trss_rangeIsEmpty)
+ node->ss.ss_currentScanDesc = table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
+ pscan,
+ &node->trss_mintid,
+ &node->trss_maxtid);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 94077e6a006..03ac556c73d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1366,8 +1366,9 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
{
Selectivity selectivity;
double pages;
- Cost startup_cost = 0;
- Cost run_cost = 0;
+ Cost startup_cost;
+ Cost cpu_run_cost;
+ Cost disk_run_cost;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1396,11 +1397,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read. NOTE: it's desirable for
- * TID Range Scans to cost more than the equivalent Sequential Scans,
- * because Seq Scans have some performance advantages such as scan
- * synchronization and parallelizability, and we'd prefer one of them to
- * be picked unless a TID Range Scan really is better.
+ * page is just a normal sequential page read.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
@@ -1417,7 +1414,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost = spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1429,20 +1426,35 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
- startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
+ startup_cost = qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost = cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
- run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
path->startup_cost = startup_cost;
- path->total_cost = startup_cost + run_cost;
+ path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
}
/*
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81c..e9e52e4abe7 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -492,7 +492,8 @@ ec_member_matches_ctid(PlannerInfo *root, RelOptInfo *rel,
* create_tidscan_paths
* Create paths corresponding to direct TID scans of the given rel.
*
- * Candidate paths are added to the rel's pathlist (using add_path).
+ * Candidate paths are added to the rel's pathlist (using add_path) and to
+ * partial_pathlist via add_partial_path, in the case of partial paths.
*/
bool
create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
@@ -553,7 +554,24 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+
+ if (parallel_workers > 0)
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root,
+ rel,
+ tidrangequals,
+ required_outer,
+ parallel_workers));
+ }
}
/*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e8ea78c0c97..4f2a7ddd23f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1173,12 +1173,14 @@ subquery_planner(PlannerGlobal *glob, Query *parse, char *plan_name,
foreach(l, (List *) parse->havingQual)
{
Node *havingclause = (Node *) lfirst(l);
+ Relids having_relids;
if (contain_agg_clause(havingclause) ||
contain_volatile_functions(havingclause) ||
contain_subplans(havingclause) ||
(parse->groupClause && parse->groupingSets &&
- bms_is_member(root->group_rtindex, pull_varnos(root, havingclause))))
+ ((having_relids = pull_varnos(root, havingclause)) == NULL ||
+ bms_is_member(root->group_rtindex, having_relids))))
{
/* keep it in HAVING */
newHaving = lappend(newHaving, havingclause);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index bca51b4067b..1bb613d44b0 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1261,7 +1261,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1270,9 +1271,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..3da43557a13 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,8 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
+ * no limit */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..3c241d10a52 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1130,6 +1130,18 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan,
+ ItemPointer mintid,
+ ItemPointer maxtid);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202ca..2b5465b3ce4 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..c6e3ccff046 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1924,6 +1924,9 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_rangeCalcDone has the TID range been calculated yet?
+ * trss_rangeIsEmpty true if the TID range is certainly empty
+ * trss_pscanlen size of parallel heap scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1933,6 +1936,9 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ bool trss_rangeCalcDone;
+ bool trss_rangeIsEmpty;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index da60383c2aa..090e836cfc9 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e04..bbb1b87fa89 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- Tests for parallel tidrangescans
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
+-- Insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- Ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- Parallel range scans with upper bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- Parallel range scans with lower bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- Parallel range scans with both bounds
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- Parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET max_parallel_workers_per_gather;
+RESET min_parallel_table_scan_size;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb6262..f86493530e2 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- Tests for parallel tidrangescans
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
+
+-- Insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- Ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- Parallel range scans with upper bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+
+-- Parallel range scans with lower bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+
+-- Parallel range scans with both bounds
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+
+-- Parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET max_parallel_workers_per_gather;
+RESET min_parallel_table_scan_size;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
RESET enable_seqscan;
On Oct 13, 2025, at 11:48, David Rowley <dgrowleyml@gmail.com> wrote:
Updated patch attached.
David
<doc_n_distinct_inherited_v4.patch>
The doc change of v4 looks good to me now.
But v4 contains a lot of code changes, are they incidentally included?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sunday, October 12, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
Updated patch attached.
The patch picked up a whole other set of changes as well.
How about:
n_distinct_inherited exists to allow the distinct value estimate to be
overwritten for the children-included statistics gathered for inheritance
parent tables, and for partitioned tables.
I can live with “ordinarily” with that wording, it is just the negative of
“children-included”.
David J.
On Mon, 13 Oct 2025 at 17:09, David G. Johnston
<david.g.johnston@gmail.com> wrote:
How about:
n_distinct_inherited exists to allow the distinct value estimate to be overwritten for the children-included statistics gathered for inheritance parent tables, and for partitioned tables.
I can live with “ordinarily” with that wording, it is just the negative of “children-included”.
I've ended up reverting the text back to become more similar to how it
was but explicitly adding ", and for the statistics gathered for
partitioned tables.", since that's what the original bug report was
confused about.
I've kept the other document changes as they were in the previous patch.
David
Attachments:
doc_n_distinct_inherited_v5.patchapplication/octet-stream; name=doc_n_distinct_inherited_v5.patchDownload
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 8867da6c693..bea9f90138b 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -376,24 +376,22 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
<literal>n_distinct_inherited</literal>, which override the
number-of-distinct-values estimates made by subsequent
<link linkend="sql-analyze"><command>ANALYZE</command></link>
- operations. <literal>n_distinct</literal> affects the statistics for the table
- itself, while <literal>n_distinct_inherited</literal> affects the statistics
- gathered for the table plus its inheritance children. When set to a
- positive value, <command>ANALYZE</command> will assume that the column contains
- exactly the specified number of distinct nonnull values. When set to a
- negative value, which must be greater
- than or equal to -1, <command>ANALYZE</command> will assume that the number of
- distinct nonnull values in the column is linear in the size of the
- table; the exact count is to be computed by multiplying the estimated
- table size by the absolute value of the given number. For example,
- a value of -1 implies that all values in the column are distinct, while
- a value of -0.5 implies that each value appears twice on the average.
- This can be useful when the size of the table changes over time, since
- the multiplication by the number of rows in the table is not performed
- until query planning time. Specify a value of 0 to revert to estimating
- the number of distinct values normally. For more information on the use
- of statistics by the <productname>PostgreSQL</productname> query
- planner, refer to <xref linkend="planner-stats"/>.
+ operations. <literal>n_distinct</literal> affects the statistics for the
+ table itself, while <literal>n_distinct_inherited</literal> affects the
+ statistics gathered for the table plus its inheritance children, and for
+ the statistics gathered for partitioned tables. When the value
+ specified is a positive value, the query planner will assume that the
+ column contains exactly the specified number of distinct nonnull values.
+ Fractional values may also be specified by using values below 0 and
+ above or equal to -1. This instructs the query planner to estimate the
+ number of distinct values by multiplying the absolute value of the
+ specified number by the estimated number of rows in the table. For
+ example, a value of -1 implies that all values in the column are
+ distinct, while a value of -0.5 implies that each value appears twice on
+ average. This can be useful when the size of the table changes over
+ time. For more information on the use of statistics by the
+ <productname>PostgreSQL</productname> query planner, refer to
+ <xref linkend="planner-stats"/>.
</para>
<para>
Changing per-attribute options acquires a
On Monday, October 13, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 13 Oct 2025 at 17:09, David G. Johnston
<david.g.johnston@gmail.com> wrote:How about:
n_distinct_inherited exists to allow the distinct value estimate to be
overwritten for the children-included statistics gathered for inheritance
parent tables, and for partitioned tables.I can live with “ordinarily” with that wording, it is just the negative
of “children-included”.
I've ended up reverting the text back to become more similar to how it
was but explicitly adding ", and for the statistics gathered for
partitioned tables.", since that's what the original bug report was
confused about.I've kept the other document changes as they were in the previous patch.
WFM. Since we allow 0 as an input I’m not convinced we should remove that
sentence, though I concur that using reset is preferred.
David J.
On Tue, 14 Oct 2025 at 00:55, David G. Johnston
<david.g.johnston@gmail.com> wrote:
WFM. Since we allow 0 as an input I’m not convinced we should remove that sentence, though I concur that using reset is preferred.
I've pushed the v5 version. Thanks to both of you for looking.
David