incorrect information in documentation

Started by PG Bug reporting formover 4 years ago6 messagesdocs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/13/row-estimation-examples.html
Description:

Hello, on page
https://www.postgresql.org/docs/current/row-estimation-examples.html - there
is a example:
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)
= (1 - 0) * (1 - 0) / max(10000, 10000)
= 0.0001
in the first string " * min" and in the second " / max"
as I understand it isn't correct.

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: PG Bug reporting form (#1)
Re: incorrect information in documentation

On Mon, Aug 9, 2021 at 8:02 AM PG Doc comments form <noreply@postgresql.org>
wrote:

Hello, on page
https://www.postgresql.org/docs/current/row-estimation-examples.html -
there
is a example:
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)
= (1 - 0) * (1 - 0) / max(10000, 10000)
= 0.0001
in the first string " * min" and in the second " / max"
as I understand it isn't correct.

Division is just multiplication by the reciprocal so while the presentation
here is inconsistent it is correct. Likewise, the larger a number the
smaller its reciprocal, so the change from min to max also works.

David J.

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: David G. Johnston (#2)
Re: incorrect information in documentation

On Mon, Aug 9, 2021 at 9:06 AM David G. Johnston <david.g.johnston@gmail.com>
wrote:

On Mon, Aug 9, 2021 at 8:02 AM PG Doc comments form <
noreply@postgresql.org> wrote:

Hello, on page
https://www.postgresql.org/docs/current/row-estimation-examples.html -
there
is a example:
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)
= (1 - 0) * (1 - 0) / max(10000, 10000)
= 0.0001
in the first string " * min" and in the second " / max"
as I understand it isn't correct.

Division is just multiplication by the reciprocal so while the
presentation here is inconsistent it is correct. Likewise, the larger a
number the smaller its reciprocal, so the change from min to max also
works.

FWIW this used to be presented with the calculation and formula in sync,
but the original had a simple typo in the calculation. When the typo got
fixed back in December of 2007 [1] the author of the patch simplified the
calculation at the same time. I suggest we update the formula line to
match the calculation presented.

David J.

1.
https://github.com/postgres/postgres/commit/f5678e8e07563e34ae4dc832546977d13edcd665

#4Bruce Momjian
bruce@momjian.us
In reply to: David G. Johnston (#3)
Re: incorrect information in documentation

On Mon, Aug 9, 2021 at 09:20:53AM -0700, David G. Johnston wrote:

On Mon, Aug 9, 2021 at 9:06 AM David G. Johnston <david.g.johnston@gmail.com>
wrote:

On Mon, Aug 9, 2021 at 8:02 AM PG Doc comments form <noreply@postgresql.org

wrote:

Hello, on page
https://www.postgresql.org/docs/current/row-estimation-examples.html -
there
is a example:
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
num_distinct1,
1/num_distinct2)
            = (1 - 0) * (1 - 0) / max(10000, 10000)
            = 0.0001
in the first string  " * min" and in the second " / max"
as I understand it isn't correct.

Division is just multiplication by the reciprocal so while the presentation
here is inconsistent it is correct.  Likewise, the larger a number the
smaller its reciprocal, so the change from min to max also works.  

FWIW this used to be presented with the calculation and formula in sync, but
the original had a simple typo in the calculation.  When the typo got fixed
back in December of 2007 [1] the author of the patch simplified the calculation
at the same time.  I suggest we update the formula line to match the
calculation presented.

Nice, can you provide a patch please?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: Bruce Momjian (#4)
Re: incorrect information in documentation

On Mon, Aug 9, 2021 at 11:05 AM Bruce Momjian <bruce@momjian.us> wrote:

selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
num_distinct1,
1/num_distinct2)
= (1 - 0) * (1 - 0) / max(10000, 10000)
= 0.0001

Nice, can you provide a patch please?

Change the line:

selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)

to be:

selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_distinct1,
num_distinct2)

The wording already talks about "divide by max".

Though:

"so we use an algorithm that relies only on the number of distinct values
for both relations together with their null fractions:"

maybe adds a parenthetical note:

"so we use an algorithm that relies only on the number of distinct values
(the row count estimate for the whole table, not the -1 in the column
statistics) for both relations together with their null fractions:"

Just note I haven't tried to absorb that whole page, let alone the
implementation, and am not all that familiar with this part of PostgreSQL.
Its seems right, though, in isolation.

David J.

#6David G. Johnston
david.g.johnston@gmail.com
In reply to: David G. Johnston (#5)
Re: incorrect information in documentation

On Mon, Aug 9, 2021 at 8:40 PM David G. Johnston <david.g.johnston@gmail.com>
wrote:

On Mon, Aug 9, 2021 at 11:05 AM Bruce Momjian <bruce@momjian.us> wrote:

selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
num_distinct1,
1/num_distinct2)
= (1 - 0) * (1 - 0) / max(10000, 10000)
= 0.0001

Nice, can you provide a patch please?

Change the line:

Concretely, as attached and inline.

David J.

commit 73fa486a855d75d74a1a695bb350bfbfe27c7751
Author: David G. Johnston <david.g.johnston@gmail.com>
Date: Tue Apr 12 21:23:53 2022 +0000

doc: make unique non-null join selectivity example match the prose

The description of the computation for the unique, non-null,
join selectivity describes a division by the maximum of two values,
while the example shows a multiplication by their reciprocal. While
equivalent the max phrasing is easier to understand; which seems
more important here than precisely adhering to the formula use
in the code (for which either variant is still an approximation).

While both num_distinct and num_rows are equal for a unique column
both the concept and formula use row count (10,000) and the
field num_distinct has already been set to mean the specific value
present in the pg_stats table (i.e, -1), so use num_rows here.

diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index 78053d7c49..f72bc4b274 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -391,18 +391,20 @@ tablename  | null_frac | n_distinct | most_common_vals
 </programlisting>
    In this case there is no <acronym>MCV</acronym> information for
-   <structfield>unique2</structfield> because all the values appear to be
-   unique, so we use an algorithm that relies only on the number of
-   distinct values for both relations together with their null fractions:
+   <structname>unique2</structname> and all the values appear to be
+   unique (n_distinct = -1), so we use an algorithm that relies on the row
+   count estimates for both relations (num_rows, not shown, but "tenk")
+   together with the column null fractions (zero for both):
 <programlisting>
-selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)
+selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1,
num_rows2)
             = (1 - 0) * (1 - 0) / max(10000, 10000)
             = 0.0001
 </programlisting>
    This is, subtract the null fraction from one for each of the relations,
-   and divide by the maximum of the numbers of distinct values.
+   and divide by the row count of the larger relation (this value does get
+   scaled in the non-unique case).
    The number of rows
    that the join is likely to emit is calculated as the cardinality of the
    Cartesian product of the two inputs, multiplied by the

Attachments:

v0001-doc-make-row-estimation-example-match-prose.patchapplication/octet-stream; name=v0001-doc-make-row-estimation-example-match-prose.patchDownload+7-5