doc: Make selectivity example match wording
Hi,
Reposting this to its own thread.
/messages/by-id/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1=9-GDZ3TSng@mail.gmail.com
doc: make unique non-null join selectivity example match the prose
The description of the computation for the unique, non-null,
join selectivity describes a division by the maximum of two values,
while the example shows a multiplication by their reciprocal. While
equivalent the max phrasing is easier to understand; which seems
more important here than precisely adhering to the formula used
in the code (for which either variant is still an approximation).
While both num_distinct and num_rows are equal for a unique column
both the concept and formula use row count (10,000) and the
field num_distinct has already been set to mean the specific value
present in the pg_stats table (i.e, -1), so use num_rows here.
David J.
Attachments:
0001-doc-Make-selectivity-example-match-text.patchapplication/octet-stream; name=0001-doc-Make-selectivity-example-match-text.patchDownload+7-6
On Thu Jun 9, 2022 at 11:57 AM EDT, David G. Johnston wrote:
Reposting this to its own thread.
/messages/by-id/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1=9-GDZ3TSng@mail.gmail.com
doc: make unique non-null join selectivity example match the prose
The description of the computation for the unique, non-null,
join selectivity describes a division by the maximum of two values,
while the example shows a multiplication by their reciprocal. While
equivalent the max phrasing is easier to understand; which seems
more important here than precisely adhering to the formula used
in the code (for which either variant is still an approximation).While both num_distinct and num_rows are equal for a unique column
both the concept and formula use row count (10,000) and the
field num_distinct has already been set to mean the specific value
present in the pg_stats table (i.e, -1), so use num_rows here.
Pointing out that n_distinct = -1 is helpful but changing "because" to
"and" suggests that the missing MCV info is coincidental or a side
effect. Is there any case in which the stronger "because" wouldn't be
appropriate?
The second parenthetical (num_rows, not shown, but "tenk") took me a
minute to get since the row counts are only apparent on looking somewhat
closely at the other examples in the chapter. num_rows also isn't a
column in pg_stats which the "not shown" could be taken to imply; it's
sourced from somewhere else and only given as num_rows in this example.
How's '(as num_rowsN, 10,000 for both "tenk" example tables)'?
By "this value does get scaled in the non-unique case" do you mean it
relies on n_distinct as in the uncorrected algorithm listing? If so I
think it'd help to specify that.
You didn't take this line on but "This is, subtract the null
fraction..." omits the step of multiplying the complements of the null
fractions together before dividing.
Should n_distinct and num_rows be <structname>d in the text?
On Sat, Jul 2, 2022 at 12:42 PM Dian M Fay <dian.m.fay@gmail.com> wrote:
On Thu Jun 9, 2022 at 11:57 AM EDT, David G. Johnston wrote:
Reposting this to its own thread.
/messages/by-id/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1=9-GDZ3TSng@mail.gmail.com
doc: make unique non-null join selectivity example match the prose
The description of the computation for the unique, non-null,
join selectivity describes a division by the maximum of two values,
while the example shows a multiplication by their reciprocal. While
equivalent the max phrasing is easier to understand; which seems
more important here than precisely adhering to the formula used
in the code (for which either variant is still an approximation).Should n_distinct and num_rows be <structname>d in the text?
Thanks for the review. I generally like everything you said but it made me
realize that I still didn't really understand the intent behind the
formula. I spent way too much time working that out for myself, then
turned what I found useful into this v2 patch.
It may need some semantic markup still but figured I'd see if the idea
makes sense.
I basically rewrote, in a bit different style, the same material into the
code comments, then proceeded to rework the proof that was already present
there.
I did do this in somewhat of a vacuum. I'm not inclined to learn this all
start-to-end though. If the abrupt style change is unwanted so be it. I'm
not really sure how much benefit the proof really provides. The comments
in the docs are probably sufficient for the code as well - just define why
the three pieces of the formula exist and are packaged into a single
multiplier called selectivity as an API choice. I suspect once someone
gets to that comment it is fair to assume some prior knowledge.
Admittedly, I didn't really come into this that way...
David J.
Attachments:
0002-doc-Make-selectivity-example-match-text-rewrite-proo.patchapplication/octet-stream; name=0002-doc-Make-selectivity-example-match-text-rewrite-proo.patchDownload+7-6
On Sat Jul 16, 2022 at 11:23 PM EDT, David G. Johnston wrote:
Thanks for the review. I generally like everything you said but it made me
realize that I still didn't really understand the intent behind the
formula. I spent way too much time working that out for myself, then
turned what I found useful into this v2 patch.It may need some semantic markup still but figured I'd see if the idea
makes sense.I basically rewrote, in a bit different style, the same material into the
code comments, then proceeded to rework the proof that was already present
there.I did do this in somewhat of a vacuum. I'm not inclined to learn this all
start-to-end though. If the abrupt style change is unwanted so be it. I'm
not really sure how much benefit the proof really provides. The comments
in the docs are probably sufficient for the code as well - just define why
the three pieces of the formula exist and are packaged into a single
multiplier called selectivity as an API choice. I suspect once someone
gets to that comment it is fair to assume some prior knowledge.
Admittedly, I didn't really come into this that way...
Fair enough, I only know what I can glean from the comments in
eqjoinsel_inner and friends myself. I do think even this smaller change
is valuable because the current example talks about using an algorithm
based on the number of distinct values immediately after showing
n_distinct == -1, so making it clear that this case uses num_rows
instead is helpful.
"This value does get scaled in the non-unique case" again could be more
specific ("since here all values are unique, otherwise the calculation
uses num_distinct" perhaps?). But past that quibble I'm good.
On Sun, Jul 17, 2022 at 07:07:15PM -0400, Dian M Fay wrote:
On Sat Jul 16, 2022 at 11:23 PM EDT, David G. Johnston wrote:
Thanks for the review. I generally like everything you said but it made me
realize that I still didn't really understand the intent behind the
formula. I spent way too much time working that out for myself, then
turned what I found useful into this v2 patch.It may need some semantic markup still but figured I'd see if the idea
makes sense.I basically rewrote, in a bit different style, the same material into the
code comments, then proceeded to rework the proof that was already present
there.I did do this in somewhat of a vacuum. I'm not inclined to learn this all
start-to-end though. If the abrupt style change is unwanted so be it. I'm
not really sure how much benefit the proof really provides. The comments
in the docs are probably sufficient for the code as well - just define why
the three pieces of the formula exist and are packaged into a single
multiplier called selectivity as an API choice. I suspect once someone
gets to that comment it is fair to assume some prior knowledge.
Admittedly, I didn't really come into this that way...Fair enough, I only know what I can glean from the comments in
eqjoinsel_inner and friends myself. I do think even this smaller change
is valuable because the current example talks about using an algorithm
based on the number of distinct values immediately after showing
n_distinct == -1, so making it clear that this case uses num_rows
instead is helpful."This value does get scaled in the non-unique case" again could be more
specific ("since here all values are unique, otherwise the calculation
uses num_distinct" perhaps?). But past that quibble I'm good.
Patch applied to master.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.