FETCH FIRST clause WITH TIES option
hello ,
The WITH TIES keyword is sql standard that specifies any peers of retained
rows
to retained in the result set too .which means according to ordering key
the result set can includes additional rows which have ties on the last
position, if there are any and It work with ORDER BY query
The attach patch is a finished implementation of it except it did not have
a regression test because am not sure of changing the test data set for it
and I can’t find fetch first clause regression test too.
Regards
Surafel
Attachments:
fetch_first_with_ties_v1.patchtext/x-patch; charset=US-ASCII; name=fetch_first_with_ties_v1.patchDownload+213-22
Hello Surafel,
On 10/26/2018 12:28 PM, Surafel Temesgen wrote:
hello ,
The WITH TIES keyword is sql standard that specifies any peers of
retained rows to retained in the result set too .which means
according to ordering key the result set can includes additional rows
which have ties on the last position, if there are any and It work
with ORDER BY query.
Thanks for the patch. I've looked at it today, and it seems mostly OK,
with a couple of minor issues. Most of it is code formatting and comment
wording issues, so I'm not going to go through them here - see the
attached 0002 patch (0001 is your patch, rebased to current master).
There's one place that I think is incorrect, and that's this bit from
ExecLimit:
}else
/*
* Get next tuple from subplan, if any.
*/
slot = ExecProcNode(outerPlan);
if (TupIsNull(slot))
{
node->lstate = LIMIT_SUBPLANEOF;
return NULL;
}
if (node->limitOption == WITH_TIES)
{
ExecCopySlot(node->last_slot, slot);
}
node->subSlot = slot;
node->position++;
Note that the "else" guards only the very first line, not the whole
block. This seems wrong, i.e. there should be {} around the whole block.
I'm also suggesting to slightly change the create_limit_plan(), to keep
a single make_limit call. IMHO that makes it easier to understand,
although I admit it's fairly subjective.
One question is whether to make this work for LIMIT too, not just for
FETCH FIRST. I'm not sure how difficult would that be (it should be a
grammar-only tweak I guess), or if it's a good idea in general. But I
find it quite confusing that various comments say things like this:
LimitOption limitOption; /* LIMIT with ties or exact number */
while in fact it does not work with LIMIT.
The attach patch is a finished implementation of it except it did not
have a regression test because am not sure of changing the test data set
for it and I can’t find fetch first clause regression test too.
Well, that's not a very good reason not to have tests for this new
improvement. FWIW there are a couple of places in regression tests where
FETCH FIRST is used, see this:
$ grep -i 'fetch first' src/test/regress/sql/*
src/test/regress/sql/hs_standby_allowed.sql:fetch first from hsc;
src/test/regress/sql/tablesample.sql:FETCH FIRST FROM tablesample_cur;
src/test/regress/sql/tablesample.sql:FETCH FIRST FROM tablesample_cur;
src/test/regress/sql/tidscan.sql:FETCH FIRST FROM c;
But those places don't seem like very good match for testing the new
stuff, because those are primarily testing something else. I suggest we
add the new tests into limit.sql.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
"Tomas" == Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On 10/26/2018 12:28 PM, Surafel Temesgen wrote:
hello ,
The WITH TIES keyword is sql standard that specifies any peers of
retained rows to retained in the result set too .which means
according to ordering key the result set can includes additional rows
which have ties on the last position, if there are any and It work
with ORDER BY query.
Tomas> Thanks for the patch. I've looked at it today, and it seems
Tomas> mostly OK, with a couple of minor issues. Most of it is code
Tomas> formatting and comment wording issues, so I'm not going to go
Tomas> through them here - see the attached 0002 patch (0001 is your
Tomas> patch, rebased to current master).
I still think that this is the wrong approach. Implementing WITH TIES
and PERCENT together using an implicit window function call kills two
birds with one very small stone (the only executor change needed would
be teaching LIMIT to be able to stop on a boolean condition), with
maximum reuse of existing facilities.
--
Andrew (irc:RhodiumToad)
On 10/29/2018 04:17 PM, Andrew Gierth wrote:
"Tomas" == Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On 10/26/2018 12:28 PM, Surafel Temesgen wrote:
hello ,
The WITH TIES keyword is sql standard that specifies any peers of
retained rows to retained in the result set too .which means
according to ordering key the result set can includes additional rows
which have ties on the last position, if there are any and It work
with ORDER BY query.Tomas> Thanks for the patch. I've looked at it today, and it seems
Tomas> mostly OK, with a couple of minor issues. Most of it is code
Tomas> formatting and comment wording issues, so I'm not going to go
Tomas> through them here - see the attached 0002 patch (0001 is your
Tomas> patch, rebased to current master).I still think that this is the wrong approach. Implementing WITH TIES
and PERCENT together using an implicit window function call kills two
birds with one very small stone (the only executor change needed would
be teaching LIMIT to be able to stop on a boolean condition), with
maximum reuse of existing facilities.
Hmmm, maybe. How would that work, exactly? Wouldn't that mean extra
overhead (the window functions are hardly free) and limitations? Perhaps
that was discussed in some other thread in the past?
FWIW, I doubt the patch can be much smaller/simpler - a significant part
of the new stuff is in gram.y and node read/out infrastructure, the
changes to LIMIT node are fairly minimal.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
"Tomas" == Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
I still think that this is the wrong approach. Implementing WITH
TIES and PERCENT together using an implicit window function call
kills two birds with one very small stone (the only executor change
needed would be teaching LIMIT to be able to stop on a boolean
condition), with maximum reuse of existing facilities.
Tomas> Hmmm, maybe. How would that work, exactly? Wouldn't that mean
Tomas> extra overhead (the window functions are hardly free) and
Tomas> limitations?
They're not free, but then this case probably shouldn't be treated as a
particularly hot code path.
The basic idea is this: extend the Limit node to be able to stop on a
boolean expression, such that the first false value ends the result.
Then FETCH FIRST N WITH TIES becomes "stop when the expression
rank() over (order by ...) <= N
is no longer true" (where the ... is the existing top level order by)
and FETCH FIRST N PERCENT is the same but with percent_rank (or
cume_dist, I'd have to check the exact semantics)
rank() doesn't need to read ahead in the input. percent_rank of course
does, but it's clearly impossible to implement PERCENT without doing
that.
Also, one could consider extending LIMIT to support arbitrary
expressions, with some syntax like LIMIT WHEN (condition) which would be
the general form.
Tomas> FWIW, I doubt the patch can be much smaller/simpler - a
Tomas> significant part of the new stuff is in gram.y and node read/out
Tomas> infrastructure, the changes to LIMIT node are fairly minimal.
It's not a matter of patch size as such, but reuse of mechanisms rather
than adding new ones. Also doing WITH TIES as a special case in the
Limit node doesn't make adding PERCENT any easier at all, whereas the
above gets it basically for free.
--
Andrew (irc:RhodiumToad)
On 10/29/2018 05:47 PM, Andrew Gierth wrote:
"Tomas" == Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
I still think that this is the wrong approach. Implementing WITH
TIES and PERCENT together using an implicit window function call
kills two birds with one very small stone (the only executor change
needed would be teaching LIMIT to be able to stop on a boolean
condition), with maximum reuse of existing facilities.Tomas> Hmmm, maybe. How would that work, exactly? Wouldn't that mean
Tomas> extra overhead (the window functions are hardly free) and
Tomas> limitations?They're not free, but then this case probably shouldn't be treated as a
particularly hot code path.The basic idea is this: extend the Limit node to be able to stop on a
boolean expression, such that the first false value ends the result.Then FETCH FIRST N WITH TIES becomes "stop when the expression
rank() over (order by ...) <= N
is no longer true" (where the ... is the existing top level order by)and FETCH FIRST N PERCENT is the same but with percent_rank (or
cume_dist, I'd have to check the exact semantics)rank() doesn't need to read ahead in the input. percent_rank of course
does, but it's clearly impossible to implement PERCENT without doing
that.
OK. I was under the impression the window function would need to see all
the input rows, but that seems not to be the case in general.
Also, one could consider extending LIMIT to support arbitrary
expressions, with some syntax like LIMIT WHEN (condition) which would be
the general form.
Hmmm. I can't really recall needing such thing, but of course - that
does not prove it'd not be a good thing.
Tomas> FWIW, I doubt the patch can be much smaller/simpler - a
Tomas> significant part of the new stuff is in gram.y and node read/out
Tomas> infrastructure, the changes to LIMIT node are fairly minimal.It's not a matter of patch size as such, but reuse of mechanisms rather
than adding new ones. Also doing WITH TIES as a special case in the
Limit node doesn't make adding PERCENT any easier at all, whereas the
above gets it basically for free.
Makes sense, I guess. At first I was thinking that this certainly does
not add more new mechanisms than allowing LIMIT to terminate on boolean
expression. But you're right about the WITH PERCENT part - using window
functions would make adding this much simpler.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Oct 29, 2018 at 12:48 PM Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
Then FETCH FIRST N WITH TIES becomes "stop when the expression
rank() over (order by ...) <= N
is no longer true" (where the ... is the existing top level order by)
Wouldn't that be wicked expensive compared to something hard-coded
that does the same thing?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/31/2018 06:17 PM, Robert Haas wrote:
On Mon, Oct 29, 2018 at 12:48 PM Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:Then FETCH FIRST N WITH TIES becomes "stop when the expression
rank() over (order by ...) <= N
is no longer true" (where the ... is the existing top level order by)Wouldn't that be wicked expensive compared to something hard-coded
that does the same thing?
Not sure, but that was one of my concerns too, particularly for the
simple FETCH FIRST N WITH TIES case. But I think Andrew has a point it
would make it much easier to implement the PERCENT case.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
hi,
The attached patch include all the comment given by Tomas and i check sql
standard about LIMIT and this feature
it did not say anything about it but I think its good idea to include it to
LIMIT too and I will add it if we have consensus on it.
regards
surafel
Attachments:
fetch_first_with_ties_v2.patchtext/x-patch; charset=US-ASCII; name=fetch_first_with_ties_v2.patchDownload+276-30
Attach is rebased patch against the current master
regards
Surafel
On Thu, Nov 1, 2018 at 2:28 PM Surafel Temesgen <surafel3000@gmail.com>
wrote:
Show quoted text
hi,
The attached patch include all the comment given by Tomas and i check sql
standard about LIMIT and this featureit did not say anything about it but I think its good idea to include it
to LIMIT too and I will add it if we have consensus on it.regards
surafel
Attachments:
fetch_first_with_ties_v3.patchtext/x-patch; charset=US-ASCII; name=fetch_first_with_ties_v3.patchDownload+279-31
Hi,
On 11/24/18 10:28 AM, Surafel Temesgen wrote:
Attach is rebased patch against the current master
regards
SurafelOn Thu, Nov 1, 2018 at 2:28 PM Surafel Temesgen <surafel3000@gmail.com
<mailto:surafel3000@gmail.com>> wrote:hi,
The attached patch include all the comment given by Tomas and i
check sql standard about LIMIT and this feature
Unfortunately, it seems the "limit" regression tests fail for some
reason - the output mismatches the expected results for some reason. It
seems as if the WITH TIES code affects ordering of the results within
the group. See the attached file.
it did not say anything about it but I think its good idea to
include it to LIMIT too and I will add it if we have consensus on it.
Hmm, I'm not sure that's needed. I don't see an urgent need to do that
in v1 of the patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
regression.diffstext/plain; charset=UTF-8; name=regression.diffsDownload+7-7
On Tue, Jan 1, 2019 at 8:38 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
The attached patch include all the comment given by Tomas and i
check sql standard about LIMIT and this featureUnfortunately, it seems the "limit" regression tests fail for some
reason - the output mismatches the expected results for some reason. It
seems as if the WITH TIES code affects ordering of the results within
the group. See the attached file.
Yes the reason is the order of returned row is not always the same. I
remove other columns from the result set to get constant result
Regards
Surafel
Attachments:
fetch_first_with_ties_v4.patchtext/x-patch; charset=US-ASCII; name=fetch_first_with_ties_v4.patchDownload+279-31
On 1/2/19 11:51 AM, Surafel Temesgen wrote:
On Tue, Jan 1, 2019 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:The attached patch include all the comment given by Tomas and i
check sql standard about LIMIT and this featureUnfortunately, it seems the "limit" regression tests fail for some
reason - the output mismatches the expected results for some reason. It
seems as if the WITH TIES code affects ordering of the results within
the group. See the attached file.Yes the reason is the order of returned row is not always the same. I
remove other columns from the result set to get constant result
Thanks, that seems reasonable.
After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this
patch should tweak estimates in some way. Currently, the cardinality
estimate is the same as for plain LIMIT, using the requested number of
rows. But let's say there are very few large groups - that will
naturally increase the number of rows produced.
As an example, let's say the subplan produces 1M rows, and there are
1000 groups (when split according to the ORDER BY clause). Consider a
query with "FETCH FIRST 10 ROWS WITH TIES". AFAICS the current estimate
will be 10, but in fact we know that it's likely to produce ~1000 rows
(because that's the average group size).
So I think the patch should tweak the estimate to be
limitCount + (avgGroupSize/2);
or perhaps
Max(avgGroupSize, limitCount + (avgGroupSize/2))
The 1/2 is there because we don't know where the group starts.
Of course, using average group size like this is rather crude, but it's
about the best thing we can do. In principle, increasing the cardinality
estimate is the right thing to do.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this
patch should tweak estimates in some way. Currently, the cardinality
estimate is the same as for plain LIMIT, using the requested number of
rows. But let's say there are very few large groups - that will
naturally increase the number of rows produced.As an example, let's say the subplan produces 1M rows, and there are
1000 groups (when split according to the ORDER BY clause).
can we use ORDER BY column raw statistic in limit node reliably? because it
seems to me it can be affected by other operation in a subplan like filter
condition
regards
Surafel
On 1/15/19 11:07 AM, Surafel Temesgen wrote:
On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this
patch should tweak estimates in some way. Currently, the cardinality
estimate is the same as for plain LIMIT, using the requested number of
rows. But let's say there are very few large groups - that will
naturally increase the number of rows produced.As an example, let's say the subplan produces 1M rows, and there are
1000 groups (when split according to the ORDER BY clause).
can we use ORDER BY column raw statistic in limit node reliably?
because it seems to me it can be affected by other operation in a
subplan like filter condition
What do you mean by "raw statistic"? Using stats from the underlying
table is not quite possible, because you might be operating on top of
join or something like that.
IMHO the best thing you can do is call estimate_num_groups() and combine
that with the number of input rows. That shall benefit from ndistinct
coefficients when available, etc. I've been thinking that considering
the unreliability of grouping estimates we should use a multiple of the
average size (because there may be much larger groups), but I think
that's quite unprecipled and I'd much rather try without it first.
But maybe we can do better when there really is a single table to
consider, in which case we might look at MCV lists and estimate the
largest group. That would give us a much better idea of the worst case.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 15, 2019 at 2:51 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
What do you mean by "raw statistic"?
I mean without further calculation to consider other operation
IMHO the best thing you can do is call estimate_num_groups() and combine
that with the number of input rows. That shall benefit from ndistinct
coefficients when available, etc. I've been thinking that considering
the unreliability of grouping estimates we should use a multiple of the
average size (because there may be much larger groups), but I think
that's quite unprecipled and I'd much rather try without it first.
thank you for clarifying.
The attache patch use your suggestion uptread
regards
Surafel
Attachments:
fetch_first_with_ties_v5.patchtext/x-patch; charset=US-ASCII; name=fetch_first_with_ties_v5.patchDownload+293-31
On Wed, Jan 16, 2019 at 11:45:46AM +0300, Surafel Temesgen wrote:
The attached patch use your suggestion uptread
This patch needs a rebase because of conflicts done recently for
pluggable storage, so moved to next CF, waiting on author.
--
Michael
On Mon, Feb 4, 2019 at 8:29 AM Michael Paquier <michael@paquier.xyz> wrote:
This patch needs a rebase because of conflicts done recently for
pluggable storage, so moved to next CF, waiting on author.
--
Thank you . rebased against current master
regards
Surafel
Attachments:
fetch_first_with_ties_v6.patchtext/x-patch; charset=US-ASCII; name=fetch_first_with_ties_v6.patchDownload+293-31
On 2/4/19 2:37 PM, Surafel Temesgen wrote:
On Mon, Feb 4, 2019 at 8:29 AM Michael Paquier <michael@paquier.xyz
<mailto:michael@paquier.xyz>> wrote:This patch needs a rebase because of conflicts done recently for
pluggable storage, so moved to next CF, waiting on author.
--Thank you . rebased against current master
This patch no longer passes testing so marked Waiting on Author.
Regards,
--
-David
david@pgmasters.net
On Mon, Mar 25, 2019 at 11:56 AM David Steele <david@pgmasters.net> wrote:
This patch no longer passes testing so marked Waiting on Author.
Thank you for informing. Fixed