Reusing abbreviated keys during second pass of ordered [set] aggregates

Started by Peter Geogheganalmost 11 years ago41 messageshackers
Jump to latest

Currently, there are certain aggregates that sort some tuples before
making a second pass over their memtuples array (through the tuplesort
"gettuple" interface), calling one or more attribute equality operator
functions as they go. For example, this occurs during the execution of
the following queries:

-- Salary is a numeric column:
SELECT count(DISTINCT salary) AS distinct_compensation_levels FROM employees;
-- Get most common distinct salary in organization:
SELECT mode() WITHIN GROUP (ORDER BY salary) AS
most_common_compensation FROM employees;

Since these examples involve numeric comparisons, the equality
operator calls involved in that second pass are expensive. There is no
equality fast-path for numeric equality as there is for text; I
imagine that in practice most calls to texteq() are resolved based on
raw datum size mismatch, which is dirt cheap (while the alternative, a
memcmp(), is still very cheap). Furthermore, numeric abbreviated keys
have a good chance of capturing the entropy of the original values
more effectively than we might expect with a text attribute. That
means that in the case of numeric, even sorted, adjacent pairs of
abbreviated keys have a pretty good chance of being distinct in the
real world (if their corresponding authoritative values are themselves
distinct).

I think we can avoid this expense much of the time.

Patch, Performance
===============

Attached patch makes several cases like this try and get away with
only using the pre-existing abbreviated keys from the sort (to
determine inequality). On my laptop, it removes 15% - 25% of the run
time of cases similar to the above, with a high cardinality numeric
attribute of uniformly distributed values. As usual with my work on
sorting, multiple concurrent sorts by multiple backends are helped
most; that puts the improvements for realistic cases involving a text
attribute at about 5% (the texteq() memcmp() is cheap, but not free)
with 4 clients using pgbench.

The numeric cases above are now at about the same speed as similar
queries that use a double precision floating point number attribute.
9.5-era testing shows that sort-heavy numeric cases are often about as
fast as "equivalent" double cases, so it seems worthwhile to mostly
eliminate this additional numeric overhead that cases like the above
still incur.

I imagine citext would benefit much more here once it has abbreviation
support (which should be a TODO item).

Omissions
--------------

I haven't attempted to accelerate AGG_SORTED strategy aggregates,
which looked like it would require more invasive changes. It seemed
like more trouble than it was worth, given that crossing group
boundaries is something that won't occur very often with typical
usage, and given the fact that text is naturally pretty fast there
anyway (who groups by a numeric attribute?). Similarly, I did not
teach setop_retrieve_direct() to consider the possibility that a set
operation (like a UNION) could reuse abbreviated keys if it happens to
be fed by a sort node. Finally, I did not accelerate the merging of
spools during B-Tree index builds, even though that actually seems
quite possible -- that case just seemed marginal. We currently have no
test coverage for that case [1]/messages/by-id/CAM3SWZTvEtVDysXeE7Py-8Ar+-+XRAPkJCZe-FH_V+zWxmuS9A@mail.gmail.com -- Peter Geoghegan, so I guess its performance can't be
too important.

SortSupport contract
================

As explained in the first patch's commit message, I revised the
contract that SortSupport has with opclasses. Should this proposal be
accepted, we should backpatch the first patch to 9.5, to prevent
anyone from building abbreviation support that won't work on 9.6 (even
if that is very unlikely). In general, it seems pretty silly to me to
not make use of the abbreviated keys that the sort already went to the
trouble of building, and it seems like the revision to the SortSupport
contract is not likely to notably limit any interesting cases in the
future, so I think that this will be uncontroversial.

[1]: /messages/by-id/CAM3SWZTvEtVDysXeE7Py-8Ar+-+XRAPkJCZe-FH_V+zWxmuS9A@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchtext/x-patch; charset=US-ASCII; name=0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchDownload+77-27
0001-Tighten-SortSupport-abbreviated-key-contract.patchtext/x-patch; charset=US-ASCII; name=0001-Tighten-SortSupport-abbreviated-key-contract.patchDownload+13-2
#2Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Sat, Jul 11, 2015 at 10:05 AM, Peter Geoghegan <pg@heroku.com> wrote:

Currently, there are certain aggregates that sort some tuples before
making a second pass over their memtuples array (through the tuplesort
"gettuple" interface), calling one or more attribute equality operator
functions as they go. For example, this occurs during the execution of
the following queries:

-- Salary is a numeric column:
SELECT count(DISTINCT salary) AS distinct_compensation_levels FROM
employees;
-- Get most common distinct salary in organization:
SELECT mode() WITHIN GROUP (ORDER BY salary) AS
most_common_compensation FROM employees;

Since these examples involve numeric comparisons, the equality
operator calls involved in that second pass are expensive. There is no
equality fast-path for numeric equality as there is for text; I
imagine that in practice most calls to texteq() are resolved based on
raw datum size mismatch, which is dirt cheap (while the alternative, a
memcmp(), is still very cheap).

Forgive me for going off on a tangent here: I understand that the point of
your patch is to avoid calls to numeric_eq altogether in these scenarios
when cardinality is high, but the above sentence made me wonder how likely
it is that numeric values in typical workloads have the same size, weight
and sign and could therefore be handled in numeric_eq with memcmp() == 0
rather than numeric_cmp() == 0, and whether there would be any benefit.

Running various contrived aggregate queries on a large low cardinality
dataset in a small range (therefore frequently the same weight & size), I
managed to measure a small improvement of up to a few percent with the
attached patch. I also wonder whether numeric_cmp could be profitably
implemented with memcmp on big endian systems and some unrolled loops on
little endian systems when size & weight match.

Of course there are a ton of other overheads involved with numeric. I
wonder how crazy or difficult it would be to make it so that we could
optionally put a pass-by-value NUMERIC in a Datum, setting a low order tag
bit to say 'I'm an immediate value, not a pointer', and then packing 3
digits (= 12 significant figures) + sign + weight into the other 63 bits.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

numeric-eq-memcmp.patchapplication/octet-stream; name=numeric-eq-memcmp.patchDownload+7-1
#3Jeff Janes
jeff.janes@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

This needs a rebase, there are several conflicts in src/backend/executor/nodeAgg.c

Thanks,

Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Thomas Munro (#2)
Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Sun, Sep 6, 2015 at 9:35 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Running various contrived aggregate queries on a large low cardinality
dataset in a small range (therefore frequently the same weight & size), I
managed to measure a small improvement of up to a few percent with the
attached patch. I also wonder whether numeric_cmp could be profitably
implemented with memcmp on big endian systems and some unrolled loops on
little endian systems when size & weight match.

I think that setting up numeric SortSupport to have scratch memory
used across authoritative numeric comparator calls would also help.

We should prefer to do this kind of thing in a datatype independent
way, of course. I'm not opposed to doing what you outline too, but I
don't think it will be especially helpful for the cases here. I think
that what you're talking about would live in the SortSupport
authoritative comparator, and would benefit non-leading-attribute
comparisons most.

Of course there are a ton of other overheads involved with numeric. I
wonder how crazy or difficult it would be to make it so that we could
optionally put a pass-by-value NUMERIC in a Datum, setting a low order tag
bit to say 'I'm an immediate value, not a pointer', and then packing 3
digits (= 12 significant figures) + sign + weight into the other 63 bits.

That seems possible, but very invasive. I'd want to get a good sense
of the pay-off before undertaking such a project.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Jeff Janes (#3)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Fri, Sep 25, 2015 at 2:39 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

This needs a rebase, there are several conflicts in src/backend/executor/nodeAgg.c

I attached a revised version of the second patch in the series, fixing
this bitrot.

I also noticed a bug in tuplesort's TSS_SORTEDONTAPE case with the
previous patch, where no final on-the-fly merge step is required (no
merge step is required whatsoever, because replacement selection
managed to produce only one run). The function mergeruns() previously
only "abandoned" abbreviated ahead of any merge step iff there was
one. When there was only one run (not requiring a merge) it happened
to continue to keep its state consistent with abbreviated keys still
being in use. It didn't matter before, because abbreviated keys were
only for tuplesort to compare, but that's different now.

That bug is fixed in this revision by reordering things within
mergeruns(). The previous order of the two things that were switched
is not at all significant (I should know, I wrote that code).

Thanks
--
Peter Geoghegan

Attachments:

0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchtext/x-patch; charset=US-ASCII; name=0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchDownload+93-43
#6Andreas Karlsson
andreas.karlsson@percona.com
In reply to: Peter Geoghegan (#5)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

Hi,

I have reviewed this patch and it still applies to master, compiles and
passes the test suite.

I like the goal of the patch, making use of the already existing
abbreviation machinery in more cases is something we should do and the
patch itself looks clean.

I can also confirm the roughly 25% speedup in the best case (numerics
which are all distinct) with no measurable slowdown in the worst case.

Given this speedup and the small size of the patch I think we should
apply it. I will set this patch to "Ready for Commiter".

Andreas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Andreas Karlsson (#6)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Sun, Dec 6, 2015 at 7:14 PM, Andreas Karlsson <andreas@proxel.se> wrote:

I can also confirm the roughly 25% speedup in the best case (numerics which
are all distinct) with no measurable slowdown in the worst case.

Given this speedup and the small size of the patch I think we should apply
it. I will set this patch to "Ready for Commiter".

Thanks for the review, Andreas.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Sat, Oct 10, 2015 at 9:03 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Sep 25, 2015 at 2:39 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

This needs a rebase, there are several conflicts in src/backend/executor/nodeAgg.c

I attached a revised version of the second patch in the series, fixing
this bitrot.

I also noticed a bug in tuplesort's TSS_SORTEDONTAPE case with the
previous patch, where no final on-the-fly merge step is required (no
merge step is required whatsoever, because replacement selection
managed to produce only one run). The function mergeruns() previously
only "abandoned" abbreviated ahead of any merge step iff there was
one. When there was only one run (not requiring a merge) it happened
to continue to keep its state consistent with abbreviated keys still
being in use. It didn't matter before, because abbreviated keys were
only for tuplesort to compare, but that's different now.

That bug is fixed in this revision by reordering things within
mergeruns(). The previous order of the two things that were switched
is not at all significant (I should know, I wrote that code).

I find the references to a "void" representation in this patch to be
completely opaque. I see that there are some such references in
tuplesort.c already, and most likely they were put there by commits
that I did, so I guess I have nobody but myself to blame, but I don't
know what this means, and I don't think we should let this terminology
proliferate.

My understanding is that the "void" representation is intended to
whatever Datum we originally got, which might be a pointer. Why not
just say that instead of referring to it this way?

My understanding is also that it's OK if the abbreviated key stays the
same even though the value has changed, but that the reverse could
cause queries to return wrong answers. The first part of that
justifies why this is safe when no abbreviation is available: we'll
return an abbreviated value of 0 for everything, but that's fine.
However, using the original Datum (which might be a pointer) seems
unsafe, because two binary-identical values could be stored at
different addresses and thus have different pointer representations.

I'm probably missing something here, so clue me in...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Robert Haas (#8)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 9, 2015 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I find the references to a "void" representation in this patch to be
completely opaque. I see that there are some such references in
tuplesort.c already, and most likely they were put there by commits
that I did, so I guess I have nobody but myself to blame, but I don't
know what this means, and I don't think we should let this terminology
proliferate.

My understanding is that the "void" representation is intended to
whatever Datum we originally got, which might be a pointer. Why not
just say that instead of referring to it this way?

That isn't what is intended. "void" is the state that macros like
index_getattr() leave NULL leading attributes (that go in the
SortTuple.datum1 field) in. However, the function tuplesort_putdatum()
requires callers to initialize their Datum to 0 now, which is new. A
"void" representation is a would-be NULL pointer in the case of
pass-by-value types, and a NULL pointer for pass-by-reference types.

My understanding is also that it's OK if the abbreviated key stays the
same even though the value has changed, but that the reverse could
cause queries to return wrong answers. The first part of that
justifies why this is safe when no abbreviation is available: we'll
return an abbreviated value of 0 for everything, but that's fine.
However, using the original Datum (which might be a pointer) seems
unsafe, because two binary-identical values could be stored at
different addresses and thus have different pointer representations.

I'm probably missing something here, so clue me in...

I think that you're missing that patch 0001 formally forbids
abbreviated keys that are pass-by-value, by revising the contract
(this is proposed for backpatch to 9.5 -- only comments are changed).
This is already something that is all but forbidden, although the
datum case does tacitly acknowledge the possibility by not allowing
abbreviation to work with the pass-by-value-and-yet-abbreviated case.

I think that this revision is also useful for putting abbreviated keys
in indexes, something that may happen yet.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Peter Geoghegan (#9)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 9, 2015 at 2:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

I think that you're missing that patch 0001 formally forbids
abbreviated keys that are pass-by-value

Sorry. I do of course mean it forbids abbreviated keys that are *not*
pass-by-value (that are pass-by-reference).

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Peter Geoghegan (#9)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 9, 2015 at 2:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

I think that you're missing that patch 0001 formally forbids
abbreviated keys that are pass-by-value, by revising the contract
(this is proposed for backpatch to 9.5 -- only comments are changed).
This is already something that is all but forbidden, although the
datum case does tacitly acknowledge the possibility by not allowing
abbreviation to work with the pass-by-value-and-yet-abbreviated case.

I think that this revision is also useful for putting abbreviated keys
in indexes, something that may happen yet.

I'm also depending on this for the "quicksort for every sort run" patch, BTW:

+           /*
+            * Kludge:  Trigger abbreviated tie-breaker if in-memory tuples
+            * use abbreviation (writing tuples to tape never preserves
+            * abbreviated keys).  Do this by assigning in-memory
+            * abbreviated tuple to tape tuple directly.
+            *
+            * It doesn't seem worth generating a new abbreviated key for
+            * the tape tuple, and this approach is simpler than
+            * "unabbreviating" the memtuple tuple from a "common" routine
+            * like this.
+            */
+           if (state->sortKeys != NULL &&
state->sortKeys->abbrev_converter != NULL)
+               stup->datum1 = state->memtuples[state->current].datum1;

I could, as an alternative approach, revise tuplesort so that
self-comparison works (something we currently assert against [1]Commit c5a03256c -- Peter Geoghegan),
something that would probably *also* require and update to the
sortsupport.h contract, but this seemed simpler and more general.

In general, I think that there are plenty of reasons to forbid
pass-by-reference abbreviated keys (where the abbreviated comparator
itself is a pointer, something much more complicated than an integer
3-way comparison or similar).

[1]: Commit c5a03256c -- Peter Geoghegan
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#9)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 9, 2015 at 5:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Dec 9, 2015 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I find the references to a "void" representation in this patch to be
completely opaque. I see that there are some such references in
tuplesort.c already, and most likely they were put there by commits
that I did, so I guess I have nobody but myself to blame, but I don't
know what this means, and I don't think we should let this terminology
proliferate.

My understanding is that the "void" representation is intended to
whatever Datum we originally got, which might be a pointer. Why not
just say that instead of referring to it this way?

That isn't what is intended. "void" is the state that macros like
index_getattr() leave NULL leading attributes (that go in the
SortTuple.datum1 field) in.

What kind of state is that? Can't we define this in terms of what it
is rather than how it gets that way?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Robert Haas (#12)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 16, 2015 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:

That isn't what is intended. "void" is the state that macros like
index_getattr() leave NULL leading attributes (that go in the
SortTuple.datum1 field) in.

What kind of state is that? Can't we define this in terms of what it
is rather than how it gets that way?

It's zeroed.

I guess we can update everything, including existing comments, to reflect that.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Peter Geoghegan (#13)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 16, 2015 at 12:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

What kind of state is that? Can't we define this in terms of what it
is rather than how it gets that way?

It's zeroed.

I guess we can update everything, including existing comments, to reflect that.

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

Changes:

* Changes commit intended for backpatch to 9.5 to use new "zeroed"
terminology (not "void" representation).

* Puts fix within mergerun() (for this patch) in the same backpatch
commit. We might as well keep this consistent, even though the
correctness of 9.5 does not actually hinge upon having this change --
there is zero risk of breaking something in 9.5, and avoiding
backpatching this part can only lead to confusion in the future.

* Some minor tweaks to tuplesort.c. Make sure that
tuplesort_putdatum() zeroes datum1 directly for NULL values. Comment
tweaks.

* Add comment to the datum case, discussing restriction there.

I was wrong when I said that the datum sort case currently tacitly
acknowledges as possible/useful something that the revised SortSupport
contract will forbid (I wrote that e-mail in haste).

What I propose to make the SortSupport contract forbid here is
abbreviated keys that are *themselves* pass-by-reference. It is true
that today, we do not support pass-by-value types with abbreviated
keys for the datum case only (such opclass support could be slightly
useful for float8, for example -- int8 comparisons of an alternative
representation might be decisively cheaper). But that existing
restriction is entirely distinct to the new restriction I propose to
create. It seems like this distinction should be pointed out in
passing, which I've done.

--
Peter Geoghegan

Attachments:

0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchtext/x-patch; charset=US-ASCII; name=0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchDownload+79-27
0001-Tighten-SortSupport-abbreviated-key-contract.patchtext/x-patch; charset=US-ASCII; name=0001-Tighten-SortSupport-abbreviated-key-contract.patchDownload+46-23
#15Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#14)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Thu, Dec 17, 2015 at 11:43 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Dec 16, 2015 at 12:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

What kind of state is that? Can't we define this in terms of what it
is rather than how it gets that way?

It's zeroed.

I guess we can update everything, including existing comments, to reflect that.

Thanks, this looks much easier to understand from here.

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

-       /* abbreviation is possible here only for by-reference types */
+       /*
+        * Abbreviation is possible here only for by-reference types.
Note that a
+        * pass-by-value representation for abbreviated values is forbidden, but
+        * that's a distinct, generic restriction imposed by the SortSupport
+        * contract.

I think that you have not written what you meant to write here. I
think what you mean is "Note that a pass-by-REFERENCE representation
for abbreviated values is forbidden...".

+       /*
+        * If we produced only one initial run (quite likely if the total data
+        * volume is between 1X and 2X workMem), we can just use that
tape as the
+        * finished output, rather than doing a useless merge.  (This obvious
+        * optimization is not in Knuth's algorithm.)
+        */
+       if (state->currentRun == 1)
+       {
+               state->result_tape = state->tp_tapenum[state->destTape];
+               /* must freeze and rewind the finished output tape */
+               LogicalTapeFreeze(state->tapeset, state->result_tape);
+               state->status = TSS_SORTEDONTAPE;
+               return;
+       }

I don't understand the point of moving this code. If there's some
reason to do this after rewinding the tapes rather than beforehand, I
think we should articulate that reason in the comment block.

The last hunk in your 0001 patch properly belongs in 0002.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Robert Haas (#15)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Fri, Dec 18, 2015 at 9:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

-       /* abbreviation is possible here only for by-reference types */
+       /*
+        * Abbreviation is possible here only for by-reference types.
Note that a
+        * pass-by-value representation for abbreviated values is forbidden, but
+        * that's a distinct, generic restriction imposed by the SortSupport
+        * contract.

I think that you have not written what you meant to write here. I
think what you mean is "Note that a pass-by-REFERENCE representation
for abbreviated values is forbidden...".

You're right. Sorry about that.

+       /*
+        * If we produced only one initial run (quite likely if the total data
+        * volume is between 1X and 2X workMem), we can just use that
tape as the
+        * finished output, rather than doing a useless merge.  (This obvious
+        * optimization is not in Knuth's algorithm.)
+        */
+       if (state->currentRun == 1)
+       {
+               state->result_tape = state->tp_tapenum[state->destTape];
+               /* must freeze and rewind the finished output tape */
+               LogicalTapeFreeze(state->tapeset, state->result_tape);
+               state->status = TSS_SORTEDONTAPE;
+               return;
+       }

I don't understand the point of moving this code. If there's some
reason to do this after rewinding the tapes rather than beforehand, I
think we should articulate that reason in the comment block.

I thought that was made clear by the 0001 commit message. Think of
what happens when we don't disable abbreviated in the final
TSS_SORTEDONTAPE phase should the "if (state->currentRun == 1)" path
have been taken (but *not* the path that also ends in
TSS_SORTEDONTAPE, when caller requires randomAccess but we spill to
tape, or any other case). What happens is: The code in 0002 gets
confused, and attempts to pass back a pointer value as an "abbreviated
key". That's a bug.

The last hunk in your 0001 patch properly belongs in 0002.

You could certainly argue that the last hunk of 0001 belongs in 0002.
I only moved it to 0001 when I realized that we might as well keep the
branches in sync, since the ordering is insignificant from a 9.5
perspective (although it might still be tidier), and there is a need
to backpatch anyway. I'm not insisting on doing it that way.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#16)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Fri, Dec 18, 2015 at 2:22 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Dec 18, 2015 at 9:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

-       /* abbreviation is possible here only for by-reference types */
+       /*
+        * Abbreviation is possible here only for by-reference types.
Note that a
+        * pass-by-value representation for abbreviated values is forbidden, but
+        * that's a distinct, generic restriction imposed by the SortSupport
+        * contract.

I think that you have not written what you meant to write here. I
think what you mean is "Note that a pass-by-REFERENCE representation
for abbreviated values is forbidden...".

You're right. Sorry about that.

PFA my proposal for comment changes for 9.5 and master. This is based
on your 0001, but I edited somewhat. Please let me know your
thoughts. I am not willing to go further and rearrange actual code in
9.5 at this point; it just isn't necessary.

Looking at this whole system again, I wonder if we're missing a trick
here. How about if we decree from on high that the abbreviated-key
comparator is always just the application of an integer comparison
operator? The only abbreviation function that we have right now that
wouldn't be happy with that is the one for numeric, and that looks
like it could be fixed. Then we could hard-code knowledge of this
representation in tuplesort.c in such a way that it wouldn't need to
call a comparator function at all - instead of doing
ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
it would do something like:

if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
return compare;

I'm not sure if that would save enough cycles vs. the status quo to be
worth bothering with, but it seems like it might. You may have a
better feeling for that than I do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

sortsupport-comment-fixes-rmh.patchapplication/x-patch; name=sortsupport-comment-fixes-rmh.patchDownload+31-23
In reply to: Robert Haas (#17)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Mon, Dec 21, 2015 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

PFA my proposal for comment changes for 9.5 and master. This is based
on your 0001, but I edited somewhat. Please let me know your
thoughts. I am not willing to go further and rearrange actual code in
9.5 at this point; it just isn't necessary.

Fine by me. But this revision hasn't made the important point at all
-- which is that 0002 is safe. That's a stronger guarantee than the
abbreviated key representation being pass-by-value.

Looking at this whole system again, I wonder if we're missing a trick
here. How about if we decree from on high that the abbreviated-key
comparator is always just the application of an integer comparison
operator? The only abbreviation function that we have right now that
wouldn't be happy with that is the one for numeric, and that looks
like it could be fixed.

I gather you mean that we'd decree that they were always compared
using a text or uuid style 3-way unsigned integer comparison, allowing
us to hardcode that.

Then we could hard-code knowledge of this
representation in tuplesort.c in such a way that it wouldn't need to
call a comparator function at all - instead of doing
ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
it would do something like:

if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
return compare;

I'm not sure if that would save enough cycles vs. the status quo to be
worth bothering with, but it seems like it might. You may have a
better feeling for that than I do.

I like this idea. I thought of it myself already, actually.

It has some nice properties, because unsigned integers are so simple
and flexible. For example, we can do it with int4 sometimes, and then
compare two attributes at once on 64-bit platforms. Maybe the second
attribute (the second set of 4 bytes) isn't an int4, though -- it's
any other type's abbreviated key (which can be assumed to have a
compatible representation). That's one more advanced possibility.

It seems worthwhile because numeric is the odd one out. Once I or
someone else gets around to adding network types, which could happen
in 9.6, then we are done (because there is already a pending patch
that adds support for remaining character-like types and alternative
opclasses). I really don't foresee much additional use of abbreviated
keys beyond these cases. There'd be a small but nice boost by not
doing pointer chasing for the abbreviated key comparison, I imagine,
but that benefit would be felt everywhere. Numeric is the odd one out
currently, but as you say it can be fixed, and I foresee no other
trouble from any other opclass/type that cares about abbreviation.

Another issue is that abbreviated keys in indexes are probably not
going to take 8 bytes, because they'll go in the ItemId array. It's
likely to be very useful to be able to take the first two bytes, and
know that a uint16 comparison is all that is needed, and have a basis
to perform an interpolation search rather than a binary search
(although that needs to be adaptive to different distributions, I
think it could be an effective technique -- there are many cache
misses for binary searches on internal B-Tree pages).

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#18)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Mon, Dec 21, 2015 at 2:55 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Dec 21, 2015 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

PFA my proposal for comment changes for 9.5 and master. This is based
on your 0001, but I edited somewhat. Please let me know your
thoughts. I am not willing to go further and rearrange actual code in
9.5 at this point; it just isn't necessary.

Fine by me. But this revision hasn't made the important point at all
-- which is that 0002 is safe. That's a stronger guarantee than the
abbreviated key representation being pass-by-value.

Right. I don't think that we should back-patch that stuff into 9.5.

Looking at this whole system again, I wonder if we're missing a trick
here. How about if we decree from on high that the abbreviated-key
comparator is always just the application of an integer comparison
operator? The only abbreviation function that we have right now that
wouldn't be happy with that is the one for numeric, and that looks
like it could be fixed.

I gather you mean that we'd decree that they were always compared
using a text or uuid style 3-way unsigned integer comparison, allowing
us to hardcode that.

Right.

Then we could hard-code knowledge of this
representation in tuplesort.c in such a way that it wouldn't need to
call a comparator function at all - instead of doing
ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
it would do something like:

if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
return compare;

I'm not sure if that would save enough cycles vs. the status quo to be
worth bothering with, but it seems like it might. You may have a
better feeling for that than I do.

I like this idea. I thought of it myself already, actually.

It has some nice properties, because unsigned integers are so simple
and flexible. For example, we can do it with int4 sometimes, and then
compare two attributes at once on 64-bit platforms. Maybe the second
attribute (the second set of 4 bytes) isn't an int4, though -- it's
any other type's abbreviated key (which can be assumed to have a
compatible representation). That's one more advanced possibility.

Yikes.

It seems worthwhile because numeric is the odd one out. Once I or
someone else gets around to adding network types, which could happen
in 9.6, then we are done (because there is already a pending patch
that adds support for remaining character-like types and alternative
opclasses). I really don't foresee much additional use of abbreviated
keys beyond these cases. There'd be a small but nice boost by not
doing pointer chasing for the abbreviated key comparison, I imagine,
but that benefit would be felt everywhere. Numeric is the odd one out
currently, but as you say it can be fixed, and I foresee no other
trouble from any other opclass/type that cares about abbreviation.

Another issue is that abbreviated keys in indexes are probably not
going to take 8 bytes, because they'll go in the ItemId array. It's
likely to be very useful to be able to take the first two bytes, and
know that a uint16 comparison is all that is needed, and have a basis
to perform an interpolation search rather than a binary search
(although that needs to be adaptive to different distributions, I
think it could be an effective technique -- there are many cache
misses for binary searches on internal B-Tree pages).

Hmm, that seems iffy to me. There are plenty of data sets where 8
bytes is enough to get some meaningful information about what part of
the keyspace you're in, but 2 bytes isn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Robert Haas (#19)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Tue, Dec 22, 2015 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

It has some nice properties, because unsigned integers are so simple
and flexible. For example, we can do it with int4 sometimes, and then
compare two attributes at once on 64-bit platforms. Maybe the second
attribute (the second set of 4 bytes) isn't an int4, though -- it's
any other type's abbreviated key (which can be assumed to have a
compatible representation). That's one more advanced possibility.

Yikes.

I'm not suggesting we'd want to do that immediately, or even at all.
My point is -- stuff like this becomes possible. My intuition is that
it might come up in a bunch of places.

Another issue is that abbreviated keys in indexes are probably not
going to take 8 bytes, because they'll go in the ItemId array. It's
likely to be very useful to be able to take the first two bytes, and
know that a uint16 comparison is all that is needed, and have a basis
to perform an interpolation search rather than a binary search
(although that needs to be adaptive to different distributions, I
think it could be an effective technique -- there are many cache
misses for binary searches on internal B-Tree pages).

Hmm, that seems iffy to me. There are plenty of data sets where 8
bytes is enough to get some meaningful information about what part of
the keyspace you're in, but 2 bytes isn't.

Your experience with abbreviated keys for sorting is unlikely to
perfectly carry over here. That's because the 2 bytes are only put in
the internal pages (typically less than 1% of the total). These
internal pages typically have relatively low density anyway (we don't
use the The B* tree technique), so I think that the level of bloat
would be tiny. At the same time, these are the range of values that
naturally have very wide fan-out. For example, the entire range of
values in the index is represented at the root page.

All of that having been said, yes, it could still be useless. But
then, the overhead will still tend to be nill, due to the fact that
abbreviated key generation can be so effectively amortized (it happens
only once, per page split), and because branch prediction will tend to
remove any penalty.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#19)
In reply to: Robert Haas (#21)
In reply to: Robert Haas (#21)
In reply to: Peter Geoghegan (#20)
In reply to: Peter Geoghegan (#23)
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#25)
In reply to: Alvaro Herrera (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#26)
#29Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#26)
#30Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Michael Paquier (#29)
#31Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jim Nasby (#30)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#23)
In reply to: Robert Haas (#32)
#34Magnus Hagander
magnus@hagander.net
In reply to: Pavel Stehule (#31)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Magnus Hagander (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#35)
In reply to: Tom Lane (#36)
#38Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#36)
#39Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#38)
#40Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#39)
In reply to: Magnus Hagander (#40)