Reusing abbreviated keys during second pass of ordered [set] aggregates

Started by Peter Geogheganover 10 years ago41 messages
#1Peter Geoghegan
pg@heroku.com
2 attachment(s)

Currently, there are certain aggregates that sort some tuples before
making a second pass over their memtuples array (through the tuplesort
"gettuple" interface), calling one or more attribute equality operator
functions as they go. For example, this occurs during the execution of
the following queries:

-- Salary is a numeric column:
SELECT count(DISTINCT salary) AS distinct_compensation_levels FROM employees;
-- Get most common distinct salary in organization:
SELECT mode() WITHIN GROUP (ORDER BY salary) AS
most_common_compensation FROM employees;

Since these examples involve numeric comparisons, the equality
operator calls involved in that second pass are expensive. There is no
equality fast-path for numeric equality as there is for text; I
imagine that in practice most calls to texteq() are resolved based on
raw datum size mismatch, which is dirt cheap (while the alternative, a
memcmp(), is still very cheap). Furthermore, numeric abbreviated keys
have a good chance of capturing the entropy of the original values
more effectively than we might expect with a text attribute. That
means that in the case of numeric, even sorted, adjacent pairs of
abbreviated keys have a pretty good chance of being distinct in the
real world (if their corresponding authoritative values are themselves
distinct).

I think we can avoid this expense much of the time.

Patch, Performance
===============

Attached patch makes several cases like this try and get away with
only using the pre-existing abbreviated keys from the sort (to
determine inequality). On my laptop, it removes 15% - 25% of the run
time of cases similar to the above, with a high cardinality numeric
attribute of uniformly distributed values. As usual with my work on
sorting, multiple concurrent sorts by multiple backends are helped
most; that puts the improvements for realistic cases involving a text
attribute at about 5% (the texteq() memcmp() is cheap, but not free)
with 4 clients using pgbench.

The numeric cases above are now at about the same speed as similar
queries that use a double precision floating point number attribute.
9.5-era testing shows that sort-heavy numeric cases are often about as
fast as "equivalent" double cases, so it seems worthwhile to mostly
eliminate this additional numeric overhead that cases like the above
still incur.

I imagine citext would benefit much more here once it has abbreviation
support (which should be a TODO item).

Omissions
--------------

I haven't attempted to accelerate AGG_SORTED strategy aggregates,
which looked like it would require more invasive changes. It seemed
like more trouble than it was worth, given that crossing group
boundaries is something that won't occur very often with typical
usage, and given the fact that text is naturally pretty fast there
anyway (who groups by a numeric attribute?). Similarly, I did not
teach setop_retrieve_direct() to consider the possibility that a set
operation (like a UNION) could reuse abbreviated keys if it happens to
be fed by a sort node. Finally, I did not accelerate the merging of
spools during B-Tree index builds, even though that actually seems
quite possible -- that case just seemed marginal. We currently have no
test coverage for that case [1]/messages/by-id/CAM3SWZTvEtVDysXeE7Py-8Ar+-+XRAPkJCZe-FH_V+zWxmuS9A@mail.gmail.com -- Peter Geoghegan, so I guess its performance can't be
too important.

SortSupport contract
================

As explained in the first patch's commit message, I revised the
contract that SortSupport has with opclasses. Should this proposal be
accepted, we should backpatch the first patch to 9.5, to prevent
anyone from building abbreviation support that won't work on 9.6 (even
if that is very unlikely). In general, it seems pretty silly to me to
not make use of the abbreviated keys that the sort already went to the
trouble of building, and it seems like the revision to the SortSupport
contract is not likely to notably limit any interesting cases in the
future, so I think that this will be uncontroversial.

[1]: /messages/by-id/CAM3SWZTvEtVDysXeE7Py-8Ar+-+XRAPkJCZe-FH_V+zWxmuS9A@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchtext/x-patch; charset=US-ASCII; name=0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchDownload
From 7ae66ebcda9de24b4b148ad90630d89a401feb68 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 6 Jul 2015 13:37:26 -0700
Subject: [PATCH 2/2] Reuse abbreviated keys in ordered [set] aggregates

When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.  Only strict abbreviated key binary inequality is considered,
which is safe.
---
 src/backend/catalog/index.c            |  2 +-
 src/backend/executor/nodeAgg.c         | 20 ++++++++++++----
 src/backend/executor/nodeSort.c        |  2 +-
 src/backend/utils/adt/orderedsetaggs.c | 33 ++++++++++++++++++--------
 src/backend/utils/sort/tuplesort.c     | 42 ++++++++++++++++++++++++++++------
 src/include/utils/tuplesort.h          |  4 ++--
 6 files changed, 77 insertions(+), 26 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..52d6f19 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2982,7 +2982,7 @@ validate_index_heapscan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  &ts_val, &ts_isnull);
+												  &ts_val, &ts_isnull, NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			indexcursor = (ItemPointer) DatumGetPointer(ts_val);
 		}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2bf48c5..9b932c4 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -476,7 +476,8 @@ fetch_input_tuple(AggState *aggstate)
 
 	if (aggstate->sort_in)
 	{
-		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot))
+		if (!tuplesort_gettupleslot(aggstate->sort_in, true,
+									aggstate->sort_slot, NULL))
 			return NULL;
 		slot = aggstate->sort_slot;
 	}
@@ -831,8 +832,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
  * The one-input case is handled separately from the multi-input case
  * for performance reasons: for single by-value inputs, such as the
  * common case of count(distinct id), the tuplesort_getdatum code path
- * is around 300% faster.  (The speedup for by-reference types is less
- * but still noticeable.)
+ * is around 300% faster.  (The speedup for by-reference types without
+ * abbreviated key support is less but still noticeable.)
  *
  * This function handles only one grouping set (already set in
  * aggstate->current_set).
@@ -850,6 +851,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	MemoryContext oldContext;
 	bool		isDistinct = (peraggstate->numDistinctCols > 0);
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	FunctionCallInfo fcinfo = &peraggstate->transfn_fcinfo;
 	Datum	   *newVal;
 	bool	   *isNull;
@@ -869,7 +872,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 */
 
 	while (tuplesort_getdatum(peraggstate->sortstates[aggstate->current_set],
-							  true, newVal, isNull))
+							  true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -887,6 +890,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 			haveOldVal &&
 			((oldIsNull && *isNull) ||
 			 (!oldIsNull && !*isNull &&
+			  oldAbbrevVal == newAbbrevVal &&
 			  DatumGetBool(FunctionCall2(&peraggstate->equalfns[0],
 										 oldVal, *newVal)))))
 		{
@@ -902,6 +906,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 				pfree(DatumGetPointer(oldVal));
 			/* and remember the new one for subsequent equality checks */
 			oldVal = *newVal;
+			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
@@ -939,6 +944,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 	TupleTableSlot *slot2 = peraggstate->uniqslot;
 	int			numTransInputs = peraggstate->numTransInputs;
 	int			numDistinctCols = peraggstate->numDistinctCols;
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	bool		haveOldValue = false;
 	int			i;
 
@@ -949,7 +956,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 		ExecClearTuple(slot2);
 
 	while (tuplesort_gettupleslot(peraggstate->sortstates[aggstate->current_set],
-								  true, slot1))
+								  true, slot1, &newAbbrevVal))
 	{
 		/*
 		 * Extract the first numTransInputs columns as datums to pass to the
@@ -960,6 +967,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 		if (numDistinctCols == 0 ||
 			!haveOldValue ||
+			newAbbrevVal != oldAbbrevVal ||
 			!execTuplesMatch(slot1, slot2,
 							 numDistinctCols,
 							 peraggstate->sortColIdx,
@@ -983,6 +991,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 				slot2 = slot1;
 				slot1 = tmpslot;
+				/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+				oldAbbrevVal = newAbbrevVal;
 				haveOldValue = true;
 			}
 		}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..35dd993 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -137,7 +137,7 @@ ExecSort(SortState *node)
 	slot = node->ss.ps.ps_ResultTupleSlot;
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
-								  slot);
+								  slot, NULL);
 	return slot;
 }
 
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..1a5f696 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -453,7 +453,7 @@ percentile_disc_final(PG_FUNCTION_ARGS)
 			elog(ERROR, "missing row in percentile_disc");
 	}
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_disc");
 
 	/*
@@ -553,7 +553,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	if (!tuplesort_skiptuples(osastate->sortstate, first_row, true))
 		elog(ERROR, "missing row in percentile_cont");
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_cont");
 	if (isnull)
 		PG_RETURN_NULL();
@@ -564,7 +564,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	}
 	else
 	{
-		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull))
+		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull, NULL))
 			elog(ERROR, "missing row in percentile_cont");
 
 		if (isnull)
@@ -792,7 +792,7 @@ percentile_disc_multi_final(PG_FUNCTION_ARGS)
 				if (!tuplesort_skiptuples(osastate->sortstate, target_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_disc");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 					elog(ERROR, "missing row in percentile_disc");
 
 				rownum = target_row;
@@ -921,7 +921,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 				if (!tuplesort_skiptuples(osastate->sortstate, first_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_cont");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 
 				rownum = first_row;
@@ -941,7 +942,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 			/* Fetch second_row if needed */
 			if (second_row > rownum)
 			{
-				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 				rownum++;
 			}
@@ -1016,6 +1018,8 @@ mode_final(PG_FUNCTION_ARGS)
 	int64		last_val_freq = 0;
 	bool		last_val_is_mode = false;
 	FmgrInfo   *equalfn;
+	Datum		abbrev_val = (Datum) 0;
+	Datum		last_abbrev_val = (Datum) 0;
 	bool		shouldfree;
 
 	Assert(AggCheckCallContext(fcinfo, NULL) == AGG_CONTEXT_AGGREGATE);
@@ -1042,7 +1046,7 @@ mode_final(PG_FUNCTION_ARGS)
 	tuplesort_performsort(osastate->sortstate);
 
 	/* Scan tuples and count frequencies */
-	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, &abbrev_val))
 	{
 		/* we don't expect any nulls, but ignore them if found */
 		if (isnull)
@@ -1054,8 +1058,10 @@ mode_final(PG_FUNCTION_ARGS)
 			mode_val = last_val = val;
 			mode_freq = last_val_freq = 1;
 			last_val_is_mode = true;
+			last_abbrev_val = abbrev_val;
 		}
-		else if (DatumGetBool(FunctionCall2(equalfn, val, last_val)))
+		else if (abbrev_val == last_abbrev_val &&
+				 DatumGetBool(FunctionCall2(equalfn, val, last_val)))
 		{
 			/* value equal to previous value, count it */
 			if (last_val_is_mode)
@@ -1078,6 +1084,8 @@ mode_final(PG_FUNCTION_ARGS)
 			if (shouldfree && !last_val_is_mode)
 				pfree(DatumGetPointer(last_val));
 			last_val = val;
+			/* avoid equality function calls by reusing abbreviated keys */
+			last_abbrev_val = abbrev_val;
 			last_val_freq = 1;
 			last_val_is_mode = false;
 		}
@@ -1181,7 +1189,7 @@ hypothetical_rank_common(FunctionCallInfo fcinfo, int flag,
 	tuplesort_performsort(osastate->sortstate);
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, NULL))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1266,6 +1274,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	int64		duplicate_count = 0;
 	OSAPerGroupState *osastate;
 	int			numDistinctCols;
+	Datum		abbrevVal = (Datum) 0;
+	Datum		abbrevOld = (Datum) 0;
 	AttrNumber *sortColIdx;
 	FmgrInfo   *equalfns;
 	TupleTableSlot *slot;
@@ -1342,7 +1352,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	slot2 = extraslot;
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, &abbrevVal))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1353,6 +1363,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 
 		/* count non-distinct tuples */
 		if (!TupIsNull(slot2) &&
+			abbrevVal == abbrevOld &&
 			execTuplesMatch(slot, slot2,
 							numDistinctCols,
 							sortColIdx,
@@ -1363,6 +1374,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 		tmpslot = slot2;
 		slot2 = slot;
 		slot = tmpslot;
+		/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+		abbrevOld = abbrevVal;
 
 		rank++;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 435041a..0c1629d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1265,7 +1265,8 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup.datum1 = original;
 	}
@@ -1327,7 +1328,9 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 	 * stup.tuple and is treated as the canonical copy (e.g. to return via
 	 * tuplesort_getdatum or when writing to tape); stup.datum1 gets the
 	 * abbreviated value if abbreviation is happening, otherwise it's
-	 * identical to stup.tuple.
+	 * identical to stup.tuple. stup.datum1 has a "void" representation for
+	 * NULL values, which abbreviated keys require for cheap inequality
+	 * checks.
 	 */
 
 	if (isNull || state->datumTypeByVal)
@@ -1847,10 +1850,17 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
  * Fetch the next tuple in either forward or back direction.
  * If successful, put tuple in slot and return TRUE; else, clear the slot
  * and return FALSE.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  NULL
+ * value in leading attribute will set abbreviated value to "void" abbreviated
+ * representation, which caller may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot)
+					   TupleTableSlot *slot, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1863,6 +1873,10 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
 
 	if (stup.tuple)
 	{
+		/* Record abbreviated key for caller */
+		if (state->sortKeys->abbrev_converter && abbrev)
+			*abbrev = stup.datum1;
+
 		ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, should_free);
 		return true;
 	}
@@ -1918,10 +1932,17 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward,
  *
  * If the Datum is pass-by-ref type, the returned value is freshly palloc'd
  * and is now owned by the caller.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  NULL
+ * value in leading attribute will set abbreviated value to "void" abbreviated
+ * representation, which caller may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull)
+				   Datum *val, bool *isNull, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1933,6 +1954,10 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 		return false;
 	}
 
+	/* Record abbreviated key for caller */
+	if (state->sortKeys->abbrev_converter && abbrev)
+		*abbrev = stup.datum1;
+
 	if (stup.isnull1 || state->datumTypeByVal)
 	{
 		*val = stup.datum1;
@@ -3145,7 +3170,8 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3385,7 +3411,8 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3687,7 +3714,8 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..9af1159 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -93,13 +93,13 @@ extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 extern void tuplesort_performsort(Tuplesortstate *state);
 
 extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot);
+					   TupleTableSlot *slot, Datum *abbrev);
 extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward,
 					   bool *should_free);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward,
 						bool *should_free);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull);
+				   Datum *val, bool *isNull, Datum *abbrev);
 
 extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 					 bool forward);
-- 
1.9.1

0001-Tighten-SortSupport-abbreviated-key-contract.patchtext/x-patch; charset=US-ASCII; name=0001-Tighten-SortSupport-abbreviated-key-contract.patchDownload
From 8b1343903a68a1315a3b3334bcebc42cb722a470 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Thu, 9 Jul 2015 15:59:07 -0700
Subject: [PATCH 1/2] Tighten SortSupport abbreviated key contract

Revise the contract that SortSupport has with its clients to forbid an
abbreviated comparator ever returning 0 ("indeterminate") in the event
of two abbreviated keys not being bitwise equal.  This gives certain
cases the leeway to inexpensively determine tuple inequality from afar,
by comparing abbreviated keys using simple unsigned integer (Datum)
comparisons, and trusting that bitwise inequality amounts to leading
attribute inequality (and, therefore, tuple inequality).  Direct access
to SortSupport state is consequently not required, which will be
convenient for tuplesort clients (that only have an opaque state
pointer), while also saving us a few additional cycles.

Cheap inequality checks of this nature might otherwise give wrong
answers due (for example) to some opclass abbreviated comparator
interpreting non-identical abbreviated keys with respect to some state
built during memtuple copying (although that is probably still safe,
depending on the exact details).  Pass-by-reference abbreviated keys
could similarly cause problems, and so are now forbidden.  No existing
opclass with abbreviation support does anything like this, so no change
in any opclass SortSupport routine.

Backpatch to 9.5, where abbreviated keys were introduced.
---
 src/include/utils/sortsupport.h | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 787404e..3d78c3a 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -116,7 +116,7 @@ typedef struct SortSupportData
 	 *
 	 * This allows opclass authors to supply a conversion routine, used to
 	 * create an alternative representation of the underlying type (an
-	 * "abbreviated key").  Typically, this representation is an ad-hoc,
+	 * "abbreviated key").  Invariably, this representation is an ad-hoc,
 	 * pass-by-value Datum format that only the opclass has knowledge of.  An
 	 * alternative comparator, used only with this alternative representation
 	 * must also be provided (which is assigned to "comparator").  This
@@ -135,6 +135,18 @@ typedef struct SortSupportData
 	 * cache miss penalties are expensive; to get good overall performance,
 	 * sort infrastructure must heavily weigh cache performance.
 	 *
+	 * There is one limited sense in which the core code has knowledge of the
+	 * abbreviated key format:  inequality.  The core system makes a generic
+	 * assumption that two non-bitwise-identical abbreviated keys could never
+	 * have the abbreviated comparator return a 0 or "inconclusive" result for
+	 * two non-identical abbreviated keys (this enables cheap inequality tests
+	 * made from a distance, without opclass involvement.  Furthermore, it
+	 * implies that abbreviated keys must always be pass-by-value).  Note that
+	 * it is still permitted for the abbreviated comparator to use state
+	 * generated during calls to abbrev_converter -- for example, a lookup
+	 * table could be used, provided any identifier baked into the abbreviated
+	 * representation is canonical.
+	 *
 	 * Opclass authors must consider the final cardinality of abbreviated keys
 	 * when devising an encoding scheme.  It's possible for a strategy to work
 	 * better than an alternative strategy with one usage pattern, while the
-- 
1.9.1

#2Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Peter Geoghegan (#1)
1 attachment(s)
Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Sat, Jul 11, 2015 at 10:05 AM, Peter Geoghegan <pg@heroku.com> wrote:

Currently, there are certain aggregates that sort some tuples before
making a second pass over their memtuples array (through the tuplesort
"gettuple" interface), calling one or more attribute equality operator
functions as they go. For example, this occurs during the execution of
the following queries:

-- Salary is a numeric column:
SELECT count(DISTINCT salary) AS distinct_compensation_levels FROM
employees;
-- Get most common distinct salary in organization:
SELECT mode() WITHIN GROUP (ORDER BY salary) AS
most_common_compensation FROM employees;

Since these examples involve numeric comparisons, the equality
operator calls involved in that second pass are expensive. There is no
equality fast-path for numeric equality as there is for text; I
imagine that in practice most calls to texteq() are resolved based on
raw datum size mismatch, which is dirt cheap (while the alternative, a
memcmp(), is still very cheap).

Forgive me for going off on a tangent here: I understand that the point of
your patch is to avoid calls to numeric_eq altogether in these scenarios
when cardinality is high, but the above sentence made me wonder how likely
it is that numeric values in typical workloads have the same size, weight
and sign and could therefore be handled in numeric_eq with memcmp() == 0
rather than numeric_cmp() == 0, and whether there would be any benefit.

Running various contrived aggregate queries on a large low cardinality
dataset in a small range (therefore frequently the same weight & size), I
managed to measure a small improvement of up to a few percent with the
attached patch. I also wonder whether numeric_cmp could be profitably
implemented with memcmp on big endian systems and some unrolled loops on
little endian systems when size & weight match.

Of course there are a ton of other overheads involved with numeric. I
wonder how crazy or difficult it would be to make it so that we could
optionally put a pass-by-value NUMERIC in a Datum, setting a low order tag
bit to say 'I'm an immediate value, not a pointer', and then packing 3
digits (= 12 significant figures) + sign + weight into the other 63 bits.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

numeric-eq-memcmp.patchapplication/octet-stream; name=numeric-eq-memcmp.patchDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 1bfa29e..6bf36ff 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -2000,7 +2000,13 @@ numeric_eq(PG_FUNCTION_ARGS)
 	Numeric		num2 = PG_GETARG_NUMERIC(1);
 	bool		result;
 
-	result = cmp_numerics(num1, num2) == 0;
+	if (VARSIZE(num1) == VARSIZE(num2) && NUMERIC_IS_SHORT(num1) &&
+		num1->choice.n_header == num2->choice.n_header)
+		result = memcmp(num1->choice.n_short.n_data,
+						num2->choice.n_short.n_data,
+						VARSIZE(num1) - NUMERIC_HDRSZ_SHORT) == 0;
+	else
+		result = cmp_numerics(num1, num2) == 0;
 
 	PG_FREE_IF_COPY(num1, 0);
 	PG_FREE_IF_COPY(num2, 1);
#3Jeff Janes
jeff.janes@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

This needs a rebase, there are several conflicts in src/backend/executor/nodeAgg.c

Thanks,

Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Peter Geoghegan
pg@heroku.com
In reply to: Thomas Munro (#2)
Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Sun, Sep 6, 2015 at 9:35 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Running various contrived aggregate queries on a large low cardinality
dataset in a small range (therefore frequently the same weight & size), I
managed to measure a small improvement of up to a few percent with the
attached patch. I also wonder whether numeric_cmp could be profitably
implemented with memcmp on big endian systems and some unrolled loops on
little endian systems when size & weight match.

I think that setting up numeric SortSupport to have scratch memory
used across authoritative numeric comparator calls would also help.

We should prefer to do this kind of thing in a datatype independent
way, of course. I'm not opposed to doing what you outline too, but I
don't think it will be especially helpful for the cases here. I think
that what you're talking about would live in the SortSupport
authoritative comparator, and would benefit non-leading-attribute
comparisons most.

Of course there are a ton of other overheads involved with numeric. I
wonder how crazy or difficult it would be to make it so that we could
optionally put a pass-by-value NUMERIC in a Datum, setting a low order tag
bit to say 'I'm an immediate value, not a pointer', and then packing 3
digits (= 12 significant figures) + sign + weight into the other 63 bits.

That seems possible, but very invasive. I'd want to get a good sense
of the pay-off before undertaking such a project.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Peter Geoghegan
pg@heroku.com
In reply to: Jeff Janes (#3)
1 attachment(s)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Fri, Sep 25, 2015 at 2:39 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

This needs a rebase, there are several conflicts in src/backend/executor/nodeAgg.c

I attached a revised version of the second patch in the series, fixing
this bitrot.

I also noticed a bug in tuplesort's TSS_SORTEDONTAPE case with the
previous patch, where no final on-the-fly merge step is required (no
merge step is required whatsoever, because replacement selection
managed to produce only one run). The function mergeruns() previously
only "abandoned" abbreviated ahead of any merge step iff there was
one. When there was only one run (not requiring a merge) it happened
to continue to keep its state consistent with abbreviated keys still
being in use. It didn't matter before, because abbreviated keys were
only for tuplesort to compare, but that's different now.

That bug is fixed in this revision by reordering things within
mergeruns(). The previous order of the two things that were switched
is not at all significant (I should know, I wrote that code).

Thanks
--
Peter Geoghegan

Attachments:

0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchtext/x-patch; charset=US-ASCII; name=0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchDownload
From a82f0f723e1f5206d9c19b1b277acc0625aa54b1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 6 Jul 2015 13:37:26 -0700
Subject: [PATCH 2/2] Reuse abbreviated keys in ordered [set] aggregates

When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.  Only strict abbreviated key binary inequality is considered,
which is safe.
---
 src/backend/catalog/index.c            |  2 +-
 src/backend/executor/nodeAgg.c         | 20 ++++++---
 src/backend/executor/nodeSort.c        |  2 +-
 src/backend/utils/adt/orderedsetaggs.c | 33 ++++++++++-----
 src/backend/utils/sort/tuplesort.c     | 74 +++++++++++++++++++++++-----------
 src/include/utils/tuplesort.h          |  4 +-
 6 files changed, 93 insertions(+), 42 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..bb61018 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3024,7 +3024,7 @@ validate_index_heapscan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  &ts_val, &ts_isnull);
+												  &ts_val, &ts_isnull, NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			indexcursor = (ItemPointer) DatumGetPointer(ts_val);
 		}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..2972180 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -539,7 +539,8 @@ fetch_input_tuple(AggState *aggstate)
 
 	if (aggstate->sort_in)
 	{
-		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot))
+		if (!tuplesort_gettupleslot(aggstate->sort_in, true,
+									aggstate->sort_slot, NULL))
 			return NULL;
 		slot = aggstate->sort_slot;
 	}
@@ -894,8 +895,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
  * The one-input case is handled separately from the multi-input case
  * for performance reasons: for single by-value inputs, such as the
  * common case of count(distinct id), the tuplesort_getdatum code path
- * is around 300% faster.  (The speedup for by-reference types is less
- * but still noticeable.)
+ * is around 300% faster.  (The speedup for by-reference types without
+ * abbreviated key support is less but still noticeable.)
  *
  * This function handles only one grouping set (already set in
  * aggstate->current_set).
@@ -913,6 +914,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	MemoryContext oldContext;
 	bool		isDistinct = (pertrans->numDistinctCols > 0);
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	FunctionCallInfo fcinfo = &pertrans->transfn_fcinfo;
 	Datum	   *newVal;
 	bool	   *isNull;
@@ -932,7 +935,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 */
 
 	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-							  true, newVal, isNull))
+							  true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -950,6 +953,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 			haveOldVal &&
 			((oldIsNull && *isNull) ||
 			 (!oldIsNull && !*isNull &&
+			  oldAbbrevVal == newAbbrevVal &&
 			  DatumGetBool(FunctionCall2(&pertrans->equalfns[0],
 										 oldVal, *newVal)))))
 		{
@@ -965,6 +969,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 				pfree(DatumGetPointer(oldVal));
 			/* and remember the new one for subsequent equality checks */
 			oldVal = *newVal;
+			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
@@ -1002,6 +1007,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 	TupleTableSlot *slot2 = pertrans->uniqslot;
 	int			numTransInputs = pertrans->numTransInputs;
 	int			numDistinctCols = pertrans->numDistinctCols;
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	bool		haveOldValue = false;
 	int			i;
 
@@ -1012,7 +1019,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 		ExecClearTuple(slot2);
 
 	while (tuplesort_gettupleslot(pertrans->sortstates[aggstate->current_set],
-								  true, slot1))
+								  true, slot1, &newAbbrevVal))
 	{
 		/*
 		 * Extract the first numTransInputs columns as datums to pass to the
@@ -1023,6 +1030,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 		if (numDistinctCols == 0 ||
 			!haveOldValue ||
+			newAbbrevVal != oldAbbrevVal ||
 			!execTuplesMatch(slot1, slot2,
 							 numDistinctCols,
 							 pertrans->sortColIdx,
@@ -1046,6 +1054,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 				slot2 = slot1;
 				slot1 = tmpslot;
+				/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+				oldAbbrevVal = newAbbrevVal;
 				haveOldValue = true;
 			}
 		}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..35dd993 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -137,7 +137,7 @@ ExecSort(SortState *node)
 	slot = node->ss.ps.ps_ResultTupleSlot;
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
-								  slot);
+								  slot, NULL);
 	return slot;
 }
 
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..1a5f696 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -453,7 +453,7 @@ percentile_disc_final(PG_FUNCTION_ARGS)
 			elog(ERROR, "missing row in percentile_disc");
 	}
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_disc");
 
 	/*
@@ -553,7 +553,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	if (!tuplesort_skiptuples(osastate->sortstate, first_row, true))
 		elog(ERROR, "missing row in percentile_cont");
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_cont");
 	if (isnull)
 		PG_RETURN_NULL();
@@ -564,7 +564,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	}
 	else
 	{
-		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull))
+		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull, NULL))
 			elog(ERROR, "missing row in percentile_cont");
 
 		if (isnull)
@@ -792,7 +792,7 @@ percentile_disc_multi_final(PG_FUNCTION_ARGS)
 				if (!tuplesort_skiptuples(osastate->sortstate, target_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_disc");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 					elog(ERROR, "missing row in percentile_disc");
 
 				rownum = target_row;
@@ -921,7 +921,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 				if (!tuplesort_skiptuples(osastate->sortstate, first_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_cont");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 
 				rownum = first_row;
@@ -941,7 +942,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 			/* Fetch second_row if needed */
 			if (second_row > rownum)
 			{
-				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 				rownum++;
 			}
@@ -1016,6 +1018,8 @@ mode_final(PG_FUNCTION_ARGS)
 	int64		last_val_freq = 0;
 	bool		last_val_is_mode = false;
 	FmgrInfo   *equalfn;
+	Datum		abbrev_val = (Datum) 0;
+	Datum		last_abbrev_val = (Datum) 0;
 	bool		shouldfree;
 
 	Assert(AggCheckCallContext(fcinfo, NULL) == AGG_CONTEXT_AGGREGATE);
@@ -1042,7 +1046,7 @@ mode_final(PG_FUNCTION_ARGS)
 	tuplesort_performsort(osastate->sortstate);
 
 	/* Scan tuples and count frequencies */
-	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, &abbrev_val))
 	{
 		/* we don't expect any nulls, but ignore them if found */
 		if (isnull)
@@ -1054,8 +1058,10 @@ mode_final(PG_FUNCTION_ARGS)
 			mode_val = last_val = val;
 			mode_freq = last_val_freq = 1;
 			last_val_is_mode = true;
+			last_abbrev_val = abbrev_val;
 		}
-		else if (DatumGetBool(FunctionCall2(equalfn, val, last_val)))
+		else if (abbrev_val == last_abbrev_val &&
+				 DatumGetBool(FunctionCall2(equalfn, val, last_val)))
 		{
 			/* value equal to previous value, count it */
 			if (last_val_is_mode)
@@ -1078,6 +1084,8 @@ mode_final(PG_FUNCTION_ARGS)
 			if (shouldfree && !last_val_is_mode)
 				pfree(DatumGetPointer(last_val));
 			last_val = val;
+			/* avoid equality function calls by reusing abbreviated keys */
+			last_abbrev_val = abbrev_val;
 			last_val_freq = 1;
 			last_val_is_mode = false;
 		}
@@ -1181,7 +1189,7 @@ hypothetical_rank_common(FunctionCallInfo fcinfo, int flag,
 	tuplesort_performsort(osastate->sortstate);
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, NULL))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1266,6 +1274,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	int64		duplicate_count = 0;
 	OSAPerGroupState *osastate;
 	int			numDistinctCols;
+	Datum		abbrevVal = (Datum) 0;
+	Datum		abbrevOld = (Datum) 0;
 	AttrNumber *sortColIdx;
 	FmgrInfo   *equalfns;
 	TupleTableSlot *slot;
@@ -1342,7 +1352,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	slot2 = extraslot;
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, &abbrevVal))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1353,6 +1363,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 
 		/* count non-distinct tuples */
 		if (!TupIsNull(slot2) &&
+			abbrevVal == abbrevOld &&
 			execTuplesMatch(slot, slot2,
 							numDistinctCols,
 							sortColIdx,
@@ -1363,6 +1374,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 		tmpslot = slot2;
 		slot2 = slot;
 		slot = tmpslot;
+		/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+		abbrevOld = abbrevVal;
 
 		rank++;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d532e87..49cf4cc 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1272,7 +1272,8 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup.datum1 = original;
 	}
@@ -1334,7 +1335,9 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 	 * stup.tuple and is treated as the canonical copy (e.g. to return via
 	 * tuplesort_getdatum or when writing to tape); stup.datum1 gets the
 	 * abbreviated value if abbreviation is happening, otherwise it's
-	 * identical to stup.tuple.
+	 * identical to stup.tuple. stup.datum1 has a "void" representation for
+	 * NULL values, which abbreviated keys require for cheap inequality
+	 * checks.
 	 */
 
 	if (isNull || state->datumTypeByVal)
@@ -1854,10 +1857,17 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
  * Fetch the next tuple in either forward or back direction.
  * If successful, put tuple in slot and return TRUE; else, clear the slot
  * and return FALSE.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  NULL
+ * value in leading attribute will set abbreviated value to "void" abbreviated
+ * representation, which caller may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot)
+					   TupleTableSlot *slot, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1870,6 +1880,10 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
 
 	if (stup.tuple)
 	{
+		/* Record abbreviated key for caller */
+		if (state->sortKeys->abbrev_converter && abbrev)
+			*abbrev = stup.datum1;
+
 		ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, should_free);
 		return true;
 	}
@@ -1925,10 +1939,17 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward,
  *
  * If the Datum is pass-by-ref type, the returned value is freshly palloc'd
  * and is now owned by the caller.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  NULL
+ * value in leading attribute will set abbreviated value to "void" abbreviated
+ * representation, which caller may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull)
+				   Datum *val, bool *isNull, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1940,6 +1961,10 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 		return false;
 	}
 
+	/* Record abbreviated key for caller */
+	if (state->sortKeys->abbrev_converter && abbrev)
+		*abbrev = stup.datum1;
+
 	if (stup.isnull1 || state->datumTypeByVal)
 	{
 		*val = stup.datum1;
@@ -2220,6 +2245,22 @@ mergeruns(Tuplesortstate *state)
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
 
+	if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+	{
+		/*
+		 * If there are multiple runs to be merged, when we go to read back
+		 * tuples from disk, abbreviated keys will not have been stored, and
+		 * we don't care to regenerate them.  Disable abbreviation from this
+		 * point on.
+		 */
+		state->sortKeys->abbrev_converter = NULL;
+		state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
+
+		/* Not strictly necessary, but be tidy */
+		state->sortKeys->abbrev_abort = NULL;
+		state->sortKeys->abbrev_full_comparator = NULL;
+	}
+
 	/*
 	 * If we produced only one initial run (quite likely if the total data
 	 * volume is between 1X and 2X workMem), we can just use that tape as the
@@ -2235,22 +2276,6 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
-	if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
-	{
-		/*
-		 * If there are multiple runs to be merged, when we go to read back
-		 * tuples from disk, abbreviated keys will not have been stored, and
-		 * we don't care to regenerate them.  Disable abbreviation from this
-		 * point on.
-		 */
-		state->sortKeys->abbrev_converter = NULL;
-		state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
-
-		/* Not strictly necessary, but be tidy */
-		state->sortKeys->abbrev_abort = NULL;
-		state->sortKeys->abbrev_full_comparator = NULL;
-	}
-
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -3152,7 +3177,8 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3392,7 +3418,8 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3694,7 +3721,8 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..9af1159 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -93,13 +93,13 @@ extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 extern void tuplesort_performsort(Tuplesortstate *state);
 
 extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot);
+					   TupleTableSlot *slot, Datum *abbrev);
 extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward,
 					   bool *should_free);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward,
 						bool *should_free);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull);
+				   Datum *val, bool *isNull, Datum *abbrev);
 
 extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 					 bool forward);
-- 
1.9.1

#6Andreas Karlsson
andreas@proxel.se
In reply to: Peter Geoghegan (#5)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

Hi,

I have reviewed this patch and it still applies to master, compiles and
passes the test suite.

I like the goal of the patch, making use of the already existing
abbreviation machinery in more cases is something we should do and the
patch itself looks clean.

I can also confirm the roughly 25% speedup in the best case (numerics
which are all distinct) with no measurable slowdown in the worst case.

Given this speedup and the small size of the patch I think we should
apply it. I will set this patch to "Ready for Commiter".

Andreas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Peter Geoghegan
pg@heroku.com
In reply to: Andreas Karlsson (#6)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Sun, Dec 6, 2015 at 7:14 PM, Andreas Karlsson <andreas@proxel.se> wrote:

I can also confirm the roughly 25% speedup in the best case (numerics which
are all distinct) with no measurable slowdown in the worst case.

Given this speedup and the small size of the patch I think we should apply
it. I will set this patch to "Ready for Commiter".

Thanks for the review, Andreas.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Sat, Oct 10, 2015 at 9:03 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Sep 25, 2015 at 2:39 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

This needs a rebase, there are several conflicts in src/backend/executor/nodeAgg.c

I attached a revised version of the second patch in the series, fixing
this bitrot.

I also noticed a bug in tuplesort's TSS_SORTEDONTAPE case with the
previous patch, where no final on-the-fly merge step is required (no
merge step is required whatsoever, because replacement selection
managed to produce only one run). The function mergeruns() previously
only "abandoned" abbreviated ahead of any merge step iff there was
one. When there was only one run (not requiring a merge) it happened
to continue to keep its state consistent with abbreviated keys still
being in use. It didn't matter before, because abbreviated keys were
only for tuplesort to compare, but that's different now.

That bug is fixed in this revision by reordering things within
mergeruns(). The previous order of the two things that were switched
is not at all significant (I should know, I wrote that code).

I find the references to a "void" representation in this patch to be
completely opaque. I see that there are some such references in
tuplesort.c already, and most likely they were put there by commits
that I did, so I guess I have nobody but myself to blame, but I don't
know what this means, and I don't think we should let this terminology
proliferate.

My understanding is that the "void" representation is intended to
whatever Datum we originally got, which might be a pointer. Why not
just say that instead of referring to it this way?

My understanding is also that it's OK if the abbreviated key stays the
same even though the value has changed, but that the reverse could
cause queries to return wrong answers. The first part of that
justifies why this is safe when no abbreviation is available: we'll
return an abbreviated value of 0 for everything, but that's fine.
However, using the original Datum (which might be a pointer) seems
unsafe, because two binary-identical values could be stored at
different addresses and thus have different pointer representations.

I'm probably missing something here, so clue me in...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#8)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 9, 2015 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I find the references to a "void" representation in this patch to be
completely opaque. I see that there are some such references in
tuplesort.c already, and most likely they were put there by commits
that I did, so I guess I have nobody but myself to blame, but I don't
know what this means, and I don't think we should let this terminology
proliferate.

My understanding is that the "void" representation is intended to
whatever Datum we originally got, which might be a pointer. Why not
just say that instead of referring to it this way?

That isn't what is intended. "void" is the state that macros like
index_getattr() leave NULL leading attributes (that go in the
SortTuple.datum1 field) in. However, the function tuplesort_putdatum()
requires callers to initialize their Datum to 0 now, which is new. A
"void" representation is a would-be NULL pointer in the case of
pass-by-value types, and a NULL pointer for pass-by-reference types.

My understanding is also that it's OK if the abbreviated key stays the
same even though the value has changed, but that the reverse could
cause queries to return wrong answers. The first part of that
justifies why this is safe when no abbreviation is available: we'll
return an abbreviated value of 0 for everything, but that's fine.
However, using the original Datum (which might be a pointer) seems
unsafe, because two binary-identical values could be stored at
different addresses and thus have different pointer representations.

I'm probably missing something here, so clue me in...

I think that you're missing that patch 0001 formally forbids
abbreviated keys that are pass-by-value, by revising the contract
(this is proposed for backpatch to 9.5 -- only comments are changed).
This is already something that is all but forbidden, although the
datum case does tacitly acknowledge the possibility by not allowing
abbreviation to work with the pass-by-value-and-yet-abbreviated case.

I think that this revision is also useful for putting abbreviated keys
in indexes, something that may happen yet.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#9)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 9, 2015 at 2:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

I think that you're missing that patch 0001 formally forbids
abbreviated keys that are pass-by-value

Sorry. I do of course mean it forbids abbreviated keys that are *not*
pass-by-value (that are pass-by-reference).

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#9)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 9, 2015 at 2:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

I think that you're missing that patch 0001 formally forbids
abbreviated keys that are pass-by-value, by revising the contract
(this is proposed for backpatch to 9.5 -- only comments are changed).
This is already something that is all but forbidden, although the
datum case does tacitly acknowledge the possibility by not allowing
abbreviation to work with the pass-by-value-and-yet-abbreviated case.

I think that this revision is also useful for putting abbreviated keys
in indexes, something that may happen yet.

I'm also depending on this for the "quicksort for every sort run" patch, BTW:

+           /*
+            * Kludge:  Trigger abbreviated tie-breaker if in-memory tuples
+            * use abbreviation (writing tuples to tape never preserves
+            * abbreviated keys).  Do this by assigning in-memory
+            * abbreviated tuple to tape tuple directly.
+            *
+            * It doesn't seem worth generating a new abbreviated key for
+            * the tape tuple, and this approach is simpler than
+            * "unabbreviating" the memtuple tuple from a "common" routine
+            * like this.
+            */
+           if (state->sortKeys != NULL &&
state->sortKeys->abbrev_converter != NULL)
+               stup->datum1 = state->memtuples[state->current].datum1;

I could, as an alternative approach, revise tuplesort so that
self-comparison works (something we currently assert against [1]Commit c5a03256c -- Peter Geoghegan),
something that would probably *also* require and update to the
sortsupport.h contract, but this seemed simpler and more general.

In general, I think that there are plenty of reasons to forbid
pass-by-reference abbreviated keys (where the abbreviated comparator
itself is a pointer, something much more complicated than an integer
3-way comparison or similar).

[1]: Commit c5a03256c -- Peter Geoghegan
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#9)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 9, 2015 at 5:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Dec 9, 2015 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I find the references to a "void" representation in this patch to be
completely opaque. I see that there are some such references in
tuplesort.c already, and most likely they were put there by commits
that I did, so I guess I have nobody but myself to blame, but I don't
know what this means, and I don't think we should let this terminology
proliferate.

My understanding is that the "void" representation is intended to
whatever Datum we originally got, which might be a pointer. Why not
just say that instead of referring to it this way?

That isn't what is intended. "void" is the state that macros like
index_getattr() leave NULL leading attributes (that go in the
SortTuple.datum1 field) in.

What kind of state is that? Can't we define this in terms of what it
is rather than how it gets that way?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#12)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 16, 2015 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:

That isn't what is intended. "void" is the state that macros like
index_getattr() leave NULL leading attributes (that go in the
SortTuple.datum1 field) in.

What kind of state is that? Can't we define this in terms of what it
is rather than how it gets that way?

It's zeroed.

I guess we can update everything, including existing comments, to reflect that.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#13)
2 attachment(s)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 16, 2015 at 12:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

What kind of state is that? Can't we define this in terms of what it
is rather than how it gets that way?

It's zeroed.

I guess we can update everything, including existing comments, to reflect that.

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

Changes:

* Changes commit intended for backpatch to 9.5 to use new "zeroed"
terminology (not "void" representation).

* Puts fix within mergerun() (for this patch) in the same backpatch
commit. We might as well keep this consistent, even though the
correctness of 9.5 does not actually hinge upon having this change --
there is zero risk of breaking something in 9.5, and avoiding
backpatching this part can only lead to confusion in the future.

* Some minor tweaks to tuplesort.c. Make sure that
tuplesort_putdatum() zeroes datum1 directly for NULL values. Comment
tweaks.

* Add comment to the datum case, discussing restriction there.

I was wrong when I said that the datum sort case currently tacitly
acknowledges as possible/useful something that the revised SortSupport
contract will forbid (I wrote that e-mail in haste).

What I propose to make the SortSupport contract forbid here is
abbreviated keys that are *themselves* pass-by-reference. It is true
that today, we do not support pass-by-value types with abbreviated
keys for the datum case only (such opclass support could be slightly
useful for float8, for example -- int8 comparisons of an alternative
representation might be decisively cheaper). But that existing
restriction is entirely distinct to the new restriction I propose to
create. It seems like this distinction should be pointed out in
passing, which I've done.

--
Peter Geoghegan

Attachments:

0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchtext/x-patch; charset=US-ASCII; name=0002-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchDownload
From c740eac2a4ba42ac2b5ab34c342534ff8d944666 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 6 Jul 2015 13:37:26 -0700
Subject: [PATCH 2/2] Reuse abbreviated keys in ordered [set] aggregates

When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.  Only strict abbreviated key binary inequality is considered,
which is safe.
---
 src/backend/catalog/index.c            |  2 +-
 src/backend/executor/nodeAgg.c         | 20 ++++++++++++----
 src/backend/executor/nodeSort.c        |  2 +-
 src/backend/utils/adt/orderedsetaggs.c | 33 +++++++++++++++++--------
 src/backend/utils/sort/tuplesort.c     | 44 ++++++++++++++++++++++++++++------
 src/include/utils/tuplesort.h          |  4 ++--
 6 files changed, 79 insertions(+), 26 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c10be3d..c6737c2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3073,7 +3073,7 @@ validate_index_heapscan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  &ts_val, &ts_isnull);
+												  &ts_val, &ts_isnull, NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			if (!tuplesort_empty)
 			{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..2972180 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -539,7 +539,8 @@ fetch_input_tuple(AggState *aggstate)
 
 	if (aggstate->sort_in)
 	{
-		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot))
+		if (!tuplesort_gettupleslot(aggstate->sort_in, true,
+									aggstate->sort_slot, NULL))
 			return NULL;
 		slot = aggstate->sort_slot;
 	}
@@ -894,8 +895,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
  * The one-input case is handled separately from the multi-input case
  * for performance reasons: for single by-value inputs, such as the
  * common case of count(distinct id), the tuplesort_getdatum code path
- * is around 300% faster.  (The speedup for by-reference types is less
- * but still noticeable.)
+ * is around 300% faster.  (The speedup for by-reference types without
+ * abbreviated key support is less but still noticeable.)
  *
  * This function handles only one grouping set (already set in
  * aggstate->current_set).
@@ -913,6 +914,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	MemoryContext oldContext;
 	bool		isDistinct = (pertrans->numDistinctCols > 0);
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	FunctionCallInfo fcinfo = &pertrans->transfn_fcinfo;
 	Datum	   *newVal;
 	bool	   *isNull;
@@ -932,7 +935,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 */
 
 	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-							  true, newVal, isNull))
+							  true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -950,6 +953,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 			haveOldVal &&
 			((oldIsNull && *isNull) ||
 			 (!oldIsNull && !*isNull &&
+			  oldAbbrevVal == newAbbrevVal &&
 			  DatumGetBool(FunctionCall2(&pertrans->equalfns[0],
 										 oldVal, *newVal)))))
 		{
@@ -965,6 +969,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 				pfree(DatumGetPointer(oldVal));
 			/* and remember the new one for subsequent equality checks */
 			oldVal = *newVal;
+			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
@@ -1002,6 +1007,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 	TupleTableSlot *slot2 = pertrans->uniqslot;
 	int			numTransInputs = pertrans->numTransInputs;
 	int			numDistinctCols = pertrans->numDistinctCols;
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	bool		haveOldValue = false;
 	int			i;
 
@@ -1012,7 +1019,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 		ExecClearTuple(slot2);
 
 	while (tuplesort_gettupleslot(pertrans->sortstates[aggstate->current_set],
-								  true, slot1))
+								  true, slot1, &newAbbrevVal))
 	{
 		/*
 		 * Extract the first numTransInputs columns as datums to pass to the
@@ -1023,6 +1030,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 		if (numDistinctCols == 0 ||
 			!haveOldValue ||
+			newAbbrevVal != oldAbbrevVal ||
 			!execTuplesMatch(slot1, slot2,
 							 numDistinctCols,
 							 pertrans->sortColIdx,
@@ -1046,6 +1054,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 				slot2 = slot1;
 				slot1 = tmpslot;
+				/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+				oldAbbrevVal = newAbbrevVal;
 				haveOldValue = true;
 			}
 		}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..35dd993 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -137,7 +137,7 @@ ExecSort(SortState *node)
 	slot = node->ss.ps.ps_ResultTupleSlot;
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
-								  slot);
+								  slot, NULL);
 	return slot;
 }
 
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..1a5f696 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -453,7 +453,7 @@ percentile_disc_final(PG_FUNCTION_ARGS)
 			elog(ERROR, "missing row in percentile_disc");
 	}
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_disc");
 
 	/*
@@ -553,7 +553,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	if (!tuplesort_skiptuples(osastate->sortstate, first_row, true))
 		elog(ERROR, "missing row in percentile_cont");
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_cont");
 	if (isnull)
 		PG_RETURN_NULL();
@@ -564,7 +564,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	}
 	else
 	{
-		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull))
+		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull, NULL))
 			elog(ERROR, "missing row in percentile_cont");
 
 		if (isnull)
@@ -792,7 +792,7 @@ percentile_disc_multi_final(PG_FUNCTION_ARGS)
 				if (!tuplesort_skiptuples(osastate->sortstate, target_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_disc");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 					elog(ERROR, "missing row in percentile_disc");
 
 				rownum = target_row;
@@ -921,7 +921,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 				if (!tuplesort_skiptuples(osastate->sortstate, first_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_cont");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 
 				rownum = first_row;
@@ -941,7 +942,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 			/* Fetch second_row if needed */
 			if (second_row > rownum)
 			{
-				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 				rownum++;
 			}
@@ -1016,6 +1018,8 @@ mode_final(PG_FUNCTION_ARGS)
 	int64		last_val_freq = 0;
 	bool		last_val_is_mode = false;
 	FmgrInfo   *equalfn;
+	Datum		abbrev_val = (Datum) 0;
+	Datum		last_abbrev_val = (Datum) 0;
 	bool		shouldfree;
 
 	Assert(AggCheckCallContext(fcinfo, NULL) == AGG_CONTEXT_AGGREGATE);
@@ -1042,7 +1046,7 @@ mode_final(PG_FUNCTION_ARGS)
 	tuplesort_performsort(osastate->sortstate);
 
 	/* Scan tuples and count frequencies */
-	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, &abbrev_val))
 	{
 		/* we don't expect any nulls, but ignore them if found */
 		if (isnull)
@@ -1054,8 +1058,10 @@ mode_final(PG_FUNCTION_ARGS)
 			mode_val = last_val = val;
 			mode_freq = last_val_freq = 1;
 			last_val_is_mode = true;
+			last_abbrev_val = abbrev_val;
 		}
-		else if (DatumGetBool(FunctionCall2(equalfn, val, last_val)))
+		else if (abbrev_val == last_abbrev_val &&
+				 DatumGetBool(FunctionCall2(equalfn, val, last_val)))
 		{
 			/* value equal to previous value, count it */
 			if (last_val_is_mode)
@@ -1078,6 +1084,8 @@ mode_final(PG_FUNCTION_ARGS)
 			if (shouldfree && !last_val_is_mode)
 				pfree(DatumGetPointer(last_val));
 			last_val = val;
+			/* avoid equality function calls by reusing abbreviated keys */
+			last_abbrev_val = abbrev_val;
 			last_val_freq = 1;
 			last_val_is_mode = false;
 		}
@@ -1181,7 +1189,7 @@ hypothetical_rank_common(FunctionCallInfo fcinfo, int flag,
 	tuplesort_performsort(osastate->sortstate);
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, NULL))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1266,6 +1274,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	int64		duplicate_count = 0;
 	OSAPerGroupState *osastate;
 	int			numDistinctCols;
+	Datum		abbrevVal = (Datum) 0;
+	Datum		abbrevOld = (Datum) 0;
 	AttrNumber *sortColIdx;
 	FmgrInfo   *equalfns;
 	TupleTableSlot *slot;
@@ -1342,7 +1352,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	slot2 = extraslot;
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, &abbrevVal))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1353,6 +1363,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 
 		/* count non-distinct tuples */
 		if (!TupIsNull(slot2) &&
+			abbrevVal == abbrevOld &&
 			execTuplesMatch(slot, slot2,
 							numDistinctCols,
 							sortColIdx,
@@ -1363,6 +1374,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 		tmpslot = slot2;
 		slot2 = slot;
 		slot = tmpslot;
+		/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+		abbrevOld = abbrevVal;
 
 		rank++;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5e12c0b..4c7c2b1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1282,7 +1282,8 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to zeroed representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup.datum1 = original;
 	}
@@ -1351,7 +1352,11 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 
 	if (isNull || state->datumTypeByVal)
 	{
-		stup.datum1 = val;
+		/*
+		 * Set datum1 to zeroed representation for NULLs (to be consistent, and
+		 * to support cheap inequality tests for NULL abbreviated keys).
+		 */
+		stup.datum1 = !isNull ? val : (Datum) 0;
 		stup.isnull1 = isNull;
 		stup.tuple = NULL;		/* no separate storage */
 	}
@@ -1868,10 +1873,17 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
  * Fetch the next tuple in either forward or back direction.
  * If successful, put tuple in slot and return TRUE; else, clear the slot
  * and return FALSE.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  A
+ * NULL value in leading attribute will set abbreviated value to zeroed
+ * representation, which caller may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot)
+					   TupleTableSlot *slot, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1884,6 +1896,10 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
 
 	if (stup.tuple)
 	{
+		/* Record abbreviated key for caller */
+		if (state->sortKeys->abbrev_converter && abbrev)
+			*abbrev = stup.datum1;
+
 		ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, should_free);
 		return true;
 	}
@@ -1939,10 +1955,17 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward,
  *
  * If the Datum is pass-by-ref type, the returned value is freshly palloc'd
  * and is now owned by the caller.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  A
+ * NULL value will have a zeroed abbreviated value representation, which caller
+ * may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull)
+				   Datum *val, bool *isNull, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1954,6 +1977,10 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 		return false;
 	}
 
+	/* Record abbreviated key for caller */
+	if (state->sortKeys->abbrev_converter && abbrev)
+		*abbrev = stup.datum1;
+
 	if (stup.isnull1 || state->datumTypeByVal)
 	{
 		*val = stup.datum1;
@@ -3166,7 +3193,8 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to zeroed representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3408,7 +3436,8 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to zeroed representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3712,7 +3741,8 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to zeroed representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..9af1159 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -93,13 +93,13 @@ extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 extern void tuplesort_performsort(Tuplesortstate *state);
 
 extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot);
+					   TupleTableSlot *slot, Datum *abbrev);
 extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward,
 					   bool *should_free);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward,
 						bool *should_free);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull);
+				   Datum *val, bool *isNull, Datum *abbrev);
 
 extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 					 bool forward);
-- 
1.9.1

0001-Tighten-SortSupport-abbreviated-key-contract.patchtext/x-patch; charset=US-ASCII; name=0001-Tighten-SortSupport-abbreviated-key-contract.patchDownload
From 1fb207b7dfdba7265b2a7806416b9cf384931f4d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Thu, 9 Jul 2015 15:59:07 -0700
Subject: [PATCH 1/2] Tighten SortSupport abbreviated key contract

Revise the contract that SortSupport has with its clients to forbid an
abbreviated comparator ever returning 0 ("indeterminate") in the event
of two abbreviated keys not being bitwise equal.  This gives certain
cases the leeway to inexpensively determine tuple inequality from afar,
by comparing abbreviated keys using simple unsigned integer (Datum)
comparisons, and trusting that bitwise inequality amounts to leading
attribute inequality (and, therefore, tuple inequality).  Direct access
to SortSupport state is consequently not required, which will be
convenient for tuplesort clients (that only have an opaque state
pointer), while also saving us a few additional cycles.

Cheap inequality checks of this nature might otherwise give wrong
answers due (for example) to some opclass abbreviated comparator
interpreting non-identical abbreviated keys with respect to some state
built during memtuple copying (although that is probably still safe,
depending on the exact details).  Pass-by-reference abbreviated keys
could similarly cause problems, and so are now forbidden.  No existing
opclass with abbreviation support does anything like this, so no change
in any opclass SortSupport routine.

Finally, move code that handles the "one initial run" external sort
optimization in mergeruns() to occur immediately after (not immediately
before) SortSupport state has been set to be consistent with
abbreviation not being in play (this is needed because abbreviated keys
are never stored on tape).  The existing order of those two things is
not significant, but only because the question of further tuple
comparisons never arises with the "one initial run" external sort
optimization.  The question of cheap inequality tests can be expected to
arise in a future commit, though.  This is also tidier.

Backpatch to 9.5, where abbreviated keys were introduced.
---
 src/backend/utils/sort/tuplesort.c | 52 +++++++++++++++++++++++---------------
 src/include/utils/sortsupport.h    | 16 +++++++++++-
 2 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6572af8..5e12c0b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -930,7 +930,17 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->sortKeys->ssup_collation = sortCollation;
 	state->sortKeys->ssup_nulls_first = nullsFirstFlag;
 
-	/* abbreviation is possible here only for by-reference types */
+	/*
+	 * Abbreviation is possible here only for by-reference types.  Note that a
+	 * pass-by-value representation for abbreviated values is forbidden, but
+	 * that's a distinct, generic restriction imposed by the SortSupport
+	 * contract.
+	 *
+	 * It's possible for a pass-by-value type to have a pass-by-value
+	 * abbreviated key representation, presumably with cheaper comparisons, but
+	 * that could never be used here.  Abbreviated keys are mostly about cache
+	 * performance, so this doesn't seem like a serious restriction.
+	 */
 	state->sortKeys->abbreviate = !typbyval;
 
 	PrepareSortSupportFromOrderingOp(sortOperator, state->sortKeys);
@@ -1271,7 +1281,7 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup.datum1 = original;
@@ -2224,6 +2234,22 @@ mergeruns(Tuplesortstate *state)
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
 
+	if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+	{
+		/*
+		 * If there are multiple runs to be merged, when we go to read back
+		 * tuples from disk, abbreviated keys will not have been stored, and
+		 * we don't care to regenerate them.  Disable abbreviation from this
+		 * point on.
+		 */
+		state->sortKeys->abbrev_converter = NULL;
+		state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
+
+		/* Not strictly necessary, but be tidy */
+		state->sortKeys->abbrev_abort = NULL;
+		state->sortKeys->abbrev_full_comparator = NULL;
+	}
+
 	/*
 	 * If we produced only one initial run (quite likely if the total data
 	 * volume is between 1X and 2X workMem), we can just use that tape as the
@@ -2239,22 +2265,6 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
-	if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
-	{
-		/*
-		 * If there are multiple runs to be merged, when we go to read back
-		 * tuples from disk, abbreviated keys will not have been stored, and
-		 * we don't care to regenerate them.  Disable abbreviation from this
-		 * point on.
-		 */
-		state->sortKeys->abbrev_converter = NULL;
-		state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
-
-		/* Not strictly necessary, but be tidy */
-		state->sortKeys->abbrev_abort = NULL;
-		state->sortKeys->abbrev_full_comparator = NULL;
-	}
-
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -3155,7 +3165,7 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup->datum1 = original;
@@ -3397,7 +3407,7 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup->datum1 = original;
@@ -3701,7 +3711,7 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup->datum1 = original;
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 4258630..5a110f1 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -116,7 +116,7 @@ typedef struct SortSupportData
 	 *
 	 * This allows opclass authors to supply a conversion routine, used to
 	 * create an alternative representation of the underlying type (an
-	 * "abbreviated key").  Typically, this representation is an ad-hoc,
+	 * "abbreviated key").  This representation must have an ad-hoc,
 	 * pass-by-value Datum format that only the opclass has knowledge of.  An
 	 * alternative comparator, used only with this alternative representation
 	 * must also be provided (which is assigned to "comparator").  This
@@ -135,6 +135,20 @@ typedef struct SortSupportData
 	 * cache miss penalties are expensive; to get good overall performance,
 	 * sort infrastructure must heavily weigh cache performance.
 	 *
+	 * There is one limited way in which the core code has knowledge of any
+	 * abbreviated key format:  it uses it directly for inequality.
+	 * Specifically, it assumes that two non-bitwise-identical abbreviated keys
+	 * could never have their abbreviated comparator return a 0 or
+	 * "inconclusive" result; in other words, it assumes that abbreviated
+	 * bitwise inequality is a reliable proxy for authoritative value
+	 * inequality.  This restriction enables cheap inequality tests made from a
+	 * distance, without opclass involvement.  Furthermore, it implies that
+	 * abbreviated keys themselves must always be pass-by-value, since this
+	 * won't work with pointers.  Note that it is still permitted for the
+	 * abbreviated comparator to use state generated during calls to
+	 * abbrev_converter -- for example, a lookup table could be used, provided
+	 * any identifier baked into the abbreviated representation is canonical.
+	 *
 	 * Opclass authors must consider the final cardinality of abbreviated keys
 	 * when devising an encoding scheme.  It's possible for a strategy to work
 	 * better than an alternative strategy with one usage pattern, while the
-- 
1.9.1

#15Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#14)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Thu, Dec 17, 2015 at 11:43 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Dec 16, 2015 at 12:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

What kind of state is that? Can't we define this in terms of what it
is rather than how it gets that way?

It's zeroed.

I guess we can update everything, including existing comments, to reflect that.

Thanks, this looks much easier to understand from here.

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

-       /* abbreviation is possible here only for by-reference types */
+       /*
+        * Abbreviation is possible here only for by-reference types.
Note that a
+        * pass-by-value representation for abbreviated values is forbidden, but
+        * that's a distinct, generic restriction imposed by the SortSupport
+        * contract.

I think that you have not written what you meant to write here. I
think what you mean is "Note that a pass-by-REFERENCE representation
for abbreviated values is forbidden...".

+       /*
+        * If we produced only one initial run (quite likely if the total data
+        * volume is between 1X and 2X workMem), we can just use that
tape as the
+        * finished output, rather than doing a useless merge.  (This obvious
+        * optimization is not in Knuth's algorithm.)
+        */
+       if (state->currentRun == 1)
+       {
+               state->result_tape = state->tp_tapenum[state->destTape];
+               /* must freeze and rewind the finished output tape */
+               LogicalTapeFreeze(state->tapeset, state->result_tape);
+               state->status = TSS_SORTEDONTAPE;
+               return;
+       }

I don't understand the point of moving this code. If there's some
reason to do this after rewinding the tapes rather than beforehand, I
think we should articulate that reason in the comment block.

The last hunk in your 0001 patch properly belongs in 0002.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#15)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Fri, Dec 18, 2015 at 9:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

-       /* abbreviation is possible here only for by-reference types */
+       /*
+        * Abbreviation is possible here only for by-reference types.
Note that a
+        * pass-by-value representation for abbreviated values is forbidden, but
+        * that's a distinct, generic restriction imposed by the SortSupport
+        * contract.

I think that you have not written what you meant to write here. I
think what you mean is "Note that a pass-by-REFERENCE representation
for abbreviated values is forbidden...".

You're right. Sorry about that.

+       /*
+        * If we produced only one initial run (quite likely if the total data
+        * volume is between 1X and 2X workMem), we can just use that
tape as the
+        * finished output, rather than doing a useless merge.  (This obvious
+        * optimization is not in Knuth's algorithm.)
+        */
+       if (state->currentRun == 1)
+       {
+               state->result_tape = state->tp_tapenum[state->destTape];
+               /* must freeze and rewind the finished output tape */
+               LogicalTapeFreeze(state->tapeset, state->result_tape);
+               state->status = TSS_SORTEDONTAPE;
+               return;
+       }

I don't understand the point of moving this code. If there's some
reason to do this after rewinding the tapes rather than beforehand, I
think we should articulate that reason in the comment block.

I thought that was made clear by the 0001 commit message. Think of
what happens when we don't disable abbreviated in the final
TSS_SORTEDONTAPE phase should the "if (state->currentRun == 1)" path
have been taken (but *not* the path that also ends in
TSS_SORTEDONTAPE, when caller requires randomAccess but we spill to
tape, or any other case). What happens is: The code in 0002 gets
confused, and attempts to pass back a pointer value as an "abbreviated
key". That's a bug.

The last hunk in your 0001 patch properly belongs in 0002.

You could certainly argue that the last hunk of 0001 belongs in 0002.
I only moved it to 0001 when I realized that we might as well keep the
branches in sync, since the ordering is insignificant from a 9.5
perspective (although it might still be tidier), and there is a need
to backpatch anyway. I'm not insisting on doing it that way.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#16)
1 attachment(s)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Fri, Dec 18, 2015 at 2:22 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Dec 18, 2015 at 9:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

-       /* abbreviation is possible here only for by-reference types */
+       /*
+        * Abbreviation is possible here only for by-reference types.
Note that a
+        * pass-by-value representation for abbreviated values is forbidden, but
+        * that's a distinct, generic restriction imposed by the SortSupport
+        * contract.

I think that you have not written what you meant to write here. I
think what you mean is "Note that a pass-by-REFERENCE representation
for abbreviated values is forbidden...".

You're right. Sorry about that.

PFA my proposal for comment changes for 9.5 and master. This is based
on your 0001, but I edited somewhat. Please let me know your
thoughts. I am not willing to go further and rearrange actual code in
9.5 at this point; it just isn't necessary.

Looking at this whole system again, I wonder if we're missing a trick
here. How about if we decree from on high that the abbreviated-key
comparator is always just the application of an integer comparison
operator? The only abbreviation function that we have right now that
wouldn't be happy with that is the one for numeric, and that looks
like it could be fixed. Then we could hard-code knowledge of this
representation in tuplesort.c in such a way that it wouldn't need to
call a comparator function at all - instead of doing
ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
it would do something like:

if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
return compare;

I'm not sure if that would save enough cycles vs. the status quo to be
worth bothering with, but it seems like it might. You may have a
better feeling for that than I do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

sortsupport-comment-fixes-rmh.patchapplication/x-patch; name=sortsupport-comment-fixes-rmh.patchDownload
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6572af8..cf1cdcb 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -930,7 +930,14 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->sortKeys->ssup_collation = sortCollation;
 	state->sortKeys->ssup_nulls_first = nullsFirstFlag;
 
-	/* abbreviation is possible here only for by-reference types */
+	/*
+	 * Abbreviation is possible here only for by-reference types.  In theory,
+	 * a pass-by-value datatype could have an abbreviated form that is cheaper
+	 * to compare.  In a tuple sort, we could support that, because we can
+	 * always extract the original datum from the tuple is needed.  Here, we
+	 * can't, because a datum sort only stores a single copy of the datum;
+	 * the "tuple" field of each sortTuple is NULL.
+	 */
 	state->sortKeys->abbreviate = !typbyval;
 
 	PrepareSortSupportFromOrderingOp(sortOperator, state->sortKeys);
@@ -1271,7 +1278,7 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup.datum1 = original;
@@ -3155,7 +3162,7 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup->datum1 = original;
@@ -3397,7 +3404,7 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup->datum1 = original;
@@ -3701,7 +3708,7 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup->datum1 = original;
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 4258630..e890503 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -116,24 +116,25 @@ typedef struct SortSupportData
 	 *
 	 * This allows opclass authors to supply a conversion routine, used to
 	 * create an alternative representation of the underlying type (an
-	 * "abbreviated key").  Typically, this representation is an ad-hoc,
-	 * pass-by-value Datum format that only the opclass has knowledge of.  An
-	 * alternative comparator, used only with this alternative representation
-	 * must also be provided (which is assigned to "comparator").  This
-	 * representation is a simple approximation of the original Datum.  It
-	 * must be possible to compare datums of this representation with each
-	 * other using the supplied alternative comparator, and have any non-zero
-	 * return value be a reliable proxy for what a proper comparison would
-	 * indicate. Returning zero from the alternative comparator does not
-	 * indicate equality, as with a conventional support routine 1, though --
-	 * it indicates that it wasn't possible to determine how the two
-	 * abbreviated values compared.  A proper comparison, using
-	 * "abbrev_full_comparator"/ ApplySortAbbrevFullComparator() is therefore
-	 * required.  In many cases this results in most or all comparisons only
-	 * using the cheap alternative comparison func, which is typically
-	 * implemented as code that compiles to just a few CPU instructions.  CPU
-	 * cache miss penalties are expensive; to get good overall performance,
-	 * sort infrastructure must heavily weigh cache performance.
+	 * "abbreviated key").  This representation must be pass-by-value and
+	 * typically will use some ad-hoc format that only the opclass has
+	 * knowledge of.  An alternative comparator, used only with this
+	 * alternative representation must also be provided (which is assigned to
+	 * "comparator").  This representation is a simple approximation of the
+	 * original Datum.  It must be possible to compare datums of this
+	 * representation with each other using the supplied alternative
+	 * comparator, and have any non-zero return value be a reliable proxy for
+	 * what a proper comparison would indicate. Returning zero from the
+	 * alternative comparator does not indicate equality, as with a
+	 * conventional support routine 1, though -- it indicates that it wasn't
+	 * possible to determine how the two abbreviated values compared.  A
+	 * proper comparison, using "abbrev_full_comparator"/
+	 * ApplySortAbbrevFullComparator() is therefore required.  In many cases
+	 * this results in most or all comparisons only using the cheap
+	 * alternative comparison func, which is typically implemented as code
+	 * that compiles to just a few CPU instructions.  CPU cache miss penalties
+	 * are expensive; to get good overall performance, sort infrastructure
+	 * must heavily weigh cache performance.
 	 *
 	 * Opclass authors must consider the final cardinality of abbreviated keys
 	 * when devising an encoding scheme.  It's possible for a strategy to work
#18Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#17)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Mon, Dec 21, 2015 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

PFA my proposal for comment changes for 9.5 and master. This is based
on your 0001, but I edited somewhat. Please let me know your
thoughts. I am not willing to go further and rearrange actual code in
9.5 at this point; it just isn't necessary.

Fine by me. But this revision hasn't made the important point at all
-- which is that 0002 is safe. That's a stronger guarantee than the
abbreviated key representation being pass-by-value.

Looking at this whole system again, I wonder if we're missing a trick
here. How about if we decree from on high that the abbreviated-key
comparator is always just the application of an integer comparison
operator? The only abbreviation function that we have right now that
wouldn't be happy with that is the one for numeric, and that looks
like it could be fixed.

I gather you mean that we'd decree that they were always compared
using a text or uuid style 3-way unsigned integer comparison, allowing
us to hardcode that.

Then we could hard-code knowledge of this
representation in tuplesort.c in such a way that it wouldn't need to
call a comparator function at all - instead of doing
ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
it would do something like:

if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
return compare;

I'm not sure if that would save enough cycles vs. the status quo to be
worth bothering with, but it seems like it might. You may have a
better feeling for that than I do.

I like this idea. I thought of it myself already, actually.

It has some nice properties, because unsigned integers are so simple
and flexible. For example, we can do it with int4 sometimes, and then
compare two attributes at once on 64-bit platforms. Maybe the second
attribute (the second set of 4 bytes) isn't an int4, though -- it's
any other type's abbreviated key (which can be assumed to have a
compatible representation). That's one more advanced possibility.

It seems worthwhile because numeric is the odd one out. Once I or
someone else gets around to adding network types, which could happen
in 9.6, then we are done (because there is already a pending patch
that adds support for remaining character-like types and alternative
opclasses). I really don't foresee much additional use of abbreviated
keys beyond these cases. There'd be a small but nice boost by not
doing pointer chasing for the abbreviated key comparison, I imagine,
but that benefit would be felt everywhere. Numeric is the odd one out
currently, but as you say it can be fixed, and I foresee no other
trouble from any other opclass/type that cares about abbreviation.

Another issue is that abbreviated keys in indexes are probably not
going to take 8 bytes, because they'll go in the ItemId array. It's
likely to be very useful to be able to take the first two bytes, and
know that a uint16 comparison is all that is needed, and have a basis
to perform an interpolation search rather than a binary search
(although that needs to be adaptive to different distributions, I
think it could be an effective technique -- there are many cache
misses for binary searches on internal B-Tree pages).

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#18)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Mon, Dec 21, 2015 at 2:55 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Dec 21, 2015 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

PFA my proposal for comment changes for 9.5 and master. This is based
on your 0001, but I edited somewhat. Please let me know your
thoughts. I am not willing to go further and rearrange actual code in
9.5 at this point; it just isn't necessary.

Fine by me. But this revision hasn't made the important point at all
-- which is that 0002 is safe. That's a stronger guarantee than the
abbreviated key representation being pass-by-value.

Right. I don't think that we should back-patch that stuff into 9.5.

Looking at this whole system again, I wonder if we're missing a trick
here. How about if we decree from on high that the abbreviated-key
comparator is always just the application of an integer comparison
operator? The only abbreviation function that we have right now that
wouldn't be happy with that is the one for numeric, and that looks
like it could be fixed.

I gather you mean that we'd decree that they were always compared
using a text or uuid style 3-way unsigned integer comparison, allowing
us to hardcode that.

Right.

Then we could hard-code knowledge of this
representation in tuplesort.c in such a way that it wouldn't need to
call a comparator function at all - instead of doing
ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
it would do something like:

if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
return compare;

I'm not sure if that would save enough cycles vs. the status quo to be
worth bothering with, but it seems like it might. You may have a
better feeling for that than I do.

I like this idea. I thought of it myself already, actually.

It has some nice properties, because unsigned integers are so simple
and flexible. For example, we can do it with int4 sometimes, and then
compare two attributes at once on 64-bit platforms. Maybe the second
attribute (the second set of 4 bytes) isn't an int4, though -- it's
any other type's abbreviated key (which can be assumed to have a
compatible representation). That's one more advanced possibility.

Yikes.

It seems worthwhile because numeric is the odd one out. Once I or
someone else gets around to adding network types, which could happen
in 9.6, then we are done (because there is already a pending patch
that adds support for remaining character-like types and alternative
opclasses). I really don't foresee much additional use of abbreviated
keys beyond these cases. There'd be a small but nice boost by not
doing pointer chasing for the abbreviated key comparison, I imagine,
but that benefit would be felt everywhere. Numeric is the odd one out
currently, but as you say it can be fixed, and I foresee no other
trouble from any other opclass/type that cares about abbreviation.

Another issue is that abbreviated keys in indexes are probably not
going to take 8 bytes, because they'll go in the ItemId array. It's
likely to be very useful to be able to take the first two bytes, and
know that a uint16 comparison is all that is needed, and have a basis
to perform an interpolation search rather than a binary search
(although that needs to be adaptive to different distributions, I
think it could be an effective technique -- there are many cache
misses for binary searches on internal B-Tree pages).

Hmm, that seems iffy to me. There are plenty of data sets where 8
bytes is enough to get some meaningful information about what part of
the keyspace you're in, but 2 bytes isn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#19)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Tue, Dec 22, 2015 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

It has some nice properties, because unsigned integers are so simple
and flexible. For example, we can do it with int4 sometimes, and then
compare two attributes at once on 64-bit platforms. Maybe the second
attribute (the second set of 4 bytes) isn't an int4, though -- it's
any other type's abbreviated key (which can be assumed to have a
compatible representation). That's one more advanced possibility.

Yikes.

I'm not suggesting we'd want to do that immediately, or even at all.
My point is -- stuff like this becomes possible. My intuition is that
it might come up in a bunch of places.

Another issue is that abbreviated keys in indexes are probably not
going to take 8 bytes, because they'll go in the ItemId array. It's
likely to be very useful to be able to take the first two bytes, and
know that a uint16 comparison is all that is needed, and have a basis
to perform an interpolation search rather than a binary search
(although that needs to be adaptive to different distributions, I
think it could be an effective technique -- there are many cache
misses for binary searches on internal B-Tree pages).

Hmm, that seems iffy to me. There are plenty of data sets where 8
bytes is enough to get some meaningful information about what part of
the keyspace you're in, but 2 bytes isn't.

Your experience with abbreviated keys for sorting is unlikely to
perfectly carry over here. That's because the 2 bytes are only put in
the internal pages (typically less than 1% of the total). These
internal pages typically have relatively low density anyway (we don't
use the The B* tree technique), so I think that the level of bloat
would be tiny. At the same time, these are the range of values that
naturally have very wide fan-out. For example, the entire range of
values in the index is represented at the root page.

All of that having been said, yes, it could still be useless. But
then, the overhead will still tend to be nill, due to the fact that
abbreviated key generation can be so effectively amortized (it happens
only once, per page split), and because branch prediction will tend to
remove any penalty.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#19)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Tue, Dec 22, 2015 at 12:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 21, 2015 at 2:55 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Dec 21, 2015 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

PFA my proposal for comment changes for 9.5 and master. This is based
on your 0001, but I edited somewhat. Please let me know your
thoughts. I am not willing to go further and rearrange actual code in
9.5 at this point; it just isn't necessary.

Fine by me. But this revision hasn't made the important point at all
-- which is that 0002 is safe. That's a stronger guarantee than the
abbreviated key representation being pass-by-value.

Right. I don't think that we should back-patch that stuff into 9.5.

OK, so I've gone ahead and committed and back-patched that. Can you
please rebase and repost the remainder as a 9.6 proposal?

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#21)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Tue, Dec 22, 2015 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Right. I don't think that we should back-patch that stuff into 9.5.

OK, so I've gone ahead and committed and back-patched that. Can you
please rebase and repost the remainder as a 9.6 proposal?

OK. I don't know why you didn't acknowledge in your revision to
sortsupport.h that bitwise inequality must be a reliable proxy for
authoritative value inequality, which is a stronger restriction than
mandating that abbreviated keys always be pass-by-value, but I'm not
going to argue.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#21)
1 attachment(s)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Tue, Dec 22, 2015 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

OK, so I've gone ahead and committed and back-patched that. Can you
please rebase and repost the remainder as a 9.6 proposal?

I attach a rebased patch for 9.6 only.

--
Peter Geoghegan

Attachments:

0001-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchtext/x-patch; charset=US-ASCII; name=0001-Reuse-abbreviated-keys-in-ordered-set-aggregates.patchDownload
From 4bc33a602822ef0d56222f03174d71bbb4cd126b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Thu, 9 Jul 2015 15:59:07 -0700
Subject: [PATCH] Reuse abbreviated keys in ordered [set] aggregates

When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.  Only strict abbreviated key binary inequality is considered,
which is safe.
---
 src/backend/catalog/index.c            |  2 +-
 src/backend/executor/nodeAgg.c         | 20 ++++++---
 src/backend/executor/nodeSort.c        |  2 +-
 src/backend/utils/adt/orderedsetaggs.c | 33 ++++++++++-----
 src/backend/utils/sort/tuplesort.c     | 76 ++++++++++++++++++++++++----------
 src/include/utils/tuplesort.h          |  4 +-
 6 files changed, 95 insertions(+), 42 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c10be3d..c6737c2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3073,7 +3073,7 @@ validate_index_heapscan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  &ts_val, &ts_isnull);
+												  &ts_val, &ts_isnull, NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			if (!tuplesort_empty)
 			{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..f007c38 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -539,7 +539,8 @@ fetch_input_tuple(AggState *aggstate)
 
 	if (aggstate->sort_in)
 	{
-		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot))
+		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot,
+									NULL))
 			return NULL;
 		slot = aggstate->sort_slot;
 	}
@@ -894,8 +895,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
  * The one-input case is handled separately from the multi-input case
  * for performance reasons: for single by-value inputs, such as the
  * common case of count(distinct id), the tuplesort_getdatum code path
- * is around 300% faster.  (The speedup for by-reference types is less
- * but still noticeable.)
+ * is around 300% faster.  (The speedup for by-reference types without
+ * abbreviated key support is less but still noticeable.)
  *
  * This function handles only one grouping set (already set in
  * aggstate->current_set).
@@ -913,6 +914,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	MemoryContext oldContext;
 	bool		isDistinct = (pertrans->numDistinctCols > 0);
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	FunctionCallInfo fcinfo = &pertrans->transfn_fcinfo;
 	Datum	   *newVal;
 	bool	   *isNull;
@@ -932,7 +935,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 */
 
 	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-							  true, newVal, isNull))
+							  true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -950,6 +953,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 			haveOldVal &&
 			((oldIsNull && *isNull) ||
 			 (!oldIsNull && !*isNull &&
+			  oldAbbrevVal == newAbbrevVal &&
 			  DatumGetBool(FunctionCall2(&pertrans->equalfns[0],
 										 oldVal, *newVal)))))
 		{
@@ -965,6 +969,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 				pfree(DatumGetPointer(oldVal));
 			/* and remember the new one for subsequent equality checks */
 			oldVal = *newVal;
+			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
@@ -1002,6 +1007,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 	TupleTableSlot *slot2 = pertrans->uniqslot;
 	int			numTransInputs = pertrans->numTransInputs;
 	int			numDistinctCols = pertrans->numDistinctCols;
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	bool		haveOldValue = false;
 	int			i;
 
@@ -1012,7 +1019,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 		ExecClearTuple(slot2);
 
 	while (tuplesort_gettupleslot(pertrans->sortstates[aggstate->current_set],
-								  true, slot1))
+								  true, slot1, &newAbbrevVal))
 	{
 		/*
 		 * Extract the first numTransInputs columns as datums to pass to the
@@ -1023,6 +1030,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 		if (numDistinctCols == 0 ||
 			!haveOldValue ||
+			newAbbrevVal != oldAbbrevVal ||
 			!execTuplesMatch(slot1, slot2,
 							 numDistinctCols,
 							 pertrans->sortColIdx,
@@ -1046,6 +1054,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 				slot2 = slot1;
 				slot1 = tmpslot;
+				/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+				oldAbbrevVal = newAbbrevVal;
 				haveOldValue = true;
 			}
 		}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..35dd993 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -137,7 +137,7 @@ ExecSort(SortState *node)
 	slot = node->ss.ps.ps_ResultTupleSlot;
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
-								  slot);
+								  slot, NULL);
 	return slot;
 }
 
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..1a5f696 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -453,7 +453,7 @@ percentile_disc_final(PG_FUNCTION_ARGS)
 			elog(ERROR, "missing row in percentile_disc");
 	}
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_disc");
 
 	/*
@@ -553,7 +553,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	if (!tuplesort_skiptuples(osastate->sortstate, first_row, true))
 		elog(ERROR, "missing row in percentile_cont");
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_cont");
 	if (isnull)
 		PG_RETURN_NULL();
@@ -564,7 +564,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	}
 	else
 	{
-		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull))
+		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull, NULL))
 			elog(ERROR, "missing row in percentile_cont");
 
 		if (isnull)
@@ -792,7 +792,7 @@ percentile_disc_multi_final(PG_FUNCTION_ARGS)
 				if (!tuplesort_skiptuples(osastate->sortstate, target_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_disc");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 					elog(ERROR, "missing row in percentile_disc");
 
 				rownum = target_row;
@@ -921,7 +921,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 				if (!tuplesort_skiptuples(osastate->sortstate, first_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_cont");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 
 				rownum = first_row;
@@ -941,7 +942,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 			/* Fetch second_row if needed */
 			if (second_row > rownum)
 			{
-				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 				rownum++;
 			}
@@ -1016,6 +1018,8 @@ mode_final(PG_FUNCTION_ARGS)
 	int64		last_val_freq = 0;
 	bool		last_val_is_mode = false;
 	FmgrInfo   *equalfn;
+	Datum		abbrev_val = (Datum) 0;
+	Datum		last_abbrev_val = (Datum) 0;
 	bool		shouldfree;
 
 	Assert(AggCheckCallContext(fcinfo, NULL) == AGG_CONTEXT_AGGREGATE);
@@ -1042,7 +1046,7 @@ mode_final(PG_FUNCTION_ARGS)
 	tuplesort_performsort(osastate->sortstate);
 
 	/* Scan tuples and count frequencies */
-	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, &abbrev_val))
 	{
 		/* we don't expect any nulls, but ignore them if found */
 		if (isnull)
@@ -1054,8 +1058,10 @@ mode_final(PG_FUNCTION_ARGS)
 			mode_val = last_val = val;
 			mode_freq = last_val_freq = 1;
 			last_val_is_mode = true;
+			last_abbrev_val = abbrev_val;
 		}
-		else if (DatumGetBool(FunctionCall2(equalfn, val, last_val)))
+		else if (abbrev_val == last_abbrev_val &&
+				 DatumGetBool(FunctionCall2(equalfn, val, last_val)))
 		{
 			/* value equal to previous value, count it */
 			if (last_val_is_mode)
@@ -1078,6 +1084,8 @@ mode_final(PG_FUNCTION_ARGS)
 			if (shouldfree && !last_val_is_mode)
 				pfree(DatumGetPointer(last_val));
 			last_val = val;
+			/* avoid equality function calls by reusing abbreviated keys */
+			last_abbrev_val = abbrev_val;
 			last_val_freq = 1;
 			last_val_is_mode = false;
 		}
@@ -1181,7 +1189,7 @@ hypothetical_rank_common(FunctionCallInfo fcinfo, int flag,
 	tuplesort_performsort(osastate->sortstate);
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, NULL))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1266,6 +1274,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	int64		duplicate_count = 0;
 	OSAPerGroupState *osastate;
 	int			numDistinctCols;
+	Datum		abbrevVal = (Datum) 0;
+	Datum		abbrevOld = (Datum) 0;
 	AttrNumber *sortColIdx;
 	FmgrInfo   *equalfns;
 	TupleTableSlot *slot;
@@ -1342,7 +1352,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	slot2 = extraslot;
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, &abbrevVal))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1353,6 +1363,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 
 		/* count non-distinct tuples */
 		if (!TupIsNull(slot2) &&
+			abbrevVal == abbrevOld &&
 			execTuplesMatch(slot, slot2,
 							numDistinctCols,
 							sortColIdx,
@@ -1363,6 +1374,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 		tmpslot = slot2;
 		slot2 = slot;
 		slot = tmpslot;
+		/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+		abbrevOld = abbrevVal;
 
 		rank++;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index cf1cdcb..f2fd5d1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1279,7 +1279,8 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to zeroed representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup.datum1 = original;
 	}
@@ -1348,7 +1349,11 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 
 	if (isNull || state->datumTypeByVal)
 	{
-		stup.datum1 = val;
+		/*
+		 * Set datum1 to zeroed representation for NULLs (to be consistent, and
+		 * to support cheap inequality tests for NULL abbreviated keys).
+		 */
+		stup.datum1 = !isNull ? val : (Datum) 0;
 		stup.isnull1 = isNull;
 		stup.tuple = NULL;		/* no separate storage */
 	}
@@ -1865,10 +1870,17 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
  * Fetch the next tuple in either forward or back direction.
  * If successful, put tuple in slot and return TRUE; else, clear the slot
  * and return FALSE.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  A
+ * NULL value in leading attribute will set abbreviated value to zeroed
+ * representation, which caller may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot)
+					   TupleTableSlot *slot, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1881,6 +1893,10 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
 
 	if (stup.tuple)
 	{
+		/* Record abbreviated key for caller */
+		if (state->sortKeys->abbrev_converter && abbrev)
+			*abbrev = stup.datum1;
+
 		ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, should_free);
 		return true;
 	}
@@ -1936,10 +1952,17 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward,
  *
  * If the Datum is pass-by-ref type, the returned value is freshly palloc'd
  * and is now owned by the caller.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  A
+ * NULL value will have a zeroed abbreviated value representation, which caller
+ * may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull)
+				   Datum *val, bool *isNull, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1951,6 +1974,10 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 		return false;
 	}
 
+	/* Record abbreviated key for caller */
+	if (state->sortKeys->abbrev_converter && abbrev)
+		*abbrev = stup.datum1;
+
 	if (stup.isnull1 || state->datumTypeByVal)
 	{
 		*val = stup.datum1;
@@ -2231,6 +2258,22 @@ mergeruns(Tuplesortstate *state)
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
 
+	if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+	{
+		/*
+		 * If there are multiple runs to be merged, when we go to read back
+		 * tuples from disk, abbreviated keys will not have been stored, and
+		 * we don't care to regenerate them.  Disable abbreviation from this
+		 * point on.
+		 */
+		state->sortKeys->abbrev_converter = NULL;
+		state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
+
+		/* Not strictly necessary, but be tidy */
+		state->sortKeys->abbrev_abort = NULL;
+		state->sortKeys->abbrev_full_comparator = NULL;
+	}
+
 	/*
 	 * If we produced only one initial run (quite likely if the total data
 	 * volume is between 1X and 2X workMem), we can just use that tape as the
@@ -2246,22 +2289,6 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
-	if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
-	{
-		/*
-		 * If there are multiple runs to be merged, when we go to read back
-		 * tuples from disk, abbreviated keys will not have been stored, and
-		 * we don't care to regenerate them.  Disable abbreviation from this
-		 * point on.
-		 */
-		state->sortKeys->abbrev_converter = NULL;
-		state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
-
-		/* Not strictly necessary, but be tidy */
-		state->sortKeys->abbrev_abort = NULL;
-		state->sortKeys->abbrev_full_comparator = NULL;
-	}
-
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -3163,7 +3190,8 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to zeroed representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3405,7 +3433,8 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to zeroed representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3709,7 +3738,8 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to zeroed representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..9af1159 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -93,13 +93,13 @@ extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 extern void tuplesort_performsort(Tuplesortstate *state);
 
 extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot);
+					   TupleTableSlot *slot, Datum *abbrev);
 extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward,
 					   bool *should_free);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward,
 						bool *should_free);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull);
+				   Datum *val, bool *isNull, Datum *abbrev);
 
 extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 					 bool forward);
-- 
1.9.1

#24Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#20)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Tue, Dec 22, 2015 at 10:49 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Dec 22, 2015 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

It has some nice properties, because unsigned integers are so simple
and flexible. For example, we can do it with int4 sometimes, and then
compare two attributes at once on 64-bit platforms. Maybe the second
attribute (the second set of 4 bytes) isn't an int4, though -- it's
any other type's abbreviated key (which can be assumed to have a
compatible representation). That's one more advanced possibility.

Yikes.

I'm not suggesting we'd want to do that immediately, or even at all.
My point is -- stuff like this becomes possible. My intuition is that
it might come up in a bunch of places.

I had a better idea, that is based on the same intuition that we can
leverage the flexibility of that standardized representation of
abbreviated keys to the hilt.

We use one bit to represent whether this is the first run, or a
subsequent run (in a world where replacement selection is only used
for "quicksort with spillover", and run number isn't that
interesting), and another bit to represent NULLness. These are the
most significant and second most significant bits, respectively. Fill
all remaining bits with your unsigned abbreviated key representation,
without regard to the underlying details of the type. Now, you just
found a way to cut the size of SortTuples significantly, to just two
pointers (whereas with alignment considerations, SortTuples currently
consume enough memory to store 3 pointers). You also just found a way
of significantly cutting the number of instructions that are needed
for the hot code path, because you don't really need an
abbreviation-based ApplySortComparator() inline function to interpret
things through -- whether or not NULLs are first or last, and how
NULLs are considered generally is all baked into the abbreviated
representation from the start.

This is certainly speculative stuff, and having multiple versions of
SortTuple would be inconvenient, but the point again is that there are
considerable upsides to a standardized representation (abbreviated
keys always being compared as unsigned integers), and no downsides.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#23)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Tue, Dec 22, 2015 at 10:49 PM, Peter Geoghegan <pg@heroku.com> wrote:

I attach a rebased patch for 9.6 only.

I marked the patch -- my own patch -- "Ready for Committer". I'm the
third person to have marked the patch "Ready for Committer", even
though no committer bounced the patch back for review by the reviewer,
Andreas:

https://commitfest.postgresql.org/9/300/

First Andreas did so, then Michael, and finally myself. The problem is
that you see things like this:

"Closed in commitfest 2015-11 with status: Moved to next CF"

Michael (the CF manager at the time) remembered to change the status
to "Ready for Committer" again; you see this entry immediately
afterwards:

"New status: Ready for Committer"

but I gather from the CF app history that Alvaro (the current CF
manager) did not remember to do that second step when he later moved
the patch to the "next" (current) CF. Or maybe he just wasn't aware of
this quirk of the CF app.

I don't have a problem with having to resubmit a patch to the next CF
manually, but it's easy to miss that the status has been changed from
"Ready for Committer" to "Needs Review". I don't think anyone ever
argued for that, and this patch was accidentally in "Needs Review"
state for about 5 days. Can we fix it?

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#25)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

Peter Geoghegan wrote:

Michael (the CF manager at the time) remembered to change the status
to "Ready for Committer" again; you see this entry immediately
afterwards:

"New status: Ready for Committer"

but I gather from the CF app history that Alvaro (the current CF
manager) did not remember to do that second step when he later moved
the patch to the "next" (current) CF. Or maybe he just wasn't aware of
this quirk of the CF app.

That seems a bug in the CF app to me.

(FWIW I'm not the "current" CF manager, because the CF I managed is
already over. I'm not sure that we know who the manager for the
upcoming one is.)

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#26)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Mon, Feb 15, 2016 at 12:58 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

(FWIW I'm not the "current" CF manager, because the CF I managed is
already over. I'm not sure that we know who the manager for the
upcoming one is.)

It's pretty easy to forget that this was the first time in a long time
that the CF wasn't closed because the next CF began. :-)

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#26)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

(FWIW I'm not the "current" CF manager, because the CF I managed is
already over. I'm not sure that we know who the manager for the
upcoming one is.)

We had a vict^H^H^H^Hvolunteer a few days ago:

/messages/by-id/56B91A6D.6030908@pgmasters.net

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29Michael Paquier
michael.paquier@gmail.com
In reply to: Alvaro Herrera (#26)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Tue, Feb 16, 2016 at 5:58 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Peter Geoghegan wrote:

Michael (the CF manager at the time) remembered to change the status
to "Ready for Committer" again; you see this entry immediately
afterwards:

"New status: Ready for Committer"

but I gather from the CF app history that Alvaro (the current CF
manager) did not remember to do that second step when he later moved
the patch to the "next" (current) CF. Or maybe he just wasn't aware of
this quirk of the CF app.

I recall switching this patch as "Ready for committer" based on the
status of the thread.

That seems a bug in the CF app to me.

(FWIW I'm not the "current" CF manager, because the CF I managed is
already over. I'm not sure that we know who the manager for the
upcoming one is.)

When a patch with status "Ready for committer" on CF N is moved to CF
(N+1), its status is automatically changed to "Needs Review". That's
something to be aware of when cleaning up the CF app entries.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Michael Paquier (#29)
Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

On 2/16/16 12:38 AM, Michael Paquier wrote:

When a patch with status "Ready for committer" on CF N is moved to CF
(N+1), its status is automatically changed to "Needs Review". That's
something to be aware of when cleaning up the CF app entries.

I agree with Alvarro; this seems like a bug to me. If a patch is ready
for committer in CF N, why would it suddenly not be ready in N+1?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jim Nasby (#30)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

2016-02-17 3:19 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:

On 2/16/16 12:38 AM, Michael Paquier wrote:

When a patch with status "Ready for committer" on CF N is moved to CF
(N+1), its status is automatically changed to "Needs Review". That's
something to be aware of when cleaning up the CF app entries.

I agree with Alvarro; this seems like a bug to me. If a patch is ready for
committer in CF N, why would it suddenly not be ready in N+1?

+1,

This behave is pretty unpleasant and frustrating.

Regards

Pavel

Show quoted text

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#23)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Dec 23, 2015 at 12:19 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Dec 22, 2015 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

OK, so I've gone ahead and committed and back-patched that. Can you
please rebase and repost the remainder as a 9.6 proposal?

I attach a rebased patch for 9.6 only.

Committed, except I left out one comment hunk that I wasn't convinced about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#32)
Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Wed, Feb 17, 2016 at 2:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed, except I left out one comment hunk that I wasn't convinced about.

Thank you.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34Magnus Hagander
magnus@hagander.net
In reply to: Pavel Stehule (#31)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

On Tue, Feb 16, 2016 at 11:12 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2016-02-17 3:19 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:

On 2/16/16 12:38 AM, Michael Paquier wrote:

When a patch with status "Ready for committer" on CF N is moved to CF
(N+1), its status is automatically changed to "Needs Review". That's
something to be aware of when cleaning up the CF app entries.

I agree with Alvarro; this seems like a bug to me. If a patch is ready
for committer in CF N, why would it suddenly not be ready in N+1?

+1,

This behave is pretty unpleasant and frustrating.

Well, it's in no way a bug, because it's the behavior we agreed upon at the
time :)

That said, we can certainly reconsider that. Would we always copy the value
over? Even if it was, say, rejected? (so it would be copied to the new CF
but still marked rejected) Or is there a subset of behaviors you're looking
for?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Magnus Hagander (#34)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

Magnus Hagander wrote:

On Tue, Feb 16, 2016 at 11:12 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

This behave is pretty unpleasant and frustrating.

Well, it's in no way a bug, because it's the behavior we agreed upon at the
time :)

I guess the "move to next CF" operation is new enough that we didn't yet
know what was the most useful behavior.

That said, we can certainly reconsider that. Would we always copy the value
over? Even if it was, say, rejected? (so it would be copied to the new CF
but still marked rejected) Or is there a subset of behaviors you're looking
for?

I think the states "Ready for Committer" and "Needs Review" ought to be
kept in the new CF.

I'm unclear on what to do about "Returned with Feedback" and "Waiting on
Author"; my first instinct is that if a patch is in those states, then
it shouldn't be possible to move to the next CF. On the other hand, if
we force the state to change to "Needs Review" before moving it, we
would lose the information of what state it was closed with. So perhaps
for any patch in those two states, the state in the next CF should be
"needs review" too.

I am even more unclear on "Rejected". My instinct says we should refuse
a move-to-next-cf for such patches.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#35)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Magnus Hagander wrote:

That said, we can certainly reconsider that. Would we always copy the value
over? Even if it was, say, rejected? (so it would be copied to the new CF
but still marked rejected) Or is there a subset of behaviors you're looking
for?

I think the states "Ready for Committer" and "Needs Review" ought to be
kept in the new CF.

Definitely.

I'm unclear on what to do about "Returned with Feedback" and "Waiting on
Author"; my first instinct is that if a patch is in those states, then
it shouldn't be possible to move to the next CF. On the other hand, if
we force the state to change to "Needs Review" before moving it, we
would lose the information of what state it was closed with. So perhaps
for any patch in those two states, the state in the next CF should be
"needs review" too.

+1 for not moving such patches to the new CF until the author does
something --- at which point they'd change to "Needs Review" state.
But we should not change them into that state without author input.
And I don't see the value of having them in a new CF until the
author does something.

I am even more unclear on "Rejected". My instinct says we should refuse
a move-to-next-cf for such patches.

Right. Rejected is dead, it shouldn't propagate forward.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#36)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

On Tue, Mar 1, 2016 at 7:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

+1 for not moving such patches to the new CF until the author does
something --- at which point they'd change to "Needs Review" state.
But we should not change them into that state without author input.
And I don't see the value of having them in a new CF until the
author does something.

To be clear: My position was always that it's good that the author has
to do *something* to get their patch into the next CF. It's bad that
this change in state can easily be missed, though. I've now been on
both sides of this, as a patch author and patch reviewer. If the patch
was left as "Waiting on Author", as my review of Alexander's patch
was, then it ought to not change to "Needs Review" silently. That
makes absolutely no sense.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#36)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

On Tue, Mar 1, 2016 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Magnus Hagander wrote:

That said, we can certainly reconsider that. Would we always copy the

value

over? Even if it was, say, rejected? (so it would be copied to the new

CF

but still marked rejected) Or is there a subset of behaviors you're

looking

for?

I think the states "Ready for Committer" and "Needs Review" ought to be
kept in the new CF.

Definitely.

I'm unclear on what to do about "Returned with Feedback" and "Waiting on
Author"; my first instinct is that if a patch is in those states, then
it shouldn't be possible to move to the next CF. On the other hand, if
we force the state to change to "Needs Review" before moving it, we
would lose the information of what state it was closed with. So perhaps
for any patch in those two states, the state in the next CF should be
"needs review" too.

+1 for not moving such patches to the new CF until the author does
something --- at which point they'd change to "Needs Review" state.
But we should not change them into that state without author input.
And I don't see the value of having them in a new CF until the
author does something.

Well, "waiting on author" is not considered to be a "closing". So as long
as we have patches in that state we want to do *something* with them in a
CF. Which currently can be "move to next cf", "reject" or "return with
feedback". So basically it means we have to decide if we want to reject it
or return-with-feedback it, if we say it should not be possible to "move to
next cf".

The problem is if the author already has something ready of course, in
which case it will become a new entry on the new CF instead of a moved one
-- which means we loose the historical connection between those two.

I am even more unclear on "Rejected". My instinct says we should refuse
a move-to-next-cf for such patches.

Right. Rejected is dead, it shouldn't propagate forward.

Ok, so let's try to summarize. We want the following actinos when someone
clicks "move to next cf":

Needs Review -> Needs Review
Waiting on Author -> Refuse moving
Ready for committer -> Ready for Committer
Committed -> refuse moving
Moved to next cf -> refuse moving (if it's already set like this, it would
probably be a bug)
Returned with feedback -> Refuse moving

Which basically means we only move things that are in needs review or ready
for committer state, and that we never actually change the status of a
patch in the moving.

Is that a correct summary?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#39Michael Paquier
michael.paquier@gmail.com
In reply to: Magnus Hagander (#38)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

On Wed, Mar 2, 2016 at 9:19 PM, Magnus Hagander <magnus@hagander.net> wrote:

Needs Review -> Needs Review
Waiting on Author -> Refuse moving
Ready for committer -> Ready for Committer
Committed -> refuse moving
Moved to next cf -> refuse moving (if it's already set like this, it would
probably be a bug)
Returned with feedback -> Refuse moving
Which basically means we only move things that are in needs review or ready
for committer state, and that we never actually change the status of a patch
in the moving.

Is that a correct summary?

Yes, thanks for the summary. It looks like this is what people would
expect based on what I read here.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#39)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

On Wed, Mar 2, 2016 at 7:34 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Wed, Mar 2, 2016 at 9:19 PM, Magnus Hagander <magnus@hagander.net>
wrote:

Needs Review -> Needs Review
Waiting on Author -> Refuse moving
Ready for committer -> Ready for Committer
Committed -> refuse moving
Moved to next cf -> refuse moving (if it's already set like this, it

would

probably be a bug)
Returned with feedback -> Refuse moving
Which basically means we only move things that are in needs review or

ready

for committer state, and that we never actually change the status of a

patch

in the moving.

Is that a correct summary?

Yes, thanks for the summary. It looks like this is what people would
expect based on what I read here.

Ok, I've pushed a code that does that.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#41Peter Geoghegan
pg@heroku.com
In reply to: Magnus Hagander (#40)
Re: Commitfest Bug (was: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates)

On Wed, Mar 2, 2016 at 5:41 AM, Magnus Hagander <magnus@hagander.net> wrote:

Ok, I've pushed a code that does that.

Thank you.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers