PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

Started by Jon Nelson almost 12 years ago · 10 messages

#1 Jon Nelson <jnelson+pgsql@jamponi.net>
1 attachment

Greetings -hackers:

I have worked up a patch for PostgreSQL that elides duplicate tuples
during an external sort. The primary use case is when sorted input is
being used to feed a DISTINCT operation. The idea is to throw out tuples
that compare as identical whenever it is convenient, on the assumption
that even a single I/O is more expensive than some number of (potentially
extra) comparisons. This is where a cost model would come in; none has
been implemented yet. This patch is a work in progress.

Being very new to hacking PostgreSQL, I've probably done a bunch of
things wrong, but I've tried to test it thoroughly.

A rough summary of the patch follows:

- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
  distinct columns is equal to the number of sort columns (and both are
  greater than zero).
- in createplan.c, eliding duplicate tuples is enabled if we are
  creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
  + a new macro, DISCARDTUP, and a new structure member, discardtup, are
    both defined and operate similarly to COMPARETUP, COPYTUP, etc.
  + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
    throw out the new tuple if it compares as identical to the tuple at
    the top of the heap. Since we're already performing this comparison,
    this is essentially free.
  + in mergeonerun, we may discard a tuple if it compares as identical
    to the *last written tuple*. This is a comparison that did not take
    place before, so it's not free, but it saves a write I/O.
  + We perform the same logic in dumptuples

I have tested this patch with a bunch of different data, and the
results are summarized below.

Test 1:
I used a corpus of approximately 1.5TB of textual data, consisting of
7.4 billion input rows, of which 12.6% are unique. This is a 'real-world'
test.

Before the patch: 44.25 hours using 274GB of temporary files
After the patch: 36.44 hours using 125GB of temporary files
Difference: -7.6 hours, 18% faster

Test 2:
Acquire the unique set of strings from a totally random set that
contains no duplicates. This is a 'worst case'.

Input: 700 million rows, 32GB.
Before: 18,796 seconds.
After: 19,358 seconds
Difference: +562 seconds, *slower* by 3%.
In this particular case, performance varies from slightly faster to slightly
slower across runs, and I'm not sure why. It may be statistical noise; with
no duplicates to elide, every tuple pays an extra comparison during the merge
and dump phases while saving no writes, so a small slowdown is plausible.

Test 3:
Acquire the unique set of integers from a set of 100 million rows, all of
which happen to have the value '1'. This is a 'best case'.

Input: 100 million rows, 3.4GB on-disk
Before: 26.1 seconds using 1.3GB of temporary files
After: 8.9 seconds using 8KB of disk
Difference: -17.2 seconds, 65% faster

Test 4:
Similar to test 3, but using 1 billion rows.
Before: 1450 seconds, 13 GB of temporary files
After: 468 seconds, 8 KB of temporary files
Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issue (modulo rangefuncs grumbling about
changes to pg_settings).

Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of
early September 2013. I have rebased the patch against
PostgreSQL 9.4 HEAD as of 19 January 2014, and re-run tests
2, 3, and 4 with similar results.

Regarding the patch: I've tried to conform to the indentation and style
rules, including using pg_bsd_indent and pgindent, but I've not had
100% success.

The details on test 4:

A simple example, using 1 billion rows all with the value '1' (integer):
This is a fabricated *best case*.

postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=1047226.447..1284922.616 rows=1000000000 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 13685248kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=29.832..450978.694 rows=1000000000 loops=1)
               Output: i
 Total runtime: 1449644.717 ms
(9 rows)

There are some issues that remain.
For example, the following two queries behave very differently (they
*are* different queries, of course).

Compare this:
postgres=# explain ( analyze true, verbose true, costs true, buffers
true, timing true ) select count(distinct i.i) from i ;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=894138.114..894138.115 rows=1 loops=1)
   Output: count(DISTINCT i)
   Buffers: shared hit=102 read=4424683, temp read=1466277 written=1466277
   ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.012..349142.072 rows=1000000000 loops=1)
         Output: i
         Buffers: shared hit=96 read=4424683
 Total runtime: 894138.233 ms
(7 rows)

Time: 894190.392 ms
postgres=#

to this (with this code path enabled):

postgres=# explain ( analyze true, verbose true, costs true, buffers
true, timing true ) select count(1) from (select distinct i.i from i)
as sub ;

 Aggregate  (cost=182583418.28..182583418.29 rows=1 width=0) (actual time=451973.381..451973.417 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=192 read=4424587, temp read=1 written=1
   ->  Unique  (cost=177583418.27..182583418.27 rows=1 width=4) (actual time=451973.370..451973.372 rows=1 loops=1)
         Output: i.i
         Buffers: shared hit=192 read=4424587, temp read=1 written=1
         ->  Sort  (cost=177583418.27..180083418.27 rows=1000000000 width=4) (actual time=451973.366..451973.367 rows=1 loops=1)
               Output: i.i
               Sort Key: i.i
               Sort Method: external sort  Disk: 8kB
               Buffers: shared hit=192 read=4424587, temp read=1 written=1
               ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223587.173 rows=1000000000 loops=1)
                     Output: i.i
                     Buffers: shared hit=192 read=4424587
 Total runtime: 451994.583 ms

(which also logged the following information)
LOG: tuplesort_end: opportunistically elided 999999999 tuples (1864133 extra comparisons): 0 during initial accrual, 998135866 during build, 0 during merge, 1864133 during dump

For comparison purposes, here is the same query but with hash
aggregation enabled:

 Aggregate  (cost=16924779.02..16924779.03 rows=1 width=0) (actual time=559379.525..559379.525 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=224 read=4424555
   ->  HashAggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=559379.513..559379.513 rows=1 loops=1)
         Output: i.i
         Group Key: i.i
         Buffers: shared hit=224 read=4424555
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223704.468 rows=1000000000 loops=1)
               Output: i.i
               Buffers: shared hit=224 read=4424555
 Total runtime: 559379.608 ms
(11 rows)

A note: this work was supported by my employer, Renesys Corporation (
http://www.renesys.com/ ).

The patch was formatted using 'unified' diff, per
http://wiki.postgresql.org/wiki/Submitting_a_Patch

--
Jon

Attachments:

patch001.diff (text/plain; charset=US-ASCII)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8eae43d..cfdab17 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2701,7 +2701,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator,
 											InvalidOid, false,
 											maintenance_work_mem,
-											false);
+											false, false);
 	state.htups = state.itups = state.tups_inserted = 0;
 
 	(void) index_bulk_delete(&ivinfo, NULL,
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7e4bca5..2c09dec 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -115,6 +115,7 @@
 #include "utils/tuplesort.h"
 #include "utils/datum.h"
 
+extern bool trace_sort;
 
 /*
  * AggStatePerAggData - per-aggregate working state for the Agg scan
@@ -343,6 +344,15 @@ initialize_aggregates(AggState *aggstate,
 		 */
 		if (peraggstate->numSortCols > 0)
 		{
+			bool		elideDuplicates = (peraggstate->numDistinctCols == peraggstate->numSortCols) && peraggstate->numDistinctCols > 0;
+
+#ifdef TRACE_SORT
+			if (trace_sort)
+
+				elog(LOG, "initialize_aggregates: elideDuplicates = %s, numDistinctCols = %d, numSortCols = %d",
+					 elideDuplicates ? "true" : "false", peraggstate->numDistinctCols, peraggstate->numSortCols);
+#endif
+
 			/*
 			 * In case of rescan, maybe there could be an uncompleted sort
 			 * operation?  Clean it up if so.
@@ -361,14 +371,14 @@ initialize_aggregates(AggState *aggstate,
 									  peraggstate->sortOperators[0],
 									  peraggstate->sortCollations[0],
 									  peraggstate->sortNullsFirst[0],
-									  work_mem, false) :
+									  work_mem, false, elideDuplicates) :
 				tuplesort_begin_heap(peraggstate->evaldesc,
 									 peraggstate->numSortCols,
 									 peraggstate->sortColIdx,
 									 peraggstate->sortOperators,
 									 peraggstate->sortCollations,
 									 peraggstate->sortNullsFirst,
-									 work_mem, false);
+									 work_mem, false, elideDuplicates);
 		}
 
 		/*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b88571b..3afe8b3 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -89,7 +89,8 @@ ExecSort(SortState *node)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  node->randomAccess);
+											  node->randomAccess,
+											  plannode->elideDuplicates);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 184d37a..1b5aaee 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1063,6 +1063,11 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path)
 			groupColPos++;
 		}
 		plan = (Plan *) make_sort_from_sortclauses(root, sortList, subplan);
+		{
+			Sort	   *s = (Sort *) plan;
+
+			s->elideDuplicates = true;
+		}
 		plan = (Plan *) make_unique(plan, sortList);
 	}
 
@@ -3763,6 +3768,7 @@ make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
 	node->sortOperators = sortOperators;
 	node->collations = collations;
 	node->nullsFirst = nullsFirst;
+	node->elideDuplicates = false;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 35bda67..fe01511 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1889,6 +1889,11 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 															   result_plan,
 															current_pathkeys,
 															   -1.0);
+				{
+					Sort	   *s = (Sort *) result_plan;
+
+					s->elideDuplicates = true;
+				}
 			}
 
 			result_plan = (Plan *) make_unique(result_plan,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index e5324e2..724f894 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -280,13 +280,13 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortOperators,
 												   qstate->sortCollations,
 												   qstate->sortNullsFirsts,
-												   work_mem, false);
+												   work_mem, false, false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
 													qstate->sortCollation,
 													qstate->sortNullsFirst,
-													work_mem, false);
+													work_mem, false, false);
 
 	osastate->number_of_rows = 0;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1217098..f5d616d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -130,6 +130,7 @@ extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
 extern bool SSLPreferServerCiphers;
 
+extern bool enable_opportunistic_tuple_elide;
 #ifdef TRACE_SORT
 extern bool trace_sort;
 #endif
@@ -733,6 +734,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_opportunistic_tuple_elide", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the sort code to opportunistically elide tuples."),
+			NULL
+		},
+		&enable_opportunistic_tuple_elide,
+		false,
+		NULL, NULL, NULL
+	},
+	{
 		{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of materialization."),
 			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 8b520c1..047ac16 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -131,6 +131,7 @@ bool		trace_sort = false;
 #ifdef DEBUG_BOUNDED_SORT
 bool		optimize_bounded_sort = true;
 #endif
+bool		enable_opportunistic_tuple_elide = false;
 
 
 /*
@@ -209,6 +210,17 @@ struct Tuplesortstate
 	bool		randomAccess;	/* did caller request random access? */
 	bool		bounded;		/* did caller specify a maximum number of
 								 * tuples to return? */
+
+	int			elideDuplicates;
+	int64		tuplesElided;
+	int64		tuplesElidedDuringInitial;
+	int64		tuplesElidedDuringBuild;
+	int64		tuplesElidedDuringMerge;
+	int64		tuplesElidedDuringDump;
+	int64		extraComparisons;
+	bool		have_tup;
+	SortTuple	last_dumped_tup;
+
 	bool		boundUsed;		/* true if we made use of a bounded heap */
 	int			bound;			/* if bounded, the maximum number of tuples */
 	int64		availMem;		/* remaining memory available, in bytes */
@@ -248,6 +260,15 @@ struct Tuplesortstate
 										 SortTuple *stup);
 
 	/*
+	 * Function to discard a previously-stored tuple. pfree() the out-of-line
+	 * data (not the SortTuple struct!), and increase state->availMem by the
+	 * amount of memory space thereby released.
+	 */
+	void		(*discardtup) (Tuplesortstate *state, int tapenum,
+										   SortTuple *stup);
+
+
+	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  Create a palloc'd copy,
 	 * initialize tuple/datum1/isnull1 in the target SortTuple struct, and
@@ -394,6 +415,7 @@ struct Tuplesortstate
 #define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
+#define DISCARDTUP(state,tape,stup) ((*(state)->discardtup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
 #define REVERSEDIRECTION(state) ((*(state)->reversedirection) (state))
 #define LACKMEM(state)		((state)->availMem < 0)
@@ -449,7 +471,7 @@ struct Tuplesortstate
 	} while(0)
 
 
-static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
+static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess, bool elideDuplicates);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
@@ -471,6 +493,8 @@ static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
 static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
+static void discardtup_heap(Tuplesortstate *state, int tapenum,
+				SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 			 int tapenum, unsigned int len);
 static void reversedirection_heap(Tuplesortstate *state);
@@ -479,6 +503,8 @@ static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
 static void writetup_cluster(Tuplesortstate *state, int tapenum,
 				 SortTuple *stup);
+static void discardtup_cluster(Tuplesortstate *state, int tapenum,
+				   SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int len);
 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
@@ -488,6 +514,8 @@ static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
 static void writetup_index(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
+static void discardtup_index(Tuplesortstate *state, int tapenum,
+				 SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
 static void reversedirection_index_btree(Tuplesortstate *state);
@@ -497,6 +525,8 @@ static int comparetup_datum(const SortTuple *a, const SortTuple *b,
 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
 static void writetup_datum(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
+static void discardtup_datum(Tuplesortstate *state, int tapenum,
+				 SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
 static void reversedirection_datum(Tuplesortstate *state);
@@ -532,7 +562,7 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
  */
 
 static Tuplesortstate *
-tuplesort_begin_common(int workMem, bool randomAccess)
+tuplesort_begin_common(int workMem, bool randomAccess, bool elideDuplicates)
 {
 	Tuplesortstate *state;
 	MemoryContext sortcontext;
@@ -564,6 +594,28 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
 	state->bounded = false;
+
+	if (!enable_opportunistic_tuple_elide && elideDuplicates)
+	{
+		elog(LOG, "tuplesort_begin_common: enable_opportunistic_tuple_elide is false but elideDuplicates is true. Changing to false.");
+		state->elideDuplicates = false;
+	}
+	else
+	{
+		state->elideDuplicates = elideDuplicates;
+		if (trace_sort)
+		{
+			elog(LOG, "tuplesort_begin_common: elideDuplicates: %s", elideDuplicates ? "true" : "false");
+		}
+	}
+	state->tuplesElided = 0;
+	state->tuplesElidedDuringInitial = 0;
+	state->tuplesElidedDuringBuild = 0;
+	state->tuplesElidedDuringMerge = 0;
+	state->tuplesElidedDuringDump = 0;
+	state->extraComparisons = 0;
+	state->have_tup = false;
+
 	state->boundUsed = false;
 	state->allowedMem = workMem * (int64) 1024;
 	state->availMem = state->allowedMem;
@@ -600,9 +652,9 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess)
+					 int workMem, bool randomAccess, bool elideDuplicates)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, elideDuplicates);
 	MemoryContext oldcontext;
 	int			i;
 
@@ -613,8 +665,8 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-			 "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
-			 nkeys, workMem, randomAccess ? 't' : 'f');
+			 "tuplesort_begin_heap: nkeys = %d, workMem = %d, randomAccess = %c, elideDuplicates = %s",
+			 nkeys, workMem, randomAccess ? 't' : 'f', state->elideDuplicates ? "true" : "false");
 #endif
 
 	state->nKeys = nkeys;
@@ -628,6 +680,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->comparetup = comparetup_heap;
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
+	state->discardtup = discardtup_heap;
 	state->readtup = readtup_heap;
 	state->reversedirection = reversedirection_heap;
 
@@ -664,7 +717,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
 						int workMem, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false);
 	MemoryContext oldcontext;
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
@@ -674,7 +727,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-			 "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+			 "tuplesort_begin_cluster: nkeys = %d, workMem = %d, randomAccess = %c",
 			 RelationGetNumberOfAttributes(indexRel),
 			 workMem, randomAccess ? 't' : 'f');
 #endif
@@ -690,6 +743,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	state->comparetup = comparetup_cluster;
 	state->copytup = copytup_cluster;
 	state->writetup = writetup_cluster;
+	state->discardtup = discardtup_cluster;
 	state->readtup = readtup_cluster;
 	state->reversedirection = reversedirection_index_btree;
 
@@ -726,7 +780,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							bool enforceUnique,
 							int workMem, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false);
 	MemoryContext oldcontext;
 
 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -750,6 +804,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->comparetup = comparetup_index_btree;
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
+	state->discardtup = discardtup_index;
 	state->readtup = readtup_index;
 	state->reversedirection = reversedirection_index_btree;
 
@@ -769,7 +824,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 						   uint32 hash_mask,
 						   int workMem, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false);
 	MemoryContext oldcontext;
 
 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -787,6 +842,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->comparetup = comparetup_index_hash;
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
+	state->discardtup = discardtup_index;
 	state->readtup = readtup_index;
 	state->reversedirection = reversedirection_index_hash;
 
@@ -803,9 +859,9 @@ tuplesort_begin_index_hash(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag,
-					  int workMem, bool randomAccess)
+					  int workMem, bool randomAccess, bool elideDuplicates)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, elideDuplicates);
 	MemoryContext oldcontext;
 	int16		typlen;
 	bool		typbyval;
@@ -815,8 +871,8 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 #ifdef TRACE_SORT
 	if (trace_sort)
 		elog(LOG,
-			 "begin datum sort: workMem = %d, randomAccess = %c",
-			 workMem, randomAccess ? 't' : 'f');
+			 "tuplesort_begin_datum: workMem = %d, randomAccess = %c, elideDuplicates = %s",
+			 workMem, randomAccess ? 't' : 'f', state->elideDuplicates ? "true" : "false");
 #endif
 
 	state->nKeys = 1;			/* always a one-column sort */
@@ -830,6 +886,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->comparetup = comparetup_datum;
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
+	state->discardtup = discardtup_datum;
 	state->readtup = readtup_datum;
 	state->reversedirection = reversedirection_datum;
 
@@ -886,6 +943,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 
 	state->bounded = true;
 	state->bound = (int) bound;
+
+	if (state->elideDuplicates)
+	{
+		elog(LOG, "tuplesort_set_bound: elideDuplicates is true but this is a bounded sort. Changing to false.");
+		state->elideDuplicates = false;
+	}
 }
 
 /*
@@ -930,6 +993,16 @@ tuplesort_end(Tuplesortstate *state)
 		else
 			elog(LOG, "internal sort ended, %ld KB used: %s",
 				 spaceUsed, pg_rusage_show(&state->ru_start));
+		if (state->elideDuplicates)
+		{
+			elog(LOG, "tuplesort_end: opportunistically elided %ld tuples " \
+				 "(%ld extra comparisons): %ld during initial accrual, %ld during build, %ld during merge, %ld during dump",
+				 state->tuplesElided, state->extraComparisons,
+				 state->tuplesElidedDuringInitial,
+				 state->tuplesElidedDuringBuild,
+				 state->tuplesElidedDuringMerge,
+				 state->tuplesElidedDuringDump);
+		}
 	}
 
 	TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
@@ -1297,10 +1370,25 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 			 * this point; see dumptuples.
 			 */
 			Assert(state->memtupcount > 0);
-			if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
-				tuplesort_heap_insert(state, tuple, state->currentRun, true);
-			else
-				tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+
+			{
+				int			ret = COMPARETUP(state, tuple, &state->memtuples[0]);
+
+				if (ret == 0 && state->elideDuplicates)
+				{
+					free_sort_tuple(state, tuple);
+					state->tuplesElided++;
+					state->tuplesElidedDuringBuild++;
+				}
+				else if (ret >= 0)
+				{
+					tuplesort_heap_insert(state, tuple, state->currentRun, true);
+				}
+				else
+				{
+					tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+				}
+			}
 
 			/*
 			 * If we are over the memory limit, dump tuples till we're under.
@@ -1403,6 +1491,10 @@ tuplesort_performsort(Tuplesortstate *state)
 				 pg_rusage_show(&state->ru_start));
 	}
 #endif
+	if (trace_sort && state->elideDuplicates)
+	{
+		elog(LOG, "performsort: opportunistically elided %ld tuples.", state->tuplesElided);
+	}
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -1847,6 +1939,8 @@ inittapes(Tuplesortstate *state)
 	if (trace_sort)
 		elog(LOG, "switching to external sort with %d tapes: %s",
 			 maxTapes, pg_rusage_show(&state->ru_start));
+	if (state->elideDuplicates)
+		elog(LOG, "inittapes: opportunistically elided %ld tuples so far.", state->tuplesElided);
 #endif
 
 	/*
@@ -2124,7 +2218,39 @@ mergeonerun(Tuplesortstate *state)
 		/* write the tuple to destTape */
 		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
-		WRITETUP(state, destTape, &state->memtuples[0]);
+
+		if (state->elideDuplicates && !state->have_tup)
+		{
+			WRITETUP(state, destTape, &state->memtuples[0]);
+			state->last_dumped_tup = state->memtuples[0];
+			/* don't discard it */
+			state->have_tup = true;
+		}
+		else
+		{						/* have_tup || !elide duplicates */
+			if (state->elideDuplicates)
+			{
+				if (COMPARETUP(state, &state->last_dumped_tup, &state->memtuples[0]) == 0)
+				{
+					DISCARDTUP(state, destTape, &state->memtuples[0]);
+					state->tuplesElided++;
+					state->tuplesElidedDuringMerge++;
+				}
+				else
+				{
+					WRITETUP(state, destTape, &state->memtuples[0]);
+					DISCARDTUP(state, destTape, &state->last_dumped_tup);
+					state->last_dumped_tup = state->memtuples[0];
+				}
+				state->extraComparisons++;
+			}
+			else
+			{
+				WRITETUP(state, destTape, &state->memtuples[0]);
+				DISCARDTUP(state, destTape, &state->memtuples[0]);
+			}
+		}
+
 		/* writetup adjusted total free space, now fix per-tape space */
 		spaceFreed = state->availMem - priorAvail;
 		state->mergeavailmem[srcTape] += spaceFreed;
@@ -2150,6 +2276,12 @@ mergeonerun(Tuplesortstate *state)
 		state->mergeavailslots[srcTape]++;
 	}
 
+	if (state->elideDuplicates && state->have_tup)
+	{
+		DISCARDTUP(state, state->tp_tapenum[state->destTape], &state->last_dumped_tup);
+		state->have_tup = false;
+	}
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape, and increment its count of real runs.
@@ -2159,8 +2291,14 @@ mergeonerun(Tuplesortstate *state)
 
 #ifdef TRACE_SORT
 	if (trace_sort)
+	{
 		elog(LOG, "finished %d-way merge step: %s", state->activeTapes,
 			 pg_rusage_show(&state->ru_start));
+		if (state->elideDuplicates)
+		{
+			elog(LOG, "mergeonerun: opportunistically elided %ld tuples.", state->tuplesElided);
+		}
+	}
 #endif
 }
 
@@ -2370,9 +2508,42 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		 * Dump the heap's frontmost entry, and sift up to remove it from the
 		 * heap.
 		 */
+		int			destTape = state->tp_tapenum[state->destTape];
+
 		Assert(state->memtupcount > 0);
-		WRITETUP(state, state->tp_tapenum[state->destTape],
-				 &state->memtuples[0]);
+
+		if (state->elideDuplicates && !state->have_tup)
+		{
+			WRITETUP(state, destTape, &state->memtuples[0]);
+			state->last_dumped_tup = state->memtuples[0];
+			/* don't discard it */
+			state->have_tup = true;
+		}
+		else
+		{						/* have_tup || !elide duplicates */
+			if (state->elideDuplicates)
+			{
+				if (COMPARETUP(state, &state->last_dumped_tup, &state->memtuples[0]) == 0)
+				{
+					DISCARDTUP(state, destTape, &state->memtuples[0]);
+					state->tuplesElided++;
+					state->tuplesElidedDuringDump++;
+				}
+				else
+				{
+					WRITETUP(state, destTape, &state->memtuples[0]);
+					DISCARDTUP(state, destTape, &state->last_dumped_tup);
+					state->last_dumped_tup = state->memtuples[0];
+				}
+				state->extraComparisons++;
+			}
+			else
+			{
+				WRITETUP(state, destTape, &state->memtuples[0]);
+				DISCARDTUP(state, destTape, &state->memtuples[0]);
+			}
+		}
+
 		tuplesort_heap_siftup(state, true);
 
 		/*
@@ -2389,10 +2560,16 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 
 #ifdef TRACE_SORT
 			if (trace_sort)
+			{
 				elog(LOG, "finished writing%s run %d to tape %d: %s",
 					 (state->memtupcount == 0) ? " final" : "",
 					 state->currentRun, state->destTape,
 					 pg_rusage_show(&state->ru_start));
+				if (state->elideDuplicates)
+				{
+					elog(LOG, "dumptuples: opportunistically elided %ld tuples.", state->tuplesElided);
+				}
+			}
 #endif
 
 			/*
@@ -2404,6 +2581,16 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 			selectnewtape(state);
 		}
 	}
+
+	if (alltuples)
+	{
+		/* last run */
+		if (state->elideDuplicates && state->have_tup)
+		{
+			DISCARDTUP(state, state->tp_tapenum[state->destTape], &state->last_dumped_tup);
+			state->have_tup = false;
+		}
+	}
 }
 
 /*
@@ -2935,6 +3122,12 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 	if (state->randomAccess)	/* need trailing length word? */
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
+}
+
+static void
+discardtup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
+	MinimalTuple tuple = (MinimalTuple) stup->tuple;
 
 	FREEMEM(state, GetMemoryChunkSpace(tuple));
 	heap_free_minimal_tuple(tuple);
@@ -3123,6 +3316,13 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 	if (state->randomAccess)	/* need trailing length word? */
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
+}
+
+
+static void
+discardtup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
+	HeapTuple	tuple = (HeapTuple) stup->tuple;
 
 	FREEMEM(state, GetMemoryChunkSpace(tuple));
 	heap_freetuple(tuple);
@@ -3365,6 +3565,12 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 	if (state->randomAccess)	/* need trailing length word? */
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
+}
+
+static void
+discardtup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
+	IndexTuple	tuple = (IndexTuple) stup->tuple;
 
 	FREEMEM(state, GetMemoryChunkSpace(tuple));
 	pfree(tuple);
@@ -3463,11 +3669,17 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 	if (state->randomAccess)	/* need trailing length word? */
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
+}
+
 
+static void
+discardtup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
 	if (stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 		pfree(stup->tuple);
+		stup->tuple = NULL;
 	}
 }
 
@@ -3524,4 +3736,5 @@ free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
 {
 	FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 	pfree(stup->tuple);
+	stup->tuple = NULL;
 }
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 846c31a..34de503 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -913,6 +913,7 @@ typedef struct SortGroupClause
 	Oid			sortop;			/* the ordering operator ('<' op), or 0 */
 	bool		nulls_first;	/* do NULLs come before normal values? */
 	bool		hashable;		/* can eqop be implemented by hashing? */
+	bool		elideDuplicates;
 } SortGroupClause;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 38c039c..78ccc72 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -586,6 +586,7 @@ typedef struct Sort
 	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
 	Oid		   *collations;		/* OIDs of collations */
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
+	bool		elideDuplicates;
 } Sort;
 
 /* ---------------
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 05445f0..bdb73c6 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -62,7 +62,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess);
+					 int workMem, bool randomAccess, bool elideDuplicates);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
 						int workMem, bool randomAccess);
@@ -77,7 +77,7 @@ extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag,
-					  int workMem, bool randomAccess);
+					  int workMem, bool randomAccess, bool elideDuplicates);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
 
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jon Nelson (#1)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

Jon Nelson <jnelson+pgsql@jamponi.net> writes:

> A rough summary of the patch follows:
>
> - a GUC variable enables or disables this capability
> - in nodeAgg.c, eliding duplicate tuples is enabled if the number of
>   distinct columns is equal to the number of sort columns (and both are
>   greater than zero).
> - in createplan.c, eliding duplicate tuples is enabled if we are
>   creating a unique plan which involves sorting first
> - ditto planner.c
> - all of the remaining changes are in tuplesort.c, which consist of:
>   + a new macro, DISCARDTUP and a new structure member, discardtup, are
>     both defined and operate similar to COMPARETUP, COPYTUP, etc...
>   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
>     throw out the new tuple if it compares as identical to the tuple at
>     the top of the heap. Since we're already performing this comparison,
>     this is essentially free.
>   + in mergeonerun, we may discard a tuple if it compares as identical
>     to the *last written tuple*. This is a comparison that did not take
>     place before, so it's not free, but it saves a write I/O.
>   + We perform the same logic in dumptuples

[ raised eyebrow ... ] And what happens if the planner drops the
unique step and then the sort doesn't actually go to disk?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Tom Lane (#2)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

On Tue, Jan 21, 2014 at 9:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>
>> A rough summary of the patch follows:
>>
>> - a GUC variable enables or disables this capability
>> - in nodeAgg.c, eliding duplicate tuples is enabled if the number of
>>   distinct columns is equal to the number of sort columns (and both are
>>   greater than zero).
>> - in createplan.c, eliding duplicate tuples is enabled if we are
>>   creating a unique plan which involves sorting first
>> - ditto planner.c
>> - all of the remaining changes are in tuplesort.c, which consist of:
>>   + a new macro, DISCARDTUP and a new structure member, discardtup, are
>>     both defined and operate similar to COMPARETUP, COPYTUP, etc...
>>   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
>>     throw out the new tuple if it compares as identical to the tuple at
>>     the top of the heap. Since we're already performing this comparison,
>>     this is essentially free.
>>   + in mergeonerun, we may discard a tuple if it compares as identical
>>     to the *last written tuple*. This is a comparison that did not take
>>     place before, so it's not free, but it saves a write I/O.
>>   + We perform the same logic in dumptuples
>
> [ raised eyebrow ... ] And what happens if the planner drops the
> unique step and then the sort doesn't actually go to disk?

I'm not familiar enough with the code to be able to answer your
question with any sort of authority, but I believe that if the state
TSS_BUILDRUNS is never hit, then basically nothing new happens.

--
Jon


#4Jeremy Harris
jgh@wizmail.org
In reply to: Jon Nelson (#1)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

On 22/01/14 03:16, Jon Nelson wrote:

> Greetings -hackers:
>
> I have worked up a patch to PostgreSQL which elides tuples during an
> external sort. The primary use case is when sorted input is being used
> to feed a DISTINCT operation. The idea is to throw out tuples that
> compare as identical whenever it's convenient, predicated on the
> assumption that even a single I/O is more expensive than some number
> of (potentially extra) comparisons. Obviously, this is where a cost
> model comes in, which has not been implemented. This patch is a
> work-in-progress.

Dedup-in-sort is also done by my WIP internal merge sort, and
extended (in much the same ways as Jon's) to the external merge.

https://github.com/j47996/pgsql_sorb

I've not done a cost model either, but the dedup capability is
exposed from tuplesort.c to the executor, and downstream Unique
nodes are removed.

I've not yet worked out how to eliminate upstream HashAgg nodes,
which testing results suggest would be worthwhile.

--
Cheers,
Jeremy


#5Jeremy Harris
jgh@wizmail.org
In reply to: Tom Lane (#2)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

On 22/01/14 03:53, Tom Lane wrote:

> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>
>> A rough summary of the patch follows:
>>
>> - a GUC variable enables or disables this capability
>> - in nodeAgg.c, eliding duplicate tuples is enabled if the number of
>>   distinct columns is equal to the number of sort columns (and both are
>>   greater than zero).
>> - in createplan.c, eliding duplicate tuples is enabled if we are
>>   creating a unique plan which involves sorting first
>> - ditto planner.c
>> - all of the remaining changes are in tuplesort.c, which consist of:
>>   + a new macro, DISCARDTUP and a new structure member, discardtup, are
>>     both defined and operate similar to COMPARETUP, COPYTUP, etc...
>>   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
>>     throw out the new tuple if it compares as identical to the tuple at
>>     the top of the heap. Since we're already performing this comparison,
>>     this is essentially free.
>>   + in mergeonerun, we may discard a tuple if it compares as identical
>>     to the *last written tuple*. This is a comparison that did not take
>>     place before, so it's not free, but it saves a write I/O.
>>   + We perform the same logic in dumptuples
>
> [ raised eyebrow ... ] And what happens if the planner drops the
> unique step and then the sort doesn't actually go to disk?

I don't think Jon was suggesting that the planner drop the unique step.
--
Cheers,
Jeremy


#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeremy Harris (#5)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

Jeremy Harris <jgh@wizmail.org> writes:

> On 22/01/14 03:53, Tom Lane wrote:
>
>> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>>
>>> - in createplan.c, eliding duplicate tuples is enabled if we are
>>>   creating a unique plan which involves sorting first
>>
>> [ raised eyebrow ... ] And what happens if the planner drops the
>> unique step and then the sort doesn't actually go to disk?
>
> I don't think Jon was suggesting that the planner drop the unique step.

Hm, OK, maybe I misread what he said there. Still, if we've told
tuplesort to remove duplicates, why shouldn't we expect it to have
done the job? Passing the data through a useless Unique step is
not especially cheap.

regards, tom lane


#7Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Tom Lane (#6)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

On Wed, Jan 22, 2014 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jeremy Harris <jgh@wizmail.org> writes:
>
>> On 22/01/14 03:53, Tom Lane wrote:
>>
>>> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>>>
>>>> - in createplan.c, eliding duplicate tuples is enabled if we are
>>>>   creating a unique plan which involves sorting first
>>>
>>> [ raised eyebrow ... ] And what happens if the planner drops the
>>> unique step and then the sort doesn't actually go to disk?
>>
>> I don't think Jon was suggesting that the planner drop the unique step.
>
> Hm, OK, maybe I misread what he said there. Still, if we've told
> tuplesort to remove duplicates, why shouldn't we expect it to have
> done the job? Passing the data through a useless Unique step is
> not especially cheap.

That's correct - I do not propose to drop the unique step. Duplicates
are only dropped if it's convenient to do so. In one case, it's a
zero-cost drop (no extra comparison is made). In most other cases, an
extra comparison is made, typically right before writing a tuple to
tape. If it compares as identical to the previously-written tuple,
it's thrown out instead of being written.

The output of the modified code is still sorted, still *might* (and in
most cases, probably will) contain duplicates, but will (probably)
contain fewer duplicates.

--
Jon


#8Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jon Nelson (#7)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

What are my next steps here?

I believe the concept is sound, the code appears to work and
doesn't crash, and the result does show a performance win in most
cases (sometimes a big win). It's also incomplete, at least insofar
as it doesn't interface with the cost models at all, etc...

--
Jon


#9Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jon Nelson (#8)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

What - if anything - do I need to do to get this on the commitfest
list for the next commitfest?


#10Michael Paquier
michael.paquier@gmail.com
In reply to: Jon Nelson (#9)
Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

On Wed, Feb 5, 2014 at 10:33 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

> What - if anything - do I need to do to get this on the commitfest
> list for the next commitfest?

The list of instructions is here:
http://wiki.postgresql.org/wiki/Submitting_a_Patch#Patch_submission
Then the next commit fest (#1 for 9.5) will be in June and is here:
https://commitfest.postgresql.org/action/commitfest_view?id=22
Note that you will need an account on postgresql.org to register your patch.
Regards,
--
Michael
