Pg18 Recursive Crash
Apologies if this is already reported, but there’s a crasher in recursive queries at the head of the current development that happened to be exercised by our regression suite. Here is a core-only reproduction.
CREATE TABLE foo (id integer, x integer, y integer);
INSERT INTO foo VALUES (1, 0, 1);
INSERT INTO foo VALUES (2, 1, 2);
INSERT INTO foo VALUES (3, 2, 3);
WITH RECURSIVE path (id, x, y) AS (
SELECT id, x, y FROM foo WHERE id = 1
UNION
SELECT foo.id, foo.x, foo.y
FROM path, foo
WHERE path.y = foo.x
)
SELECT 'crash', id, x, y FROM path;
Thanks!
ATB,
P
Thanks for the report!
On Mon, Dec 16, 2024 at 09:50:39AM -0800, Paul Ramsey wrote:
Apologies if this is already reported, but there's a crasher in recursive
queries at the head of the current development that happened to be
exercised by our regression suite. Here is a core-only reproduction.
CREATE TABLE foo (id integer, x integer, y integer);
INSERT INTO foo VALUES (1, 0, 1);
INSERT INTO foo VALUES (2, 1, 2);
INSERT INTO foo VALUES (3, 2, 3);
WITH RECURSIVE path (id, x, y) AS (
SELECT id, x, y FROM foo WHERE id = 1
UNION
SELECT foo.id, foo.x, foo.y
FROM path, foo
WHERE path.y = foo.x
)
SELECT 'crash', id, x, y FROM path;
git-bisect is pointing me to https://postgr.es/c/0f57382. Here is the
trace I'm seeing:
TRAP: failed Assert("op->d.fetch.kind == slot->tts_ops"), File: "../postgresql/src/backend/executor/execExprInterp.c", Line: 2244, PID: 5031
0 postgres 0x000000010112d068 ExceptionalCondition + 108
1 postgres 0x0000000100e54f04 ExecInterpExpr + 604
2 postgres 0x0000000100e5bd50 LookupTupleHashEntry + 116
3 postgres 0x0000000100e8e580 ExecRecursiveUnion + 140
4 postgres 0x0000000100e770c0 CteScanNext + 248
5 postgres 0x0000000100e66e9c ExecScan + 124
6 postgres 0x0000000100e5dc9c standard_ExecutorRun + 304
7 postgres 0x0000000100ffe6dc PortalRunSelect + 236
8 postgres 0x0000000100ffe2f8 PortalRun + 492
9 postgres 0x0000000100ffd298 exec_simple_query + 1276
10 postgres 0x0000000100ffaf24 PostgresMain + 3632
11 postgres 0x0000000100ff631c BackendInitialize + 0
12 postgres 0x0000000100f5d638 PgArchShmemSize + 0
13 postgres 0x0000000100f61518 ServerLoop + 4300
14 postgres 0x0000000100f5fc60 InitProcessGlobals + 0
15 postgres 0x0000000100eb1e7c help + 0
16 dyld 0x000000019ed3b154 start + 2476
--
nathan
On Tue, 17 Dec 2024 at 07:10, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Dec 16, 2024 at 09:50:39AM -0800, Paul Ramsey wrote:
CREATE TABLE foo (id integer, x integer, y integer);
INSERT INTO foo VALUES (1, 0, 1);
INSERT INTO foo VALUES (2, 1, 2);
INSERT INTO foo VALUES (3, 2, 3);
WITH RECURSIVE path (id, x, y) AS (
SELECT id, x, y FROM foo WHERE id = 1
UNION
SELECT foo.id, foo.x, foo.y
FROM path, foo
WHERE path.y = foo.x
)
SELECT 'crash', id, x, y FROM path;
git-bisect is pointing me to https://postgr.es/c/0f57382. Here is the
trace I'm seeing:
Thanks for the report and bisection. Looking now.
David
On Tue, 17 Dec 2024 at 07:10, Nathan Bossart <nathandbossart@gmail.com> wrote:
git-bisect is pointing me to https://postgr.es/c/0f57382. Here is the
trace I'm seeing:
TRAP: failed Assert("op->d.fetch.kind == slot->tts_ops"), File: "../postgresql/src/backend/executor/execExprInterp.c", Line: 2244, PID: 5031
0 postgres 0x000000010112d068 ExceptionalCondition + 108
1 postgres 0x0000000100e54f04 ExecInterpExpr + 604
2 postgres 0x0000000100e5bd50 LookupTupleHashEntry + 116
3 postgres 0x0000000100e8e580 ExecRecursiveUnion + 140
This is caused by me being overly optimistic about getting rid of the
slot deformation step in the ExprState evaluation for hashing. I
must not have fully understood how hash table lookups are
performed for grouping requirements and mistakenly thought it was ok
to use &TTSOpsMinimalTuple for hashing the same as the equality code
is doing in ExecBuildGroupingEqual(). It turns out the ExprState built
by ExecBuildGroupingEqual() always uses slots with minimal tuples due
to how LookupTupleHashEntry_internal() always creates a hash table
entry without a key and then does ExecCopySlotMinimalTuple() to put
the tuple into the hash table after the hash bucket is reserved.
That's not the case for generating the hash value as that uses
whichever slot is passed to LookupTupleHashEntry().
To fix the reported crash, I propose adding a new parameter to
BuildTupleHashTableExt() to allow passing of the TupleTableSlotOps.
Really, I could just set scratch.d.fetch.kind to NULL in
ExecBuildHash32FromAttrs() so that the hashing ExprState is always
built with a deform step, but there are many cases (e.g. nodeSubplan.c)
where a virtual slot is always used, so it would be nice to have some
way to pass the TupleTableSlotOps in for cases where it's known so
that the deform step can be eliminated when it's not needed.
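The asymmetry described above can be sketched outside PostgreSQL. The following is a hedged Python analogy, not executor code: `deform`, `probe_form`, and the dict-based table are illustrative stand-ins for slot deformation, the proposed TupleTableSlotOps parameter, and the tuple hash table.

```python
# A hedged analogy, not PostgreSQL executor code: a tuple hash table
# that always *stores* entries in one normalized form (the role of
# minimal tuples), but is *probed* with rows arriving in whatever form
# the caller has (the role of the slot type / TupleTableSlotOps).

def deform(row):
    """Generic step: turn any supported row form into a plain tuple."""
    if isinstance(row, dict):           # a "heap"-like form
        return tuple(row[k] for k in sorted(row))
    return tuple(row)                   # a "virtual", already-deformed form

class TupleHashTable:
    def __init__(self, probe_form=None):
        # probe_form plays the role of the proposed inputOps parameter:
        # when the caller promises every probe is already deformed, the
        # hashing path can skip the generic deform step.
        self.skip_deform = (probe_form == "virtual")
        self.table = {}

    def lookup_or_insert(self, row):
        # The hash key must be computed from the *probe's* form, so the
        # deform step can only be skipped when the caller said so.
        key = tuple(row) if self.skip_deform else deform(row)
        # Entries are stored normalized, which is why the equality side
        # can safely assume one fixed ("minimal") form for stored tuples.
        return self.table.setdefault(key, {"count": 0})

ht = TupleHashTable()                    # probe form unknown: must deform
e1 = ht.lookup_or_insert({"id": 1, "x": 0})
e2 = ht.lookup_or_insert((1, 0))         # same logical row, another form
assert e1 is e2
```

The point of the sketch: hardcoding one probe form (as the original code did with &TTSOpsMinimalTuple) only works if every caller actually probes with that form, which nodeRecursiveunion.c does not.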
The slightly annoying thing here is that the attached patch passes the
TupleTableSlotOps as NULL in nodeSetOp.c. Per nodeAppend.c line 186,
Append does not go to much effort in setting a fixed
TupleTableSlotOps. Really it could loop over all the child plans,
check if they all have fixed slot types of the same type, and then fix its
own resulting slot. For nodeSetOp.c's use case, since the planner
(currently) injects the flag into the target list, it'll always
project and use a virtual slot type. It's maybe worth coming back and
adjusting nodeAppend.c so it works a bit harder to fix its slot type.
I think that's likely for another patch, however. Tom is also
currently working on nodeSetOp.c to change how all this works so it
no longer uses the flags method to determine the outer and inner
sides.
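For what it's worth, the "loop over the child plans" idea is simple enough to sketch. This is a hedged Python sketch, not nodeAppend.c code; child plans are modeled as dicts and `fixed_slot_type` is an illustrative stand-in for a fixed TupleTableSlotOps:

```python
# Hedged sketch of the idea floated for nodeAppend.c (not actual
# PostgreSQL code): an Append-like node could advertise a fixed result
# slot type only when every child plan has the same fixed type.

def common_fixed_slot_type(children):
    """Return the slot type shared by all children, or None if any
    child's type is unknown/variable or the types differ."""
    common = None
    for child in children:
        t = child.get("fixed_slot_type")  # None means "not fixed"
        if t is None:
            return None
        if common is None:
            common = t
        elif t != common:
            return None
    return common

assert common_fixed_slot_type(
    [{"fixed_slot_type": "virtual"}, {"fixed_slot_type": "virtual"}]
) == "virtual"
assert common_fixed_slot_type(
    [{"fixed_slot_type": "virtual"}, {"fixed_slot_type": "minimal"}]
) is None
```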
I plan to push the attached patch soon.
David
Attachments:
v1-0001-Fix-incorrect-slot-type-in-BuildTupleHashTableExt.patch (application/octet-stream)
From 19b8ca44a8aeeb9dacb66b3bb77de83773ed7aed Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 18 Dec 2024 10:16:48 +1300
Subject: [PATCH v1] Fix incorrect slot type in BuildTupleHashTableExt
0f5738202 adjusted the execGrouping.c code so it made use of ExprStates to
generate hash values. That commit made a wrong assumption that the slot
type to pass to ExecBuildHash32FromAttrs() is always &TTSOpsMinimalTuple.
That's not the case, as the slot type depends on the one passed to
LookupTupleHashEntry(), which for nodeRecursiveunion.c could be any of
the current slot types.
Here we fix this by adding a new parameter to BuildTupleHashTableExt()
to allow the slot type to be passed in. In the case of nodeSubplan.c
and nodeAgg.c the slot type is always &TTSOpsVirtual, so for both of
those cases, it's beneficial to pass the known slot type as that allows
ExecBuildHash32FromAttrs() to skip adding the tuple deform step to the
resulting ExprState. Another possible fix would have been to have
ExecBuildHash32FromAttrs() set "fetch.kind" to NULL so that
ExecComputeSlotInfo() always determines the EEOP_INNER_FETCHSOME is
required. However, that option isn't favorable as it slows down aggregation
and hashed subplan evaluation due to the extra (needless) deform step.
Thanks to Nathan Bossart for bisecting to find the offending commit
based on Paul's report.
Reported-by: Paul Ramsey <pramsey@cleverelephant.ca>
Discussion: https://postgr.es/m/99F064C1-B3EB-4BE7-97D2-D2A0AA487A71@cleverelephant.ca
---
src/backend/executor/execGrouping.c | 5 ++++-
src/backend/executor/nodeAgg.c | 1 +
src/backend/executor/nodeRecursiveunion.c | 1 +
src/backend/executor/nodeSetOp.c | 1 +
src/backend/executor/nodeSubplan.c | 7 +++++++
src/include/executor/executor.h | 1 +
src/test/regress/expected/with.out | 20 ++++++++++++++++++++
src/test/regress/sql/with.sql | 14 ++++++++++++++
8 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 9a88fc6524..4a8f72305c 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -135,6 +135,7 @@ execTuplesHashPrepare(int numCols,
/*
* Construct an empty TupleHashTable
*
+ * inputOps: slot ops for input hash values, or NULL if unknown or not fixed
* numCols, keyColIdx: identify the tuple fields to use as lookup key
* eqfunctions: equality comparison functions to use
* hashfunctions: datatype-specific hashing functions to use
@@ -154,6 +155,7 @@ execTuplesHashPrepare(int numCols,
TupleHashTable
BuildTupleHashTableExt(PlanState *parent,
TupleDesc inputDesc,
+ const TupleTableSlotOps *inputOps,
int numCols, AttrNumber *keyColIdx,
const Oid *eqfuncoids,
FmgrInfo *hashfunctions,
@@ -225,7 +227,7 @@ BuildTupleHashTableExt(PlanState *parent,
/* build hash ExprState for all columns */
hashtable->tab_hash_expr = ExecBuildHash32FromAttrs(inputDesc,
- &TTSOpsMinimalTuple,
+ inputOps,
hashfunctions,
collations,
numCols,
@@ -274,6 +276,7 @@ BuildTupleHashTable(PlanState *parent,
{
return BuildTupleHashTableExt(parent,
inputDesc,
+ NULL,
numCols, keyColIdx,
eqfuncoids,
hashfunctions,
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 84d33fdebc..53931c8285 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1520,6 +1520,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
perhash->hashtable = BuildTupleHashTableExt(&aggstate->ss.ps,
perhash->hashslot->tts_tupleDescriptor,
+ perhash->hashslot->tts_ops,
perhash->numCols,
perhash->hashGrpColIdxHash,
perhash->eqfuncoids,
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 22e7b83b2e..8f8a20acbd 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -39,6 +39,7 @@ build_hash_table(RecursiveUnionState *rustate)
rustate->hashtable = BuildTupleHashTableExt(&rustate->ps,
desc,
+ NULL,
node->numCols,
node->dupColIdx,
rustate->eqfuncoids,
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index a8ac68b482..b40d81f3ff 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -128,6 +128,7 @@ build_hash_table(SetOpState *setopstate)
setopstate->hashtable = BuildTupleHashTableExt(&setopstate->ps,
desc,
+ NULL,
node->numCols,
node->dupColIdx,
setopstate->eqfuncoids,
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index f26c883c67..78931344a6 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -519,6 +519,11 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
*
* If it's not necessary to distinguish FALSE and UNKNOWN, then we don't
* need to store subplan output rows that contain NULL.
+ *
+ * Because the input slot for each hash table is always the slot resulting
+ * from an ExecProject(), we can use TTSOpsVirtual for the input ops. This
+ * saves a needless fetch inner op step for the hashing ExprState created
+ * in BuildTupleHashTableExt.
*/
MemoryContextReset(node->hashtablecxt);
node->havehashrows = false;
@@ -533,6 +538,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
else
node->hashtable = BuildTupleHashTableExt(node->parent,
node->descRight,
+ &TTSOpsVirtual,
ncols,
node->keyColIdx,
node->tab_eq_funcoids,
@@ -561,6 +567,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
else
node->hashnulls = BuildTupleHashTableExt(node->parent,
node->descRight,
+ &TTSOpsVirtual,
ncols,
node->keyColIdx,
node->tab_eq_funcoids,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 494ec4f2e5..c8e6befca8 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,6 +140,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
MemoryContext tempcxt, bool use_variable_hash_iv);
extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
TupleDesc inputDesc,
+ const TupleTableSlotOps *inputOps,
int numCols, AttrNumber *keyColIdx,
const Oid *eqfuncoids,
FmgrInfo *hashfunctions,
diff --git a/src/test/regress/expected/with.out b/src/test/regress/expected/with.out
index 08cfa5463f..501c9d72e0 100644
--- a/src/test/regress/expected/with.out
+++ b/src/test/regress/expected/with.out
@@ -329,6 +329,26 @@ SELECT * FROM subdepartment ORDER BY name;
1 | 0 | A
(1 row)
+-- exercise the deduplication code of a UNION with mixed input slot types
+WITH RECURSIVE subdepartment AS
+(
+ -- select all columns to prevent projection
+ SELECT id, parent_department, name FROM department WHERE name = 'A'
+ UNION
+ -- hash joins do projection
+ SELECT d.id, d.parent_department, d.name FROM department AS d, subdepartment AS sd
+ WHERE d.parent_department = sd.id
+)
+SELECT * FROM subdepartment ORDER BY name;
+ id | parent_department | name
+----+-------------------+------
+ 1 | 0 | A
+ 2 | 1 | B
+ 3 | 2 | C
+ 4 | 2 | D
+ 6 | 4 | F
+(5 rows)
+
-- inside subqueries
SELECT count(*) FROM (
WITH RECURSIVE t(n) AS (
diff --git a/src/test/regress/sql/with.sql b/src/test/regress/sql/with.sql
index 8f6e6c0b40..b25b4dcd90 100644
--- a/src/test/regress/sql/with.sql
+++ b/src/test/regress/sql/with.sql
@@ -216,6 +216,20 @@ WITH RECURSIVE subdepartment AS
)
SELECT * FROM subdepartment ORDER BY name;
+-- exercise the deduplication code of a UNION with mixed input slot types
+WITH RECURSIVE subdepartment AS
+(
+ -- select all columns to prevent projection
+ SELECT id, parent_department, name FROM department WHERE name = 'A'
+
+ UNION
+
+ -- hash joins do projection
+ SELECT d.id, d.parent_department, d.name FROM department AS d, subdepartment AS sd
+ WHERE d.parent_department = sd.id
+)
+SELECT * FROM subdepartment ORDER BY name;
+
-- inside subqueries
SELECT count(*) FROM (
WITH RECURSIVE t(n) AS (
--
2.34.1
David Rowley <dgrowleyml@gmail.com> writes:
The slightly annoying thing here is that the attached patch passes the
TupleTableSlotOps as NULL in nodeSetOp.c. Per nodeAppend.c line 186,
Append does not go to much effort in setting a fixed
TupleTableSlotOps. Really it could loop over all the child plans,
check if they all have fixed slot types of the same type, and then fix its
own resulting slot. For nodeSetOp.c's use case, since the planner
(currently) injects the flag into the target list, it'll always
project and use a virtual slot type. It's maybe worth coming back and
adjusting nodeAppend.c so it works a bit harder to fix its slot type.
I think that's likely for another patch, however. Tom is also
currently working on nodeSetOps.c to change how all this works so it
no longer uses the flags method to determine the outer and inner
sides.
Yeah, I see no point in putting effort into improving the current
nodeSetOp implementation. There might be a reason to change
nodeAppend as you suggest for other use-cases though.
I plan to push the attached patch soon.
I'll presumably need to rebase my nodeSetOp patch when this goes
in. I'll take a look then at whether the new code can be improved
with this additional feature.
regards, tom lane
On Wed, 18 Dec 2024 at 11:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
project and use a virtual slot type. It's maybe worth coming back and
adjusting nodeAppend.c so it works a bit harder to fix its slot type.
I think that's likely for another patch, however. Tom is also
currently working on nodeSetOps.c to change how all this works so it
no longer uses the flags method to determine the outer and inner
sides.
Yeah, I see no point in putting effort into improving the current
nodeSetOp implementation. There might be a reason to change
nodeAppend as you suggest for other use-cases though.
I'll have a look to see what's possible here. Maybe locally adding
some telemetry output to the regression tests, logging whenever an
ExprState performs a deform operation, would be a good way to compare
runs with and without said patch and see how widespread an improvement
it would bring. I expect it might be most useful for partition-wise
joins, but that'll depend a lot on what operations occur above the Append.
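The kind of one-off telemetry being described might look like the following. A hedged sketch in Python rather than C, purely to illustrate the counting idea; `deform_count` and both functions are invented names, not anything in the tree:

```python
# Illustrative only: wrap a "deform" step with a counter so test runs
# with and without a patch can be compared by how often it fires.
deform_count = 0

def deform(row):
    """Stand-in for the tuple deform step; bumps a global counter."""
    global deform_count
    deform_count += 1
    return tuple(row)

def hash_rows(rows, already_deformed=False):
    """Hash each row, skipping the deform step when the caller
    promises rows are already in deformed (virtual) form."""
    out = []
    for row in rows:
        key = tuple(row) if already_deformed else deform(row)
        out.append(hash(key))
    return out

hash_rows([(1, 2), (3, 4)])                 # generic path: 2 deforms
hash_rows([(1, 2)], already_deformed=True)  # fast path: 0 deforms
assert deform_count == 2
```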
I plan to push the attached patch soon.
I'll presumably need to rebase my nodeSetOp patch when this goes
in. I'll take a look then at whether the new code can be improved
with this additional feature.
I've pushed the patch now.
David
David Rowley <dgrowleyml@gmail.com> writes:
I've pushed the patch now.
So I tried adapting my patch to not make a copy of the input slot,
and it didn't work: I was still getting assertion failures about
the slot not being a MinimalTupleSlot as expected. On investigation
it appears your patch did not fully adjust BuildTupleHashTableExt
for variable input-slot type. You need the attached as well.
I'm not sure why the existing regression tests didn't catch this.
But it may not be worth searching for a test case, because my patch
will be one ...
regards, tom lane
Attachments:
fix-grouping.patch (text/x-diff)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 4a8f72305c..7491e53c03 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -236,9 +236,9 @@ BuildTupleHashTableExt(PlanState *parent,
hash_iv);
/* build comparator for all columns */
- /* XXX: should we support non-minimal tuples for the inputslot? */
hashtable->tab_eq_func = ExecBuildGroupingEqual(inputDesc, inputDesc,
- &TTSOpsMinimalTuple, &TTSOpsMinimalTuple,
+ inputOps,
+ &TTSOpsMinimalTuple,
numCols,
keyColIdx, eqfuncoids, collations,
allow_jit ? parent : NULL);
On Wed, 18 Dec 2024 at 14:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So I tried adapting my patch to not make a copy of the input slot,
and it didn't work: I was still getting assertion failures about
the slot not being a MinimalTupleSlot as expected. On investigation
it appears your patch did not fully adjust BuildTupleHashTableExt
for variable input-slot type. You need the attached as well.
Do you have a test case in master that triggers a problem here? Your
patch adjusts code that existed prior to d96d1d515, so I'm confused as
to why your patch is needed now when it wasn't before.
If you're only triggering an issue after patching with your setops
patch, are your changes maybe using FindTupleHashEntry() with an
eqcomp that isn't compatible with the 'slot' parameter you're passing
to that function?
David
David Rowley <dgrowleyml@gmail.com> writes:
On Wed, 18 Dec 2024 at 14:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
it appears your patch did not fully adjust BuildTupleHashTableExt
for variable input-slot type. You need the attached as well.
Do you have a test case in master that triggers a problem here?
No, that's what I didn't want to spend time looking for ;-).
But here is a WIP modification of my 0001 patch from the SetOp
improvement thread -- the difference is to lobotomize
setop_load_hash_tuple to just return the input slot. Without
the execGrouping.c changes, it gets assertion failures in the core
regression tests. (My intention had been to remove
setop_load_hash_tuple and then adjust build_hash_table to
pass a non-null inputOps if the two inputs have the same tuple
slot type. But I got stuck on this step.)
regards, tom lane
Attachments:
wip-setop-improvements.patch (text/x-diff)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 4a8f72305c..7491e53c03 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -236,9 +236,9 @@ BuildTupleHashTableExt(PlanState *parent,
hash_iv);
/* build comparator for all columns */
- /* XXX: should we support non-minimal tuples for the inputslot? */
hashtable->tab_eq_func = ExecBuildGroupingEqual(inputDesc, inputDesc,
- &TTSOpsMinimalTuple, &TTSOpsMinimalTuple,
+ inputOps,
+ &TTSOpsMinimalTuple,
numCols,
keyColIdx, eqfuncoids, collations,
allow_jit ? parent : NULL);
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index b40d81f3ff..159db5710f 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -3,29 +3,30 @@
* nodeSetOp.c
* Routines to handle INTERSECT and EXCEPT selection
*
- * The input of a SetOp node consists of tuples from two relations,
- * which have been combined into one dataset, with a junk attribute added
- * that shows which relation each tuple came from. In SETOP_SORTED mode,
- * the input has furthermore been sorted according to all the grouping
- * columns (ie, all the non-junk attributes). The SetOp node scans each
- * group of identical tuples to determine how many came from each input
- * relation. Then it is a simple matter to emit the output demanded by the
- * SQL spec for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ * The input of a SetOp node consists of two relations (outer and inner)
+ * with identical column sets. In EXCEPT queries the outer relation is
+ * always the left side, while in INTERSECT cases the planner tries to
+ * make the outer relation be the smaller of the two inputs.
*
- * In SETOP_HASHED mode, the input is delivered in no particular order,
- * except that we know all the tuples from one input relation will come before
- * all the tuples of the other. The planner guarantees that the first input
- * relation is the left-hand one for EXCEPT, and tries to make the smaller
- * input relation come first for INTERSECT. We build a hash table in memory
- * with one entry for each group of identical tuples, and count the number of
- * tuples in the group from each relation. After seeing all the input, we
- * scan the hashtable and generate the correct output using those counts.
- * We can avoid making hashtable entries for any tuples appearing only in the
- * second input relation, since they cannot result in any output.
+ * In SETOP_SORTED mode, each input has been sorted according to all the
+ * grouping columns (ie, all the non-junk attributes). The SetOp node
+ * essentially performs a merge join on the grouping columns, except that
+ * it is only interested in counting how many tuples from each input match.
+ * Then it is a simple matter to emit the output demanded by the SQL spec
+ * for INTERSECT, INTERSECT ALL, EXCEPT, or EXCEPT ALL.
+ *
+ * In SETOP_HASHED mode, the inputs are delivered in no particular order.
+ * We read the outer relation and build a hash table in memory with one entry
+ * for each group of identical tuples, counting the number of tuples in the
+ * group. Then we read the inner relation and count the number of tuples
+ * matching each outer group. (We can disregard any tuples appearing only
+ * in the inner relation, since they cannot result in any output.) After
+ * seeing all the input, we scan the hashtable and generate the correct
+ * output using those counts.
*
* This node type is not used for UNION or UNION ALL, since those can be
- * implemented more cheaply (there's no need for the junk attribute to
- * identify the source relation).
+ * implemented more cheaply (there's no need to count the number of
+ * matching tuples).
*
* Note that SetOp does no qual checking nor projection. The delivered
* output tuples are just copies of the first-to-arrive tuple in each
@@ -54,65 +55,30 @@
/*
* SetOpStatePerGroupData - per-group working state
*
- * These values are working state that is initialized at the start of
- * an input tuple group and updated for each input tuple.
- *
- * In SETOP_SORTED mode, we need only one of these structs, and it's kept in
- * the plan state node. In SETOP_HASHED mode, the hash table contains one
- * of these for each tuple group.
+ * In SETOP_SORTED mode, we need only one of these structs, and it's just a
+ * local in setop_retrieve_sorted. In SETOP_HASHED mode, the hash table
+ * contains one of these for each tuple group.
*/
typedef struct SetOpStatePerGroupData
{
- long numLeft; /* number of left-input dups in group */
- long numRight; /* number of right-input dups in group */
-} SetOpStatePerGroupData;
+ int64 numLeft; /* number of left-input dups in group */
+ int64 numRight; /* number of right-input dups in group */
+} SetOpStatePerGroupData;
+
+typedef SetOpStatePerGroupData *SetOpStatePerGroup;
-static TupleTableSlot *setop_retrieve_direct(SetOpState *setopstate);
+static TupleTableSlot *setop_retrieve_sorted(SetOpState *setopstate);
+static void setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+ SetOpState *setopstate);
+static int setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+ SetOpState *setopstate);
+static TupleTableSlot *setop_load_hash_tuple(SetOpState *setopstate,
+ TupleTableSlot *inputslot);
static void setop_fill_hash_table(SetOpState *setopstate);
static TupleTableSlot *setop_retrieve_hash_table(SetOpState *setopstate);
-/*
- * Initialize state for a new group of input values.
- */
-static inline void
-initialize_counts(SetOpStatePerGroup pergroup)
-{
- pergroup->numLeft = pergroup->numRight = 0;
-}
-
-/*
- * Advance the appropriate counter for one input tuple.
- */
-static inline void
-advance_counts(SetOpStatePerGroup pergroup, int flag)
-{
- if (flag)
- pergroup->numRight++;
- else
- pergroup->numLeft++;
-}
-
-/*
- * Fetch the "flag" column from an input tuple.
- * This is an integer column with value 0 for left side, 1 for right side.
- */
-static int
-fetch_tuple_flag(SetOpState *setopstate, TupleTableSlot *inputslot)
-{
- SetOp *node = (SetOp *) setopstate->ps.plan;
- int flag;
- bool isNull;
-
- flag = DatumGetInt32(slot_getattr(inputslot,
- node->flagColIdx,
- &isNull));
- Assert(!isNull);
- Assert(flag == 0 || flag == 1);
- return flag;
-}
-
/*
* Initialize the hash table to empty.
*/
@@ -130,10 +96,10 @@ build_hash_table(SetOpState *setopstate)
desc,
NULL,
node->numCols,
- node->dupColIdx,
+ node->cmpColIdx,
setopstate->eqfuncoids,
setopstate->hashfunctions,
- node->dupCollations,
+ node->cmpCollations,
node->numGroups,
0,
setopstate->ps.state->es_query_cxt,
@@ -218,108 +184,126 @@ ExecSetOp(PlanState *pstate)
return setop_retrieve_hash_table(node);
}
else
- return setop_retrieve_direct(node);
+ return setop_retrieve_sorted(node);
}
/*
* ExecSetOp for non-hashed case
*/
static TupleTableSlot *
-setop_retrieve_direct(SetOpState *setopstate)
+setop_retrieve_sorted(SetOpState *setopstate)
{
PlanState *outerPlan;
- SetOpStatePerGroup pergroup;
- TupleTableSlot *outerslot;
+ PlanState *innerPlan;
TupleTableSlot *resultTupleSlot;
- ExprContext *econtext = setopstate->ps.ps_ExprContext;
/*
* get state info from node
*/
outerPlan = outerPlanState(setopstate);
- pergroup = (SetOpStatePerGroup) setopstate->pergroup;
+ innerPlan = innerPlanState(setopstate);
resultTupleSlot = setopstate->ps.ps_ResultTupleSlot;
/*
- * We loop retrieving groups until we find one we should return
+ * If first time through, establish the invariant that setop_load_group
+ * expects: each side's nextTupleSlot is the next output from the child
+ * plan, or empty if there is no more output from it.
*/
- while (!setopstate->setop_done)
+ if (setopstate->need_init)
{
+ setopstate->need_init = false;
+
+ setopstate->leftInput.nextTupleSlot = ExecProcNode(outerPlan);
+
/*
- * If we don't already have the first tuple of the new group, fetch it
- * from the outer plan.
+ * If the outer relation is empty, then we will emit nothing, and we
+ * don't need to read the inner relation at all.
*/
- if (setopstate->grp_firstTuple == NULL)
+ if (TupIsNull(setopstate->leftInput.nextTupleSlot))
{
- outerslot = ExecProcNode(outerPlan);
- if (!TupIsNull(outerslot))
- {
- /* Make a copy of the first input tuple */
- setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
- }
- else
- {
- /* outer plan produced no tuples at all */
- setopstate->setop_done = true;
- return NULL;
- }
+ setopstate->setop_done = true;
+ return NULL;
}
+ setopstate->rightInput.nextTupleSlot = ExecProcNode(innerPlan);
+
+ /* Set flags that we've not completed either side's group */
+ setopstate->leftInput.needGroup = true;
+ setopstate->rightInput.needGroup = true;
+ }
+
+ /*
+ * We loop retrieving groups until we find one we should return
+ */
+ while (!setopstate->setop_done)
+ {
+ int cmpresult;
+ SetOpStatePerGroupData pergroup;
+
/*
- * Store the copied first input tuple in the tuple table slot reserved
- * for it. The tuple will be deleted when it is cleared from the
- * slot.
+ * Fetch the rest of the current outer group, if we didn't already.
*/
- ExecStoreHeapTuple(setopstate->grp_firstTuple,
- resultTupleSlot,
- true);
- setopstate->grp_firstTuple = NULL; /* don't keep two pointers */
+ if (setopstate->leftInput.needGroup)
+ setop_load_group(&setopstate->leftInput, outerPlan, setopstate);
- /* Initialize working state for a new input tuple group */
- initialize_counts(pergroup);
+ /*
+ * If no more outer groups, we're done, and don't need to look at any
+ * more of the inner relation.
+ */
+ if (setopstate->leftInput.numTuples == 0)
+ {
+ setopstate->setop_done = true;
+ break;
+ }
- /* Count the first input tuple */
- advance_counts(pergroup,
- fetch_tuple_flag(setopstate, resultTupleSlot));
+ /*
+ * Fetch the rest of the current inner group, if we didn't already.
+ */
+ if (setopstate->rightInput.needGroup)
+ setop_load_group(&setopstate->rightInput, innerPlan, setopstate);
/*
- * Scan the outer plan until we exhaust it or cross a group boundary.
+ * Determine whether we have matching groups on both sides (this is
+ * basically like the core logic of a merge join).
*/
- for (;;)
- {
- outerslot = ExecProcNode(outerPlan);
- if (TupIsNull(outerslot))
- {
- /* no more outer-plan tuples available */
- setopstate->setop_done = true;
- break;
- }
-
- /*
- * Check whether we've crossed a group boundary.
- */
- econtext->ecxt_outertuple = resultTupleSlot;
- econtext->ecxt_innertuple = outerslot;
-
- if (!ExecQualAndReset(setopstate->eqfunction, econtext))
- {
- /*
- * Save the first input tuple of the next group.
- */
- setopstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
- break;
- }
+ if (setopstate->rightInput.numTuples == 0)
+ cmpresult = -1; /* as though left input is lesser */
+ else
+ cmpresult = setop_compare_slots(setopstate->leftInput.firstTupleSlot,
+ setopstate->rightInput.firstTupleSlot,
+ setopstate);
- /* Still in same group, so count this tuple */
- advance_counts(pergroup,
- fetch_tuple_flag(setopstate, outerslot));
+ if (cmpresult < 0)
+ {
+ /* Left group is first, has no right matches */
+ pergroup.numLeft = setopstate->leftInput.numTuples;
+ pergroup.numRight = 0;
+ /* We'll need another left group next time */
+ setopstate->leftInput.needGroup = true;
+ }
+ else if (cmpresult == 0)
+ {
+ /* We have matching groups */
+ pergroup.numLeft = setopstate->leftInput.numTuples;
+ pergroup.numRight = setopstate->rightInput.numTuples;
+ /* We'll need to read from both sides next time */
+ setopstate->leftInput.needGroup = true;
+ setopstate->rightInput.needGroup = true;
+ }
+ else
+ {
+ /* Right group has no left matches, so we can ignore it */
+ setopstate->rightInput.needGroup = true;
+ continue;
}
/*
- * Done scanning input tuple group. See if we should emit any copies
- * of result tuple, and if so return the first copy.
+ * Done scanning these input tuple groups. See if we should emit any
+ * copies of result tuple, and if so return the first copy. (Note
+ * that the result tuple is the same as the left input's firstTuple
+ * slot.)
*/
- set_output_count(setopstate, pergroup);
+ set_output_count(setopstate, &pergroup);
if (setopstate->numOutput > 0)
{
@@ -334,84 +318,199 @@ setop_retrieve_direct(SetOpState *setopstate)
}
/*
- * ExecSetOp for hashed case: phase 1, read input and build hash table
+ * Load next group of tuples from one child plan or the other.
+ *
+ * On entry, we've already read the first tuple of the next group
+ * (if there is one) into input->nextTupleSlot. This invariant
+ * is maintained on exit.
+ */
+static void
+setop_load_group(SetOpStatePerInput *input, PlanState *inputPlan,
+ SetOpState *setopstate)
+{
+ input->needGroup = false;
+
+ /* If we've exhausted this child plan, report an empty group */
+ if (TupIsNull(input->nextTupleSlot))
+ {
+ ExecClearTuple(input->firstTupleSlot);
+ input->numTuples = 0;
+ return;
+ }
+
+ /* Make a local copy of the first tuple for comparisons */
+ ExecStoreMinimalTuple(ExecCopySlotMinimalTuple(input->nextTupleSlot),
+ input->firstTupleSlot,
+ true);
+ /* and count it */
+ input->numTuples = 1;
+
+ /* Scan till we find the end-of-group */
+ for (;;)
+ {
+ int cmpresult;
+
+ /* Get next input tuple, if there is one */
+ input->nextTupleSlot = ExecProcNode(inputPlan);
+ if (TupIsNull(input->nextTupleSlot))
+ break;
+
+ /* There is; does it belong to same group as firstTuple? */
+ cmpresult = setop_compare_slots(input->firstTupleSlot,
+ input->nextTupleSlot,
+ setopstate);
+ Assert(cmpresult <= 0); /* else input is mis-sorted */
+ if (cmpresult != 0)
+ break;
+
+ /* Still in same group, so count this tuple */
+ input->numTuples++;
+ }
+}
+
+/*
+ * Compare the tuples in the two given slots.
+ */
+static int
+setop_compare_slots(TupleTableSlot *s1, TupleTableSlot *s2,
+ SetOpState *setopstate)
+{
+ /* We'll often need to fetch all the columns, so just do it */
+ slot_getallattrs(s1);
+ slot_getallattrs(s2);
+ for (int nkey = 0; nkey < setopstate->numCols; nkey++)
+ {
+ SortSupport sortKey = setopstate->sortKeys + nkey;
+ AttrNumber attno = sortKey->ssup_attno;
+ Datum datum1 = s1->tts_values[attno - 1],
+ datum2 = s2->tts_values[attno - 1];
+ bool isNull1 = s1->tts_isnull[attno - 1],
+ isNull2 = s2->tts_isnull[attno - 1];
+ int compare;
+
+ compare = ApplySortComparator(datum1, isNull1,
+ datum2, isNull2,
+ sortKey);
+ if (compare != 0)
+ return compare;
+ }
+ return 0;
+}
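
The sorted-mode control flow above (load a group from each side, compare first tuples, advance the side that compares lower) is the classic merge-join-style walk. As a toy model under simplifying assumptions (each input is a pre-sorted int array and a "group" is a run of equal keys; all names here are illustrative, not the patch's symbols):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { const int *vals; size_t len; size_t pos; } Input;

/* Mirror of setop_load_group(): count the run of equal keys starting at
 * pos, advance past it, and report the group's key and size (0 = EOF). */
static long
load_group(Input *in, int *key)
{
    long n = 0;
    if (in->pos >= in->len)
        return 0;               /* input exhausted: empty group */
    *key = in->vals[in->pos];
    while (in->pos < in->len && in->vals[in->pos] == *key)
    {
        in->pos++;
        n++;
    }
    return n;
}

/* Count distinct keys present in both inputs (plain INTERSECT cardinality). */
static long
intersect_groups(Input *l, Input *r)
{
    long result = 0;
    int lkey = 0, rkey = 0;
    long nl = load_group(l, &lkey);
    long nr = load_group(r, &rkey);

    while (nl > 0 && nr > 0)
    {
        if (lkey < rkey)
            nl = load_group(l, &lkey);      /* left group has no right match */
        else if (lkey > rkey)
            nr = load_group(r, &rkey);      /* right group has no left match */
        else
        {
            result++;                       /* matching groups */
            nl = load_group(l, &lkey);
            nr = load_group(r, &rkey);
        }
    }
    return result;
}
```

In the real node, setop_compare_slots() plays the role of the integer comparison, using the per-column SortSupport comparators so that multi-column keys and NULL ordering behave exactly as the input Sort nodes arranged them.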
+
+/*
+ * Prepare an input tuple for hashtable lookup.
+ *
+ * We could use the input slot directly except that LookupTupleHashEntry
+ * only accepts MinimalTuple slots, which is typically not what our child
+ * plans will return. We abuse the SetOp node's ResultTupleSlot as a
+ * temporary slot for this purpose.
+ */
+static TupleTableSlot *
+setop_load_hash_tuple(SetOpState *setopstate, TupleTableSlot *inputslot)
+{
+ TupleTableSlot *resultslot = setopstate->ps.ps_ResultTupleSlot;
+ int natts = resultslot->tts_tupleDescriptor->natts;
+
+ slot_getallattrs(inputslot);
+ ExecClearTuple(resultslot);
+ for (int i = 0; i < natts; i++)
+ {
+ resultslot->tts_values[i] = inputslot->tts_values[i];
+ resultslot->tts_isnull[i] = inputslot->tts_isnull[i];
+ }
+ ExecStoreVirtualTuple(resultslot);
+ return resultslot;
+}
+
+/*
+ * ExecSetOp for hashed case: phase 1, read inputs and build hash table
*/
static void
setop_fill_hash_table(SetOpState *setopstate)
{
- SetOp *node = (SetOp *) setopstate->ps.plan;
PlanState *outerPlan;
- int firstFlag;
- bool in_first_rel PG_USED_FOR_ASSERTS_ONLY;
+ PlanState *innerPlan;
ExprContext *econtext = setopstate->ps.ps_ExprContext;
+ bool have_tuples = false;
/*
* get state info from node
*/
outerPlan = outerPlanState(setopstate);
- firstFlag = node->firstFlag;
- /* verify planner didn't mess up */
- Assert(firstFlag == 0 ||
- (firstFlag == 1 &&
- (node->cmd == SETOPCMD_INTERSECT ||
- node->cmd == SETOPCMD_INTERSECT_ALL)));
+ innerPlan = innerPlanState(setopstate);
/*
* Process each outer-plan tuple, and then fetch the next one, until we
* exhaust the outer plan.
*/
- in_first_rel = true;
for (;;)
{
TupleTableSlot *outerslot;
- int flag;
TupleHashEntryData *entry;
bool isnew;
outerslot = ExecProcNode(outerPlan);
if (TupIsNull(outerslot))
break;
+ have_tuples = true;
- /* Identify whether it's left or right input */
- flag = fetch_tuple_flag(setopstate, outerslot);
+ /* Find or build hashtable entry for this tuple's group */
+ entry = LookupTupleHashEntry(setopstate->hashtable,
+ setop_load_hash_tuple(setopstate,
+ outerslot),
+ &isnew, NULL);
- if (flag == firstFlag)
+ /* If new tuple group, initialize counts to zero */
+ if (isnew)
{
- /* (still) in first input relation */
- Assert(in_first_rel);
-
- /* Find or build hashtable entry for this tuple's group */
- entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
- &isnew, NULL);
-
- /* If new tuple group, initialize counts */
- if (isnew)
- {
- entry->additional = (SetOpStatePerGroup)
- MemoryContextAlloc(setopstate->hashtable->tablecxt,
+ entry->additional = (SetOpStatePerGroup)
+ MemoryContextAllocZero(setopstate->hashtable->tablecxt,
sizeof(SetOpStatePerGroupData));
- initialize_counts((SetOpStatePerGroup) entry->additional);
- }
-
- /* Advance the counts */
- advance_counts((SetOpStatePerGroup) entry->additional, flag);
}
- else
+
+ /* Advance the counts */
+ ((SetOpStatePerGroup) entry->additional)->numLeft++;
+
+ /* Must reset expression context after each hashtable lookup */
+ ResetExprContext(econtext);
+ }
+
+ /*
+ * If the outer relation is empty, then we will emit nothing, and we don't
+ * need to read the inner relation at all.
+ */
+ if (have_tuples)
+ {
+ /*
+ * Process each inner-plan tuple, and then fetch the next one, until
+ * we exhaust the inner plan.
+ */
+ for (;;)
{
- /* reached second relation */
- in_first_rel = false;
+ TupleTableSlot *innerslot;
+ TupleHashEntryData *entry;
+
+ innerslot = ExecProcNode(innerPlan);
+ if (TupIsNull(innerslot))
+ break;
/* For tuples not seen previously, do not make hashtable entry */
- entry = LookupTupleHashEntry(setopstate->hashtable, outerslot,
+ entry = LookupTupleHashEntry(setopstate->hashtable,
+ setop_load_hash_tuple(setopstate,
+ innerslot),
NULL, NULL);
/* Advance the counts if entry is already present */
if (entry)
- advance_counts((SetOpStatePerGroup) entry->additional, flag);
- }
+ ((SetOpStatePerGroup) entry->additional)->numRight++;
- /* Must reset expression context after each hashtable lookup */
- ResetExprContext(econtext);
+ /* Must reset expression context after each hashtable lookup */
+ ResetExprContext(econtext);
+ }
}
setopstate->table_filled = true;
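
The two-phase hashed scheme can be modeled without any of the executor machinery. In this sketch keys are small ints indexing a table directly (a stand-in for LookupTupleHashEntry, which in the real code also handles the isnew/NULL distinction); all names are illustrative:

```c
#include <assert.h>
#include <string.h>

#define NKEYS 16

typedef struct { int seen; long numLeft; long numRight; } Entry;

static void
fill_counts(Entry tab[NKEYS], const int *left, int nleft,
            const int *right, int nright)
{
    memset(tab, 0, NKEYS * sizeof(Entry));

    /* Phase 1: left input creates entries and counts its tuples */
    for (int i = 0; i < nleft; i++)
    {
        Entry *e = &tab[left[i]];
        e->seen = 1;
        e->numLeft++;
    }

    /* Empty left input means nothing can be emitted; skip the right side */
    if (nleft == 0)
        return;

    /* Phase 2: right input only updates entries phase 1 created,
     * mirroring the "do not make hashtable entry" rule above */
    for (int i = 0; i < nright; i++)
    {
        Entry *e = &tab[right[i]];
        if (e->seen)
            e->numRight++;
    }
}
```

Not creating entries for right-only groups is what keeps the hash table sized by the left input alone; such groups can never be output by INTERSECT or EXCEPT anyway.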
@@ -482,7 +581,6 @@ SetOpState *
ExecInitSetOp(SetOp *node, EState *estate, int eflags)
{
SetOpState *setopstate;
- TupleDesc outerDesc;
/* check for unsupported flags */
Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -495,14 +593,10 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
setopstate->ps.state = estate;
setopstate->ps.ExecProcNode = ExecSetOp;
- setopstate->eqfuncoids = NULL;
- setopstate->hashfunctions = NULL;
setopstate->setop_done = false;
setopstate->numOutput = 0;
- setopstate->pergroup = NULL;
- setopstate->grp_firstTuple = NULL;
- setopstate->hashtable = NULL;
- setopstate->tableContext = NULL;
+ setopstate->numCols = node->numCols;
+ setopstate->need_init = true;
/*
* create expression context
@@ -523,52 +617,72 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
/*
* initialize child nodes
*
- * If we are hashing then the child plan does not need to handle REWIND
+ * If we are hashing then the child plans do not need to handle REWIND
* efficiently; see ExecReScanSetOp.
*/
if (node->strategy == SETOP_HASHED)
eflags &= ~EXEC_FLAG_REWIND;
outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
- outerDesc = ExecGetResultType(outerPlanState(setopstate));
+ innerPlanState(setopstate) = ExecInitNode(innerPlan(node), estate, eflags);
/*
- * Initialize result slot and type. Setop nodes do no projections, so
- * initialize projection info for this node appropriately.
+ * Initialize locally-allocated slots. In hashed mode, we just need a
+ * result slot. In sorted mode, we need one first-tuple-of-group slot for
+ * each input; we use the result slot for the left input's slot and create
+ * another for the right input. (Note: the nextTupleSlot slots are not
+ * ours, but just point to the last slot returned by the input plan node.)
*/
- ExecInitResultTupleSlotTL(&setopstate->ps,
- node->strategy == SETOP_HASHED ?
- &TTSOpsMinimalTuple : &TTSOpsHeapTuple);
+ ExecInitResultTupleSlotTL(&setopstate->ps, &TTSOpsMinimalTuple);
+ if (node->strategy != SETOP_HASHED)
+ {
+ setopstate->leftInput.firstTupleSlot =
+ setopstate->ps.ps_ResultTupleSlot;
+ setopstate->rightInput.firstTupleSlot =
+ ExecInitExtraTupleSlot(estate,
+ setopstate->ps.ps_ResultTupleDesc,
+ &TTSOpsMinimalTuple);
+ }
+
+ /* Setop nodes do no projections. */
setopstate->ps.ps_ProjInfo = NULL;
/*
- * Precompute fmgr lookup data for inner loop. We need both equality and
- * hashing functions to do it by hashing, but only equality if not
- * hashing.
+ * Precompute fmgr lookup data for inner loop. We need equality and
+ * hashing functions to do it by hashing, while for sorting we need
+ * SortSupport data.
*/
if (node->strategy == SETOP_HASHED)
execTuplesHashPrepare(node->numCols,
- node->dupOperators,
+ node->cmpOperators,
&setopstate->eqfuncoids,
&setopstate->hashfunctions);
else
- setopstate->eqfunction =
- execTuplesMatchPrepare(outerDesc,
- node->numCols,
- node->dupColIdx,
- node->dupOperators,
- node->dupCollations,
- &setopstate->ps);
+ {
+ int nkeys = node->numCols;
+
+ setopstate->sortKeys = (SortSupport)
+ palloc0(nkeys * sizeof(SortSupportData));
+ for (int i = 0; i < nkeys; i++)
+ {
+ SortSupport sortKey = setopstate->sortKeys + i;
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = node->cmpCollations[i];
+ sortKey->ssup_nulls_first = node->cmpNullsFirst[i];
+ sortKey->ssup_attno = node->cmpColIdx[i];
+ /* abbreviated key conversion is not useful here */
+ sortKey->abbreviate = false;
+
+ PrepareSortSupportFromOrderingOp(node->cmpOperators[i], sortKey);
+ }
+ }
+
+ /* Create a hash table if needed */
if (node->strategy == SETOP_HASHED)
{
build_hash_table(setopstate);
setopstate->table_filled = false;
}
- else
- {
- setopstate->pergroup =
- (SetOpStatePerGroup) palloc0(sizeof(SetOpStatePerGroupData));
- }
return setopstate;
}
@@ -576,7 +690,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
/* ----------------------------------------------------------------
* ExecEndSetOp
*
- * This shuts down the subplan and frees resources allocated
+ * This shuts down the subplans and frees resources allocated
* to this node.
* ----------------------------------------------------------------
*/
@@ -588,6 +702,7 @@ ExecEndSetOp(SetOpState *node)
MemoryContextDelete(node->tableContext);
ExecEndNode(outerPlanState(node));
+ ExecEndNode(innerPlanState(node));
}
@@ -595,6 +710,7 @@ void
ExecReScanSetOp(SetOpState *node)
{
PlanState *outerPlan = outerPlanState(node);
+ PlanState *innerPlan = innerPlanState(node);
ExecClearTuple(node->ps.ps_ResultTupleSlot);
node->setop_done = false;
@@ -612,34 +728,29 @@ ExecReScanSetOp(SetOpState *node)
return;
/*
- * If we do have the hash table and the subplan does not have any
+ * If we do have the hash table and the subplans do not have any
* parameter changes, then we can just rescan the existing hash table;
* no need to build it again.
*/
- if (outerPlan->chgParam == NULL)
+ if (outerPlan->chgParam == NULL && innerPlan->chgParam == NULL)
{
ResetTupleHashIterator(node->hashtable, &node->hashiter);
return;
}
- }
- /* Release first tuple of group, if we have made a copy */
- if (node->grp_firstTuple != NULL)
- {
- heap_freetuple(node->grp_firstTuple);
- node->grp_firstTuple = NULL;
- }
-
- /* Release any hashtable storage */
- if (node->tableContext)
- MemoryContextReset(node->tableContext);
+ /* Release any hashtable storage */
+ if (node->tableContext)
+ MemoryContextReset(node->tableContext);
- /* And rebuild empty hashtable if needed */
- if (((SetOp *) node->ps.plan)->strategy == SETOP_HASHED)
- {
+ /* And rebuild an empty hashtable */
ResetTupleHashTable(node->hashtable);
node->table_filled = false;
}
+ else
+ {
+ /* Need to re-read first input from each side */
+ node->need_init = true;
+ }
/*
* if chgParam of subnode is not null then plan will be re-scanned by
@@ -647,4 +758,6 @@ ExecReScanSetOp(SetOpState *node)
*/
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
+ if (innerPlan->chgParam == NULL)
+ ExecReScan(innerPlan);
}
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 2ab4f3dbf3..f341d9f303 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -649,7 +649,7 @@ RelOptInfo - a relation or joined relations
GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
WindowAggPath - a WindowAgg plan node applied to some sub-path
- SetOpPath - a SetOp plan node applied to some sub-path
+ SetOpPath - a SetOp plan node applied to two sub-paths
RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
LockRowsPath - a LockRows plan node applied to some sub-path
ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 178c572b02..b3e2294e84 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -301,9 +301,9 @@ static Unique *make_unique_from_pathkeys(Plan *lefttree,
List *pathkeys, int numCols);
static Gather *make_gather(List *qptlist, List *qpqual,
int nworkers, int rescan_param, bool single_copy, Plan *subplan);
-static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
- List *distinctList, AttrNumber flagColIdx, int firstFlag,
- long numGroups);
+static SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+ List *tlist, Plan *lefttree, Plan *righttree,
+ List *groupList, long numGroups);
static LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
static Result *make_result(List *tlist, Node *resconstantqual, Plan *subplan);
static ProjectSet *make_project_set(List *tlist, Plan *subplan);
@@ -2719,25 +2719,29 @@ static SetOp *
create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
{
SetOp *plan;
- Plan *subplan;
+ List *tlist = build_path_tlist(root, &best_path->path);
+ Plan *leftplan;
+ Plan *rightplan;
long numGroups;
/*
* SetOp doesn't project, so tlist requirements pass through; moreover we
* need grouping columns to be labeled.
*/
- subplan = create_plan_recurse(root, best_path->subpath,
- flags | CP_LABEL_TLIST);
+ leftplan = create_plan_recurse(root, best_path->leftpath,
+ flags | CP_LABEL_TLIST);
+ rightplan = create_plan_recurse(root, best_path->rightpath,
+ flags | CP_LABEL_TLIST);
/* Convert numGroups to long int --- but 'ware overflow! */
numGroups = clamp_cardinality_to_long(best_path->numGroups);
plan = make_setop(best_path->cmd,
best_path->strategy,
- subplan,
- best_path->distinctList,
- best_path->flagColIdx,
- best_path->firstFlag,
+ tlist,
+ leftplan,
+ rightplan,
+ best_path->groupList,
numGroups);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6950,57 +6954,62 @@ make_gather(List *qptlist,
}
/*
- * distinctList is a list of SortGroupClauses, identifying the targetlist
- * items that should be considered by the SetOp filter. The input path must
- * already be sorted accordingly.
+ * groupList is a list of SortGroupClauses, identifying the targetlist
+ * items that should be considered by the SetOp filter. The input plans must
+ * already be sorted accordingly, if we're doing SETOP_SORTED mode.
*/
static SetOp *
-make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
- List *distinctList, AttrNumber flagColIdx, int firstFlag,
- long numGroups)
+make_setop(SetOpCmd cmd, SetOpStrategy strategy,
+ List *tlist, Plan *lefttree, Plan *righttree,
+ List *groupList, long numGroups)
{
SetOp *node = makeNode(SetOp);
Plan *plan = &node->plan;
- int numCols = list_length(distinctList);
+ int numCols = list_length(groupList);
int keyno = 0;
- AttrNumber *dupColIdx;
- Oid *dupOperators;
- Oid *dupCollations;
+ AttrNumber *cmpColIdx;
+ Oid *cmpOperators;
+ Oid *cmpCollations;
+ bool *cmpNullsFirst;
ListCell *slitem;
- plan->targetlist = lefttree->targetlist;
+ plan->targetlist = tlist;
plan->qual = NIL;
plan->lefttree = lefttree;
- plan->righttree = NULL;
+ plan->righttree = righttree;
/*
- * convert SortGroupClause list into arrays of attr indexes and equality
+ * convert SortGroupClause list into arrays of attr indexes and comparison
* operators, as wanted by executor
*/
- dupColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
- dupOperators = (Oid *) palloc(sizeof(Oid) * numCols);
- dupCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+ cmpColIdx = (AttrNumber *) palloc(sizeof(AttrNumber) * numCols);
+ cmpOperators = (Oid *) palloc(sizeof(Oid) * numCols);
+ cmpCollations = (Oid *) palloc(sizeof(Oid) * numCols);
+ cmpNullsFirst = (bool *) palloc(sizeof(bool) * numCols);
- foreach(slitem, distinctList)
+ foreach(slitem, groupList)
{
SortGroupClause *sortcl = (SortGroupClause *) lfirst(slitem);
TargetEntry *tle = get_sortgroupclause_tle(sortcl, plan->targetlist);
- dupColIdx[keyno] = tle->resno;
- dupOperators[keyno] = sortcl->eqop;
- dupCollations[keyno] = exprCollation((Node *) tle->expr);
- Assert(OidIsValid(dupOperators[keyno]));
+ cmpColIdx[keyno] = tle->resno;
+ if (strategy == SETOP_HASHED)
+ cmpOperators[keyno] = sortcl->eqop;
+ else
+ cmpOperators[keyno] = sortcl->sortop;
+ Assert(OidIsValid(cmpOperators[keyno]));
+ cmpCollations[keyno] = exprCollation((Node *) tle->expr);
+ cmpNullsFirst[keyno] = sortcl->nulls_first;
keyno++;
}
node->cmd = cmd;
node->strategy = strategy;
node->numCols = numCols;
- node->dupColIdx = dupColIdx;
- node->dupOperators = dupOperators;
- node->dupCollations = dupCollations;
- node->flagColIdx = flagColIdx;
- node->firstFlag = firstFlag;
+ node->cmpColIdx = cmpColIdx;
+ node->cmpOperators = cmpOperators;
+ node->cmpCollations = cmpCollations;
+ node->cmpNullsFirst = cmpNullsFirst;
node->numGroups = numGroups;
return node;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 9c3822f19a..1c43370005 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -65,7 +65,7 @@ static List *plan_union_children(PlannerInfo *root,
List **istrivial_tlist);
static void postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel);
static bool choose_hashed_setop(PlannerInfo *root, List *groupClauses,
- Path *input_path,
+ Path *lpath, Path *rpath,
double dNumGroups, double dNumOutputRows,
const char *construct);
static List *generate_setop_tlist(List *colTypes, List *colCollations,
@@ -315,8 +315,8 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
* to the corresponding tlist entries of the subplan. However, since
* the subplan was generated by generate_union_paths() or
* generate_nonunion_paths(), and hence its tlist was generated by
- * generate_append_tlist(), this will work. We just tell
- * generate_setop_tlist() to use varno 0.
+ * generate_append_tlist() or generate_setop_tlist(), this will work.
+ * We just tell generate_setop_tlist() to use varno 0.
*/
if (flag >= 0 ||
!tlist_same_datatypes(*pTargetList, colTypes, junkOK) ||
@@ -1028,29 +1028,27 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
*path;
List *lpath_tlist,
*rpath_tlist,
- *tlist_list,
*tlist,
- *groupList,
- *pathlist;
+ *groupList;
bool lpath_trivial_tlist,
- rpath_trivial_tlist;
+ rpath_trivial_tlist,
+ result_trivial_tlist;
double dLeftGroups,
dRightGroups,
dNumGroups,
dNumOutputRows;
bool use_hash;
SetOpCmd cmd;
- int firstFlag;
/*
* Tell children to fetch all tuples.
*/
root->tuple_fraction = 0.0;
- /* Recurse on children, ensuring their outputs are marked */
+ /* Recurse on children */
lrel = recurse_set_operations(op->larg, root,
op->colTypes, op->colCollations,
- false, 0,
+ false, -1,
refnames_tlist,
&lpath_tlist,
&lpath_trivial_tlist);
@@ -1060,10 +1058,9 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
else
dLeftGroups = lrel->rows;
- lpath = lrel->cheapest_total_path;
rrel = recurse_set_operations(op->rarg, root,
op->colTypes, op->colCollations,
- false, 1,
+ false, -1,
refnames_tlist,
&rpath_tlist,
&rpath_trivial_tlist);
@@ -1073,41 +1070,51 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
else
dRightGroups = rrel->rows;
- rpath = rrel->cheapest_total_path;
-
/* Undo effects of forcing tuple_fraction to 0 */
root->tuple_fraction = save_fraction;
/*
* For EXCEPT, we must put the left input first. For INTERSECT, either
* order should give the same results, and we prefer to put the smaller
- * input first in order to minimize the size of the hash table in the
- * hashing case. "Smaller" means the one with the fewer groups.
+ * input first in order to (a) minimize the size of the hash table in the
+ * hashing case, and (b) improve our chances of exploiting the executor's
+ * fast path for empty left-hand input. "Smaller" means the one with the
+ * fewer groups.
*/
- if (op->op == SETOP_EXCEPT || dLeftGroups <= dRightGroups)
- {
- pathlist = list_make2(lpath, rpath);
- tlist_list = list_make2(lpath_tlist, rpath_tlist);
- firstFlag = 0;
- }
- else
+ if (op->op != SETOP_EXCEPT && dLeftGroups > dRightGroups)
{
- pathlist = list_make2(rpath, lpath);
- tlist_list = list_make2(rpath_tlist, lpath_tlist);
- firstFlag = 1;
+ /* need to swap the two inputs */
+ RelOptInfo *tmprel;
+ List *tmplist;
+ double tmpd;
+
+ tmprel = lrel;
+ lrel = rrel;
+ rrel = tmprel;
+ tmplist = lpath_tlist;
+ lpath_tlist = rpath_tlist;
+ rpath_tlist = tmplist;
+ tmpd = dLeftGroups;
+ dLeftGroups = dRightGroups;
+ dRightGroups = tmpd;
}
+ lpath = lrel->cheapest_total_path;
+ rpath = rrel->cheapest_total_path;
+
/*
- * Generate tlist for Append plan node.
+ * Generate tlist for SetOp plan node.
*
- * The tlist for an Append plan isn't important as far as the Append is
+ * The tlist for a SetOp plan isn't important so far as the SetOp is
* concerned, but we must make it look real anyway for the benefit of the
- * next plan level up. In fact, it has to be real enough that the flag
- * column is shown as a variable not a constant, else setrefs.c will get
- * confused.
+ * next plan level up.
*/
- tlist = generate_append_tlist(op->colTypes, op->colCollations, true,
- tlist_list, refnames_tlist);
+ tlist = generate_setop_tlist(op->colTypes, op->colCollations, -1,
+ 0, false, lpath_tlist, refnames_tlist,
+ &result_trivial_tlist);
+
+ /* We should not have needed any type coercions in the tlist */
+ Assert(result_trivial_tlist);
*pTargetList = tlist;
@@ -1116,12 +1123,6 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
bms_union(lrel->relids, rrel->relids));
result_rel->reltarget = create_pathtarget(root, tlist);
- /*
- * Append the child results together.
- */
- path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
- NIL, NULL, 0, false, -1);
-
/* Identify the grouping semantics */
groupList = generate_setop_grouplist(op, tlist);
@@ -1140,25 +1141,40 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
}
else
{
- dNumGroups = Min(dLeftGroups, dRightGroups);
+ dNumGroups = dLeftGroups;
dNumOutputRows = op->all ? Min(lpath->rows, rpath->rows) : dNumGroups;
}
/*
- * Decide whether to hash or sort, and add a sort node if needed.
+ * Decide whether to hash or sort, and add sort nodes if needed.
*/
- use_hash = choose_hashed_setop(root, groupList, path,
+ use_hash = choose_hashed_setop(root, groupList, lpath, rpath,
dNumGroups, dNumOutputRows,
(op->op == SETOP_INTERSECT) ? "INTERSECT" : "EXCEPT");
if (groupList && !use_hash)
- path = (Path *) create_sort_path(root,
- result_rel,
- path,
- make_pathkeys_for_sortclauses(root,
- groupList,
- tlist),
- -1.0);
+ {
+ List *pathkeys;
+
+ pathkeys = make_pathkeys_for_sortclauses(root,
+ groupList,
+ lpath_tlist);
+ if (!pathkeys_contained_in(pathkeys, lpath->pathkeys))
+ lpath = (Path *) create_sort_path(root,
+ lpath->parent,
+ lpath,
+ pathkeys,
+ -1.0);
+ pathkeys = make_pathkeys_for_sortclauses(root,
+ groupList,
+ rpath_tlist);
+ if (!pathkeys_contained_in(pathkeys, rpath->pathkeys))
+ rpath = (Path *) create_sort_path(root,
+ rpath->parent,
+ rpath,
+ pathkeys,
+ -1.0);
+ }
/*
* Finally, add a SetOp path node to generate the correct output.
@@ -1178,12 +1194,11 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
}
path = (Path *) create_setop_path(root,
result_rel,
- path,
+ lpath,
+ rpath,
cmd,
use_hash ? SETOP_HASHED : SETOP_SORTED,
groupList,
- list_length(op->colTypes) + 1,
- use_hash ? firstFlag : -1,
dNumGroups,
dNumOutputRows);
@@ -1285,10 +1300,13 @@ postprocess_setop_rel(PlannerInfo *root, RelOptInfo *rel)
/*
* choose_hashed_setop - should we use hashing for a set operation?
+ *
+ * XXX probably this should go away: just make both paths and let
+ * add_path sort it out.
*/
static bool
choose_hashed_setop(PlannerInfo *root, List *groupClauses,
- Path *input_path,
+ Path *lpath, Path *rpath,
double dNumGroups, double dNumOutputRows,
const char *construct)
{
@@ -1327,7 +1345,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
* Don't do it if it doesn't look like the hashtable will fit into
* hash_mem.
*/
- hashentrysize = MAXALIGN(input_path->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
+ hashentrysize = MAXALIGN(lpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);
if (hashentrysize * dNumGroups > hash_mem_limit)
return false;
@@ -1336,9 +1354,9 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
* See if the estimated cost is no more than doing it the other way.
*
* We need to consider input_plan + hashagg versus input_plan + sort +
- * group. Note that the actual result plan might involve a SetOp or
- * Unique node, not Agg or Group, but the cost estimates for Agg and Group
- * should be close enough for our purposes here.
+ * group. XXX NOT TRUE: Note that the actual result plan might involve a
+ * SetOp or Unique node, not Agg or Group, but the cost estimates for Agg
+ * and Group should be close enough for our purposes here.
*
* These path variables are dummies that just hold cost fields; we don't
* make actual Paths for these steps.
@@ -1346,27 +1364,31 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
cost_agg(&hashed_p, root, AGG_HASHED, NULL,
numGroupCols, dNumGroups,
NIL,
- input_path->disabled_nodes,
- input_path->startup_cost, input_path->total_cost,
- input_path->rows, input_path->pathtarget->width);
+ lpath->disabled_nodes + rpath->disabled_nodes,
+ lpath->startup_cost + rpath->startup_cost,
+ lpath->total_cost + rpath->total_cost,
+ lpath->rows + rpath->rows,
+ lpath->pathtarget->width);
/*
- * Now for the sorted case. Note that the input is *always* unsorted,
- * since it was made by appending unrelated sub-relations together.
+ * Now for the sorted case. XXX NOT TRUE: Note that the input is *always*
+ * unsorted, since it was made by appending unrelated sub-relations
+ * together.
*/
- sorted_p.disabled_nodes = input_path->disabled_nodes;
- sorted_p.startup_cost = input_path->startup_cost;
- sorted_p.total_cost = input_path->total_cost;
+ sorted_p.disabled_nodes = lpath->disabled_nodes + rpath->disabled_nodes;
+ sorted_p.startup_cost = lpath->startup_cost + rpath->startup_cost;
+ sorted_p.total_cost = lpath->total_cost + rpath->total_cost;
/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
cost_sort(&sorted_p, root, NIL, sorted_p.disabled_nodes,
sorted_p.total_cost,
- input_path->rows, input_path->pathtarget->width,
+ lpath->rows + rpath->rows,
+ lpath->pathtarget->width,
0.0, work_mem, -1.0);
cost_group(&sorted_p, root, numGroupCols, dNumGroups,
NIL,
sorted_p.disabled_nodes,
sorted_p.startup_cost, sorted_p.total_cost,
- input_path->rows);
+ lpath->rows + rpath->rows);
/*
* Now make the decision using the top-level tuple fraction. First we
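
The shape of the final decision, stripped of the cost_agg/cost_sort plumbing, is a lexicographic comparison in the style of the planner's path comparisons: disabled-node counts dominate, then total cost. This is a simplified sketch (it omits the tuple-fraction refinement the real choose_hashed_setop() applies, and the names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int disabled_nodes; double total_cost; } PlanCost;

/* Prefer the hashed plan iff it has fewer disabled nodes, or equal
 * disabled nodes and lower total cost. */
static bool
prefer_hashed(PlanCost hashed, PlanCost sorted)
{
    if (hashed.disabled_nodes != sorted.disabled_nodes)
        return hashed.disabled_nodes < sorted.disabled_nodes;
    return hashed.total_cost < sorted.total_cost;
}
```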
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fc97bf6ee2..e52e4b1d67 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3634,25 +3634,26 @@ create_windowagg_path(PlannerInfo *root,
* Creates a pathnode that represents computation of INTERSECT or EXCEPT
*
* 'rel' is the parent relation associated with the result
- * 'subpath' is the path representing the source of data
+ * 'leftpath' is the path representing the left-hand source of data
+ * 'rightpath' is the path representing the right-hand source of data
* 'cmd' is the specific semantics (INTERSECT or EXCEPT, with/without ALL)
* 'strategy' is the implementation strategy (sorted or hashed)
- * 'distinctList' is a list of SortGroupClause's representing the grouping
- * 'flagColIdx' is the column number where the flag column will be, if any
- * 'firstFlag' is the flag value for the first input relation when hashing;
- * or -1 when sorting
- * 'numGroups' is the estimated number of distinct groups
+ * 'groupList' is a list of SortGroupClause's representing the grouping
+ * 'numGroups' is the estimated number of distinct groups in left-hand input
* 'outputRows' is the estimated number of output rows
+ *
+ * leftpath and rightpath must produce the same columns. Moreover, if
+ * strategy is SETOP_SORTED, leftpath and rightpath must both be sorted
+ * by all the grouping columns.
*/
SetOpPath *
create_setop_path(PlannerInfo *root,
RelOptInfo *rel,
- Path *subpath,
+ Path *leftpath,
+ Path *rightpath,
SetOpCmd cmd,
SetOpStrategy strategy,
- List *distinctList,
- AttrNumber flagColIdx,
- int firstFlag,
+ List *groupList,
double numGroups,
double outputRows)
{
@@ -3660,34 +3661,37 @@ create_setop_path(PlannerInfo *root,
pathnode->path.pathtype = T_SetOp;
pathnode->path.parent = rel;
- /* SetOp doesn't project, so use source path's pathtarget */
- pathnode->path.pathtarget = subpath->pathtarget;
+ pathnode->path.pathtarget = rel->reltarget;
/* For now, assume we are above any joins, so no parameterization */
pathnode->path.param_info = NULL;
pathnode->path.parallel_aware = false;
pathnode->path.parallel_safe = rel->consider_parallel &&
- subpath->parallel_safe;
- pathnode->path.parallel_workers = subpath->parallel_workers;
+ leftpath->parallel_safe && rightpath->parallel_safe;
+ pathnode->path.parallel_workers =
+ leftpath->parallel_workers + rightpath->parallel_workers;
/* SetOp preserves the input sort order if in sort mode */
pathnode->path.pathkeys =
- (strategy == SETOP_SORTED) ? subpath->pathkeys : NIL;
+ (strategy == SETOP_SORTED) ? leftpath->pathkeys : NIL;
- pathnode->subpath = subpath;
+ pathnode->leftpath = leftpath;
+ pathnode->rightpath = rightpath;
pathnode->cmd = cmd;
pathnode->strategy = strategy;
- pathnode->distinctList = distinctList;
- pathnode->flagColIdx = flagColIdx;
- pathnode->firstFlag = firstFlag;
+ pathnode->groupList = groupList;
pathnode->numGroups = numGroups;
/*
* Charge one cpu_operator_cost per comparison per input tuple. We assume
* all columns get compared at most of the tuples.
+ *
+ * XXX all wrong for hashing
*/
- pathnode->path.disabled_nodes = subpath->disabled_nodes;
- pathnode->path.startup_cost = subpath->startup_cost;
- pathnode->path.total_cost = subpath->total_cost +
- cpu_operator_cost * subpath->rows * list_length(distinctList);
+ pathnode->path.disabled_nodes =
+ leftpath->disabled_nodes + rightpath->disabled_nodes;
+ pathnode->path.startup_cost =
+ leftpath->startup_cost + rightpath->startup_cost;
+ pathnode->path.total_cost = leftpath->total_cost + rightpath->total_cost +
+ cpu_operator_cost * (leftpath->rows + rightpath->rows) * list_length(groupList);
pathnode->path.rows = outputRows;
return pathnode;
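
The sorted-mode cost arithmetic above is simple enough to check by hand: total cost is the sum of both inputs' totals plus one comparator invocation per grouping column per input tuple. A minimal recomputation (CostPath is a stand-in for the relevant Path fields, and 0.25 is an illustrative cpu_operator_cost, not the server default):

```c
#include <assert.h>

typedef struct { double startup_cost; double total_cost; double rows; } CostPath;

/* total = left.total + right.total
 *       + cpu_operator_cost * (left.rows + right.rows) * numGroupCols */
static double
setop_total_cost(const CostPath *l, const CostPath *r,
                 int num_group_cols, double cpu_operator_cost)
{
    return l->total_cost + r->total_cost +
        cpu_operator_cost * (l->rows + r->rows) * num_group_cols;
}
```

As the XXX note in the patch says, this undercharges the hashed case, which also pays hashing and bucket-probe costs per tuple.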
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7f71b7625d..6ca7310fea 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2805,27 +2805,34 @@ typedef struct HashState
/* ----------------
* SetOpState information
*
- * Even in "sorted" mode, SetOp nodes are more complex than a simple
- * Unique, since we have to count how many duplicates to return. But
- * we also support hashing, so this is really more like a cut-down
- * form of Agg.
+ * SetOp nodes support either sorted or hashed de-duplication.
+ * The sorted mode is a bit like MergeJoin, the hashed mode like Agg.
* ----------------
*/
-/* this struct is private in nodeSetOp.c: */
-typedef struct SetOpStatePerGroupData *SetOpStatePerGroup;
+typedef struct SetOpStatePerInput
+{
+ TupleTableSlot *firstTupleSlot; /* first tuple of current group */
+ int64 numTuples; /* number of tuples in current group */
+ TupleTableSlot *nextTupleSlot; /* next input tuple, if already read */
+ bool needGroup; /* do we need to load a new group? */
+} SetOpStatePerInput;
typedef struct SetOpState
{
PlanState ps; /* its first field is NodeTag */
- ExprState *eqfunction; /* equality comparator */
- Oid *eqfuncoids; /* per-grouping-field equality fns */
- FmgrInfo *hashfunctions; /* per-grouping-field hash fns */
bool setop_done; /* indicates completion of output scan */
- long numOutput; /* number of dups left to output */
+ int64 numOutput; /* number of dups left to output */
+ int numCols; /* number of grouping columns */
+
/* these fields are used in SETOP_SORTED mode: */
- SetOpStatePerGroup pergroup; /* per-group working state */
- HeapTuple grp_firstTuple; /* copy of first tuple of current group */
+ SortSupport sortKeys; /* per-grouping-field sort data */
+ SetOpStatePerInput leftInput; /* current outer-relation input state */
+ SetOpStatePerInput rightInput; /* current inner-relation input state */
+ bool need_init; /* have we read the first tuples yet? */
+
/* these fields are used in SETOP_HASHED mode: */
+ Oid *eqfuncoids; /* per-grouping-field equality fns */
+ FmgrInfo *hashfunctions; /* per-grouping-field hash fns */
TupleHashTable hashtable; /* hash table with one entry per group */
MemoryContext tableContext; /* memory context containing hash table */
bool table_filled; /* hash table filled yet? */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0759e00e96..58748d2ca6 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2342,13 +2342,12 @@ typedef struct WindowAggPath
typedef struct SetOpPath
{
Path path;
- Path *subpath; /* path representing input source */
+ Path *leftpath; /* paths representing input sources */
+ Path *rightpath;
SetOpCmd cmd; /* what to do, see nodes.h */
SetOpStrategy strategy; /* how to do it, see nodes.h */
- List *distinctList; /* SortGroupClauses identifying target cols */
- AttrNumber flagColIdx; /* where is the flag column, if any */
- int firstFlag; /* flag value for first input relation */
- Cardinality numGroups; /* estimated number of groups in input */
+ List *groupList; /* SortGroupClauses identifying target cols */
+ Cardinality numGroups; /* estimated number of groups in left input */
} SetOpPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 52f29bcdb6..4633121689 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1225,23 +1225,20 @@ typedef struct SetOp
/* how to do it, see nodes.h */
SetOpStrategy strategy;
- /* number of columns to check for duplicate-ness */
+ /* number of columns to compare */
int numCols;
/* their indexes in the target list */
- AttrNumber *dupColIdx pg_node_attr(array_size(numCols));
+ AttrNumber *cmpColIdx pg_node_attr(array_size(numCols));
- /* equality operators to compare with */
- Oid *dupOperators pg_node_attr(array_size(numCols));
- Oid *dupCollations pg_node_attr(array_size(numCols));
-
- /* where is the flag column, if any */
- AttrNumber flagColIdx;
+ /* comparison operators (either equality operators or sort operators) */
+ Oid *cmpOperators pg_node_attr(array_size(numCols));
+ Oid *cmpCollations pg_node_attr(array_size(numCols));
- /* flag value for first input relation */
- int firstFlag;
+ /* nulls-first flags if sorting, otherwise not interesting */
+ bool *cmpNullsFirst pg_node_attr(array_size(numCols));
- /* estimated number of groups in input */
+ /* estimated number of groups in left input */
long numGroups;
} SetOp;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 1035e6560c..5a6d0350c1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -260,12 +260,11 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
bool topwindow);
extern SetOpPath *create_setop_path(PlannerInfo *root,
RelOptInfo *rel,
- Path *subpath,
+ Path *leftpath,
+ Path *rightpath,
SetOpCmd cmd,
SetOpStrategy strategy,
- List *distinctList,
- AttrNumber flagColIdx,
- int firstFlag,
+ List *groupList,
double numGroups,
double outputRows);
extern RecursiveUnionPath *create_recursiveunion_path(PlannerInfo *root,
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 857ba1c40c..b4cd204de7 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1231,22 +1231,18 @@ select count(*) from
select * from onek i2 where i2.unique1 = o.unique2
) ss
where o.ten = 1;
- QUERY PLAN
-------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------
Aggregate
-> Nested Loop
-> Seq Scan on onek o
Filter: (ten = 1)
- -> Subquery Scan on ss
- -> HashSetOp Except
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Index Scan using onek_unique1 on onek i1
- Index Cond: (unique1 = o.unique1)
- -> Subquery Scan on "*SELECT* 2"
- -> Index Scan using onek_unique1 on onek i2
- Index Cond: (unique1 = o.unique2)
-(13 rows)
+ -> HashSetOp Except
+ -> Index Scan using onek_unique1 on onek i1
+ Index Cond: (unique1 = o.unique1)
+ -> Index Scan using onek_unique1 on onek i2
+ Index Cond: (unique1 = o.unique2)
+(9 rows)
select count(*) from
onek o cross join lateral (
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..fbf24490e9 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -370,17 +370,13 @@ select count(*) from
explain (costs off)
select count(*) from
( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
- QUERY PLAN
-------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------
Aggregate
- -> Subquery Scan on ss
- -> HashSetOp Intersect
- -> Append
- -> Subquery Scan on "*SELECT* 2"
- -> Seq Scan on tenk1
- -> Subquery Scan on "*SELECT* 1"
- -> Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(8 rows)
+ -> HashSetOp Intersect
+ -> Seq Scan on tenk1
+ -> Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(4 rows)
select count(*) from
( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -391,16 +387,13 @@ select count(*) from
explain (costs off)
select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
- QUERY PLAN
-------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------
HashSetOp Except
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Index Only Scan using tenk1_unique1 on tenk1
- -> Subquery Scan on "*SELECT* 2"
- -> Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
- Filter: (unique2 <> 10)
-(7 rows)
+ -> Index Only Scan using tenk1_unique1 on tenk1
+ -> Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+ Filter: (unique2 <> 10)
+(4 rows)
select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
unique1
@@ -434,19 +427,17 @@ select count(*) from
explain (costs off)
select count(*) from
( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
- QUERY PLAN
-------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------
Aggregate
- -> Subquery Scan on ss
- -> SetOp Intersect
- -> Sort
- Sort Key: "*SELECT* 2".fivethous
- -> Append
- -> Subquery Scan on "*SELECT* 2"
- -> Seq Scan on tenk1
- -> Subquery Scan on "*SELECT* 1"
- -> Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
-(10 rows)
+ -> SetOp Intersect
+ -> Sort
+ Sort Key: tenk1.fivethous
+ -> Seq Scan on tenk1
+ -> Sort
+ Sort Key: tenk1_1.unique1
+ -> Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+(8 rows)
select count(*) from
( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -457,18 +448,17 @@ select count(*) from
explain (costs off)
select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
- QUERY PLAN
-------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------
SetOp Except
-> Sort
- Sort Key: "*SELECT* 1".unique1
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Index Only Scan using tenk1_unique1 on tenk1
- -> Subquery Scan on "*SELECT* 2"
- -> Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
- Filter: (unique2 <> 10)
-(9 rows)
+ Sort Key: tenk1.unique1
+ -> Index Only Scan using tenk1_unique1 on tenk1
+ -> Sort
+ Sort Key: tenk1_1.unique2
+ -> Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+ Filter: (unique2 <> 10)
+(8 rows)
select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
unique1
@@ -528,15 +518,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va
explain (costs off)
select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
- QUERY PLAN
------------------------------------------------
+ QUERY PLAN
+-----------------------------------
HashSetOp Intersect
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Values Scan on "*VALUES*"
- -> Subquery Scan on "*SELECT* 2"
- -> Values Scan on "*VALUES*_1"
-(6 rows)
+ -> Values Scan on "*VALUES*"
+ -> Values Scan on "*VALUES*_1"
+(3 rows)
select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
x
@@ -546,15 +533,12 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from
explain (costs off)
select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
- QUERY PLAN
------------------------------------------------
+ QUERY PLAN
+-----------------------------------
HashSetOp Except
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Values Scan on "*VALUES*"
- -> Subquery Scan on "*SELECT* 2"
- -> Values Scan on "*VALUES*_1"
-(6 rows)
+ -> Values Scan on "*VALUES*"
+ -> Values Scan on "*VALUES*_1"
+(3 rows)
select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
x
@@ -606,17 +590,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) union select x from (va
explain (costs off)
select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
- QUERY PLAN
------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------
SetOp Intersect
-> Sort
- Sort Key: "*SELECT* 1".x
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Values Scan on "*VALUES*"
- -> Subquery Scan on "*SELECT* 2"
- -> Values Scan on "*VALUES*_1"
-(8 rows)
+ Sort Key: "*VALUES*".column1
+ -> Values Scan on "*VALUES*"
+ -> Sort
+ Sort Key: "*VALUES*_1".column1
+ -> Values Scan on "*VALUES*_1"
+(7 rows)
select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from (values (array[1, 2]), (array[1, 4])) _(x);
x
@@ -626,17 +609,16 @@ select x from (values (array[1, 2]), (array[1, 3])) _(x) intersect select x from
explain (costs off)
select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
- QUERY PLAN
------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------
SetOp Except
-> Sort
- Sort Key: "*SELECT* 1".x
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Values Scan on "*VALUES*"
- -> Subquery Scan on "*SELECT* 2"
- -> Values Scan on "*VALUES*_1"
-(8 rows)
+ Sort Key: "*VALUES*".column1
+ -> Values Scan on "*VALUES*"
+ -> Sort
+ Sort Key: "*VALUES*_1".column1
+ -> Values Scan on "*VALUES*_1"
+(7 rows)
select x from (values (array[1, 2]), (array[1, 3])) _(x) except select x from (values (array[1, 2]), (array[1, 4])) _(x);
x
@@ -669,17 +651,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values
explain (costs off)
select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
- QUERY PLAN
------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------
SetOp Intersect
-> Sort
- Sort Key: "*SELECT* 1".x
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Values Scan on "*VALUES*"
- -> Subquery Scan on "*SELECT* 2"
- -> Values Scan on "*VALUES*_1"
-(8 rows)
+ Sort Key: "*VALUES*".column1
+ -> Values Scan on "*VALUES*"
+ -> Sort
+ Sort Key: "*VALUES*_1".column1
+ -> Values Scan on "*VALUES*_1"
+(7 rows)
select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
x
@@ -689,17 +670,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va
explain (costs off)
select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
- QUERY PLAN
------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------
SetOp Except
-> Sort
- Sort Key: "*SELECT* 1".x
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Values Scan on "*VALUES*"
- -> Subquery Scan on "*SELECT* 2"
- -> Values Scan on "*VALUES*_1"
-(8 rows)
+ Sort Key: "*VALUES*".column1
+ -> Values Scan on "*VALUES*"
+ -> Sort
+ Sort Key: "*VALUES*_1".column1
+ -> Values Scan on "*VALUES*_1"
+(7 rows)
select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
x
@@ -777,17 +757,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) union select x from (values
explain (costs off)
select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
- QUERY PLAN
------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------
SetOp Intersect
-> Sort
- Sort Key: "*SELECT* 1".x
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Values Scan on "*VALUES*"
- -> Subquery Scan on "*SELECT* 2"
- -> Values Scan on "*VALUES*_1"
-(8 rows)
+ Sort Key: "*VALUES*".column1
+ -> Values Scan on "*VALUES*"
+ -> Sort
+ Sort Key: "*VALUES*_1".column1
+ -> Values Scan on "*VALUES*_1"
+(7 rows)
select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (values (row(1, 2)), (row(1, 4))) _(x);
x
@@ -797,17 +776,16 @@ select x from (values (row(1, 2)), (row(1, 3))) _(x) intersect select x from (va
explain (costs off)
select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
- QUERY PLAN
------------------------------------------------------
+ QUERY PLAN
+-----------------------------------------
SetOp Except
-> Sort
- Sort Key: "*SELECT* 1".x
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Values Scan on "*VALUES*"
- -> Subquery Scan on "*SELECT* 2"
- -> Values Scan on "*VALUES*_1"
-(8 rows)
+ Sort Key: "*VALUES*".column1
+ -> Values Scan on "*VALUES*"
+ -> Sort
+ Sort Key: "*VALUES*_1".column1
+ -> Values Scan on "*VALUES*_1"
+(7 rows)
select x from (values (row(1, 2)), (row(1, 3))) _(x) except select x from (values (row(1, 2)), (row(1, 4))) _(x);
x
@@ -970,15 +948,12 @@ set enable_sort = false;
-- no means to disable Unique.
explain (costs off)
select from generate_series(1,5) intersect select from generate_series(1,3);
- QUERY PLAN
-----------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------
HashSetOp Intersect
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Function Scan on generate_series
- -> Subquery Scan on "*SELECT* 2"
- -> Function Scan on generate_series generate_series_1
-(6 rows)
+ -> Function Scan on generate_series
+ -> Function Scan on generate_series generate_series_1
+(3 rows)
select from generate_series(1,5) union all select from generate_series(1,3);
--
@@ -1015,15 +990,12 @@ select from generate_series(1,5) union select from generate_series(1,3);
explain (costs off)
select from generate_series(1,5) intersect select from generate_series(1,3);
- QUERY PLAN
-----------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------
SetOp Intersect
- -> Append
- -> Subquery Scan on "*SELECT* 1"
- -> Function Scan on generate_series
- -> Subquery Scan on "*SELECT* 2"
- -> Function Scan on generate_series generate_series_1
-(6 rows)
+ -> Function Scan on generate_series
+ -> Function Scan on generate_series generate_series_1
+(3 rows)
select from generate_series(1,5) union select from generate_series(1,3);
--
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ce33e55bf1..3bd71b78be 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2612,6 +2612,8 @@ SetOpCmd
SetOpPath
SetOpState
SetOpStatePerGroup
+SetOpStatePerGroupData
+SetOpStatePerInput
SetOpStrategy
SetOperation
SetOperationStmt
On Wed, Dec 18, 2024 at 7:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
The slightly annoying thing here is that the attached patch passes the
TupleTableSlotOps as NULL in nodeSetOp.c. Per nodeAppend.c line 186,
Append does not go to much effort to setting a fixed
TupleTableSlotOps. Really it could loop over all the child plans and
check if those have fixed slot types of the same type and then fix its
own resulting slot. For nodeSetOps.c use case, since the planner
(currently) injects the flag into the target list, it'll always
project and use a virtual slot type. It's maybe worth coming back and
adjusting nodeAppend.c so it works a bit harder to fix its slot type.
I think that's likely for another patch, however. Tom is also
currently working on nodeSetOps.c to change how all this works so it
no longer uses the flags method to determine the outer and inner
sides.Yeah, I see no point in putting effort into improving the current
nodeSetOp implementation. There might be a reason to change
nodeAppend as you suggest for other use-cases though.
Should we be concerned about passing a NULL TupleTableSlotOps in
nodeRecursiveUnion.c? The query below triggers the same assert
failure: the slot is expected to be TTSOpsMinimalTuple, but it is
TTSOpsBufferHeapTuple.
create table t (a int);
insert into t values (1), (1);
with recursive cte (a) as (select a from t union select a from cte)
select a from cte;
Thanks
Richard
On Wed, 18 Dec 2024 at 22:29, Richard Guo <guofenglinux@gmail.com> wrote:
Should we be concerned about passing a NULL TupleTableSlotOps in
nodeRecursiveUnion.c?
I made it pass NULL on purpose because the slot type on the recursive
union can be different on the inner and outer sides. Do you see issues
with that?
The query below triggers the same assert
failure: the slot is expected to be TTSOpsMinimalTuple, but it is
TTSOpsBufferHeapTuple.

create table t (a int);
insert into t values (1), (1);

with recursive cte (a) as (select a from t union select a from cte)
select a from cte;
That's a good find. I tested as far back as REL_13_STABLE and it's
failing the same Assert there too.
Maybe we need to backpatch passing NULL instead of &TTSOpsMinimalTuple
to ExecBuildGroupingEqual() in BuildTupleHashTableExt(). Something
like the attached patch.
As far as I see it, there's no downside to this as
ExecComputeSlotInfo() will return true anyway for the
EEOP_INNER_FETCHSOME step in ExecBuildGroupingEqual() due to minimal
tuples needing deformation anyway. For master, we could just do what
Tom proposed above so that any callers that always use virtual slots
can benefit from the elimination of the EEOP_INNER_FETCHSOME step.
David
Attachments:
fix_incorrect_slot_type.patch
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 7233f1e3c0..7f6d7703b4 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -224,9 +224,8 @@ BuildTupleHashTableExt(PlanState *parent,
allow_jit = metacxt != tablecxt;
/* build comparator for all columns */
- /* XXX: should we support non-minimal tuples for the inputslot? */
hashtable->tab_eq_func = ExecBuildGroupingEqual(inputDesc, inputDesc,
- &TTSOpsMinimalTuple, &TTSOpsMinimalTuple,
+ NULL, &TTSOpsMinimalTuple,
numCols,
keyColIdx, eqfuncoids, collations,
allow_jit ? parent : NULL);
On Wed, 18 Dec 2024 at 23:45, David Rowley <dgrowleyml@gmail.com> wrote:
Maybe we need to backpatch passing NULL instead of &TTSOpsMinimalTuple
to ExecBuildGroupingEqual() in BuildTupleHashTableExt(). Something
like the attached patch.
I've attached a more formal patch for this and I've also now done a
bit more research and experimentation as to why we didn't notice this
for so long. It looks like the non-recursive part of the UNION must
use TTSOpsBufferHeapTuple and there must be duplicates. So that
basically means you need to select all columns, otherwise, there'd be
projection and the slot would be virtual. That boils down to, you need
a table without a primary key or any unique constraints, otherwise,
you can't get the duplicate value that's required to trigger the
Assert failure. I hope the proposed commit message is enough to
explain this in enough detail.
Of course, there may be some other path to trigger this using one
of the other users of BuildTupleHashTableExt().
I propose to quickly do a master-only follow-up commit to use the
inputOps instead of NULL in BuildTupleHashTableExt (Basically Tom's
patch from [1])

Sound ok?

David

[1] https://postgr.es/m/2543667.1734483723@sss.pgh.pa.us
Attachments:
v1-0001-Fix-Assert-failure-in-WITH-RECURSIVE-UNION-querie.patch
From f6974dacc76c1372eb80f46b8f5d52e8579887cd Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Thu, 19 Dec 2024 10:39:10 +1300
Subject: [PATCH v1] Fix Assert failure in WITH RECURSIVE UNION queries
If the non-recursive part of a recursive CTE ended up using
TTSOpsBufferHeapTuple as the table slot type, then a duplicate value
could cause an Assert failure in CheckOpSlotCompatibility() when
checking the hash table for the duplicate value. The expected slot type
for the deform step was TTSOpsMinimalTuple so the Assert failed when the
TTSOpsBufferHeapTuple slot was used.
This is a long-standing bug which we likely didn't notice because it
seems much more likely that the non-recursive term would have required
projection and used a TTSOpsVirtual slot, which CheckOpSlotCompatibility
is ok with.
There doesn't seem to be any harm done here other than the Assert
failure. Both TTSOpsMinimalTuple and TTSOpsBufferHeapTuple slot types
require tuple deformation, so the EEOP_*_FETCHSOME ExprState step would
have properly existed in the ExprState.
The solution is to pass NULL for the ExecBuildGroupingEqual's 'lops'
parameter. This means the ExprState's EEOP_*_FETCHSOME step won't
expect a fixed slot type. This makes CheckOpSlotCompatibility() happy as
no checking is performed when the ExprEvalStep is not expecting a fixed
slot type.
Reported-by: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4-8U9q2LAtf8+ghV11zeUReA3AmrYkxzBEv0vKnDxwkKA@mail.gmail.com
Backpatch-through: 13, all supported versions
---
src/backend/executor/execGrouping.c | 3 +--
src/test/regress/expected/with.out | 15 +++++++++++++++
src/test/regress/sql/with.sql | 12 ++++++++++++
3 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 4a8f72305c..a69fa0f8cc 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -236,9 +236,8 @@ BuildTupleHashTableExt(PlanState *parent,
hash_iv);
/* build comparator for all columns */
- /* XXX: should we support non-minimal tuples for the inputslot? */
hashtable->tab_eq_func = ExecBuildGroupingEqual(inputDesc, inputDesc,
- &TTSOpsMinimalTuple, &TTSOpsMinimalTuple,
+ NULL, &TTSOpsMinimalTuple,
numCols,
keyColIdx, eqfuncoids, collations,
allow_jit ? parent : NULL);
diff --git a/src/test/regress/expected/with.out b/src/test/regress/expected/with.out
index ff9754603b..7a51e2eb75 100644
--- a/src/test/regress/expected/with.out
+++ b/src/test/regress/expected/with.out
@@ -656,6 +656,21 @@ SELECT t1.id, t2.path, t2 FROM t AS t1 JOIN t AS t2 ON
16 | {3,7,11,16} | (16,"{3,7,11,16}")
(16 rows)
+CREATE TEMP TABLE duplicates (a INT NOT NULL);
+INSERT INTO duplicates VALUES(1), (1);
+-- Try out a recursive UNION case where the non-recursive part's table slot
+-- uses TTSOpsBufferHeapTuple and contains duplicate rows.
+WITH RECURSIVE cte (a) as (
+ SELECT a FROM duplicates
+ UNION
+ SELECT a FROM cte
+)
+SELECT a FROM cte;
+ a
+---
+ 1
+(1 row)
+
-- test that column statistics from a materialized CTE are available
-- to upper planner (otherwise, we'd get a stupider plan)
explain (costs off)
diff --git a/src/test/regress/sql/with.sql b/src/test/regress/sql/with.sql
index aca7bae6dd..dcdaab5eff 100644
--- a/src/test/regress/sql/with.sql
+++ b/src/test/regress/sql/with.sql
@@ -361,6 +361,18 @@ UNION ALL
SELECT t1.id, t2.path, t2 FROM t AS t1 JOIN t AS t2 ON
(t1.id=t2.id);
+CREATE TEMP TABLE duplicates (a INT NOT NULL);
+INSERT INTO duplicates VALUES(1), (1);
+
+-- Try out a recursive UNION case where the non-recursive part's table slot
+-- uses TTSOpsBufferHeapTuple and contains duplicate rows.
+WITH RECURSIVE cte (a) as (
+ SELECT a FROM duplicates
+ UNION
+ SELECT a FROM cte
+)
+SELECT a FROM cte;
+
-- test that column statistics from a materialized CTE are available
-- to upper planner (otherwise, we'd get a stupider plan)
explain (costs off)
--
2.34.1
David Rowley <dgrowleyml@gmail.com> writes:
On Wed, 18 Dec 2024 at 23:45, David Rowley <dgrowleyml@gmail.com> wrote:
Maybe we need to backpatch passing NULL instead of &TTSOpsMinimalTuple
to ExecBuildGroupingEqual() in BuildTupleHashTableExt(). Something
like the attached patch.
I've attached a more formal patch for this and I've also now done a
bit more research and experimentation as to why we didn't notice this
for so long.
I suspect that another key reason for the lack of reports is that
it's an assertion failure only, with no consequences in production
builds. So ordinary users issuing such a query wouldn't notice.
I propose to quickly do a master-only follow-up commit to use the
inputOps instead of NULL in BuildTupleHashTableExt (Basically Tom's
patch from [1])
LGTM.
regards, tom lane
On Wed, Dec 18, 2024 at 7:45 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 18 Dec 2024 at 22:29, Richard Guo <guofenglinux@gmail.com> wrote:
Should we be concerned about passing a NULL TupleTableSlotOps in
nodeRecursiveUnion.c?

I made it pass NULL on purpose because the slot type on the recursive
union can be different on the inner and outer sides. Do you see issues
with that?
I see. I didn't notice any real issues with that; I was just flagged
by the XXX comment there, which raises the question of whether it's
worth working harder to determine the inputOps.
Thanks
Richard
On Thu, Dec 19, 2024 at 8:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
I propose to quickly do a master-only follow-up commit to use the
inputOps instead of NULL in BuildTupleHashTableExt (Basically Tom's
patch from [1])

LGTM.
+1.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
I see. I didn't notice any real issues with that; I was just flagged
by the XXX comment there, which raises the question of whether it's
worth working harder to determine the inputOps.
I was intending to add some code to my nodeSetop patch to see if
both input plan nodes use the same fixed slot type, and if so
pass that rather than NULL to BuildTupleHashTableExt. Perhaps
nodeRecursiveunion could do the same thing (in which case we
probably ought to abstract that out to a subroutine).
regards, tom lane