Wrong results from Parallel Hash Full Join
I came across $subject and reduced the repro query as below.
create table a (i int);
create table b (i int);
insert into a values (1);
insert into b values (2);
update b set i = 2;
set min_parallel_table_scan_size to 0;
set parallel_tuple_cost to 0;
set parallel_setup_cost to 0;
# explain (costs off) select * from a full join b on a.i = b.i;
                QUERY PLAN
------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Hash Full Join
         Hash Cond: (a.i = b.i)
         ->  Parallel Seq Scan on a
         ->  Parallel Hash
               ->  Parallel Seq Scan on b
(7 rows)
# select * from a full join b on a.i = b.i;
 i | i
---+---
 1 |
(1 row)
Tuple (NULL, 2) is missing from the results.
Thanks
Richard
On Wed, Apr 12, 2023 at 7:36 AM Richard Guo <guofenglinux@gmail.com> wrote:
I came across $subject and reduced the repro query as below.
create table a (i int);
create table b (i int);
insert into a values (1);
insert into b values (2);
update b set i = 2;
set min_parallel_table_scan_size to 0;
set parallel_tuple_cost to 0;
set parallel_setup_cost to 0;
# explain (costs off) select * from a full join b on a.i = b.i;
QUERY PLAN
------------------------------------------
Gather
Workers Planned: 2
-> Parallel Hash Full Join
Hash Cond: (a.i = b.i)
-> Parallel Seq Scan on a
-> Parallel Hash
-> Parallel Seq Scan on b
(7 rows)
# select * from a full join b on a.i = b.i;
i | i
---+---
1 |
(1 row)
Tuple (NULL, 2) is missing from the results.
Thanks so much for reporting this, Richard. This is a fantastic minimal
repro!
So, I looked into this, and it seems that, as you can imagine, the tuple
in b is HOT-updated, resulting in a heap-only tuple.
t_ctid | raw_flags
--------+----------------------------------------------------------------------
(0,2) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED}
(0,2) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}
In ExecParallelScanHashTableForUnmatched() we don't emit the
NULL-extended tuple because HeapTupleHeaderHasMatch() is true for our
desired tuple.
	while (hashTuple != NULL)
	{
		if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(hashTuple)))
		{
HeapTupleHeaderHasMatch() checks if HEAP_TUPLE_HAS_MATCH is set.
In htup_details.h, you will see that HEAP_TUPLE_HAS_MATCH is defined as
HEAP_ONLY_TUPLE
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
* only used in tuples that are in the hash table, and those don't need
* any visibility information, so we can overlay it on a visibility flag
* instead of using up a dedicated bit.
*/
#define HEAP_TUPLE_HAS_MATCH HEAP_ONLY_TUPLE /* tuple has a join match */
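For context, the companion macros in htup_details.h all operate on
t_infomask2, so the match bit and the heap-only-tuple bit really are the
same bit. Roughly (paraphrased, not a verbatim quote of the header):

	#define HeapTupleHeaderHasMatch(tup) \
		(((tup)->t_infomask2 & HEAP_TUPLE_HAS_MATCH) != 0)

	#define HeapTupleHeaderSetMatch(tup) \
		((tup)->t_infomask2 |= HEAP_TUPLE_HAS_MATCH)

	#define HeapTupleHeaderClearMatch(tup) \
		((tup)->t_infomask2 &= ~HEAP_TUPLE_HAS_MATCH)

So a heap-only tuple copied into the hash table without a
HeapTupleHeaderClearMatch() call starts out looking as if it had already
found a join match.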
If you redefine HEAP_TUPLE_HAS_MATCH as something that isn't already
used, say 0x1800, the query returns correct results.
                QUERY PLAN
------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Hash Full Join
         Hash Cond: (a.i = b.i)
         ->  Parallel Seq Scan on a
         ->  Parallel Hash
               ->  Parallel Seq Scan on b
(7 rows)
 i | i
---+---
 1 |
   | 2
(2 rows)
The question is, why does this only happen for a parallel full hash join?
The non-parallel case:
postgres=# explain (costs off) select * from a full join b on a.i = b.i;
        QUERY PLAN
---------------------------
 Hash Full Join
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a
   ->  Hash
         ->  Seq Scan on b
(5 rows)
postgres=# select * from a full join b on a.i = b.i;
 i | i
---+---
 1 |
   | 2
(2 rows)
I imagine it has something to do with what tuples are put in the
parallel hashtable. I am about to investigate that but just wanted to
share what I had so far.
- Melanie
Hi,
On 2023-04-12 10:57:17 -0400, Melanie Plageman wrote:
HeapTupleHeaderHasMatch() checks if HEAP_TUPLE_HAS_MATCH is set.
In htup_details.h, you will see that HEAP_TUPLE_HAS_MATCH is defined as
HEAP_ONLY_TUPLE
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
* only used in tuples that are in the hash table, and those don't need
* any visibility information, so we can overlay it on a visibility flag
* instead of using up a dedicated bit.
*/
#define HEAP_TUPLE_HAS_MATCH HEAP_ONLY_TUPLE /* tuple has a join match */
If you redefine HEAP_TUPLE_HAS_MATCH as something that isn't already
used, say 0x1800, the query returns correct results.
[...]
The question is, why does this only happen for a parallel full hash join?
I'd guess that PHJ code is missing a HeapTupleHeaderClearMatch() somewhere,
but the non-parallel case isn't.
Greetings,
Andres Freund
On Wed, Apr 12, 2023 at 2:14 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-04-12 10:57:17 -0400, Melanie Plageman wrote:
HeapTupleHeaderHasMatch() checks if HEAP_TUPLE_HAS_MATCH is set.
In htup_details.h, you will see that HEAP_TUPLE_HAS_MATCH is defined as
HEAP_ONLY_TUPLE
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
* only used in tuples that are in the hash table, and those don't need
* any visibility information, so we can overlay it on a visibility flag
* instead of using up a dedicated bit.
*/
#define HEAP_TUPLE_HAS_MATCH HEAP_ONLY_TUPLE /* tuple has a join match */
If you redefine HEAP_TUPLE_HAS_MATCH as something that isn't already
used, say 0x1800, the query returns correct results.
[...]
The question is, why does this only happen for a parallel full hash join?
I'd guess that PHJ code is missing a HeapTupleHeaderClearMatch() somewhere,
but the non-parallel case isn't.
Indeed. Thanks! This diff fixes the case Richard provided.
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a45bd3a315..54c06c5eb3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1724,6 +1724,7 @@ retry:
/* Store the hash value in the HashJoinTuple header. */
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
+ HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
/* Push it onto the front of the bucket's list */
ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
I will propose a patch that includes this change and a test.
I just want to convince myself that ExecParallelHashTableInsertCurrentBatch()
covers the non-batch 0 cases and we don't need to add something to
sts_puttuple().
- Melanie
On Wed, Apr 12, 2023 at 2:59 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Wed, Apr 12, 2023 at 2:14 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-04-12 10:57:17 -0400, Melanie Plageman wrote:
HeapTupleHeaderHasMatch() checks if HEAP_TUPLE_HAS_MATCH is set.
In htup_details.h, you will see that HEAP_TUPLE_HAS_MATCH is defined as
HEAP_ONLY_TUPLE
/*
* HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
* only used in tuples that are in the hash table, and those don't need
* any visibility information, so we can overlay it on a visibility flag
* instead of using up a dedicated bit.
*/
#define HEAP_TUPLE_HAS_MATCH HEAP_ONLY_TUPLE /* tuple has a join match */
If you redefine HEAP_TUPLE_HAS_MATCH as something that isn't already
used, say 0x1800, the query returns correct results.
[...]
The question is, why does this only happen for a parallel full hash join?
I'd guess that PHJ code is missing a HeapTupleHeaderClearMatch() somewhere,
but the non-parallel case isn't.
Indeed. Thanks! This diff fixes the case Richard provided.
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a45bd3a315..54c06c5eb3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1724,6 +1724,7 @@ retry:
/* Store the hash value in the HashJoinTuple header. */
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
+ HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
/* Push it onto the front of the bucket's list */
ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
I will propose a patch that includes this change and a test.
I just want to convince myself that ExecParallelHashTableInsertCurrentBatch()
covers the non-batch 0 cases and we don't need to add something to
sts_puttuple().
So, indeed, tuples in batches after batch 0 already had their match bit
cleared by ExecParallelHashTableInsertCurrentBatch().
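For reference, the relevant part of ExecParallelHashTableInsertCurrentBatch()
looks roughly like this (paraphrased sketch of nodeHash.c, not a verbatim
quote):

	hashTuple = ExecParallelHashTupleAlloc(hashtable,
										   HJTUPLE_OVERHEAD + tuple->t_len,
										   &shared);
	hashTuple->hashvalue = hashvalue;
	memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
	/* match bit is already reset here when loading tuples from a batch file */
	HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
	ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
							  hashTuple, shared);

so only the single-batch / batch-0 insert path, ExecParallelHashTableInsert(),
was missing the call.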
Attached patch includes the fix for ExecParallelHashTableInsert() as
well as a test. I toyed with adapting one of the existing parallel full
hash join tests to cover this case, however, I think Richard's repro is
much more clear. Maybe it is worth throwing in a few updates to the
tables in the existing queries to provide coverage for the other
HeapTupleHeaderClearMatch() calls in the code, though.
- Melanie
Attachments:
v1-0001-Reset-PHJ-tuple-match-bit-upon-hashtable-insert.patch (text/x-patch)
From 6aa53dae62e8a8a34166ed6f28ee621cf727c004 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 12 Apr 2023 17:16:28 -0400
Subject: [PATCH v1] Reset PHJ tuple match bit upon hashtable insert
We reuse a single bit to indicate that a hash join tuple is heap-only
and that it has a match during hash join execution. Serial hash join and
parallel multi-batch hash join cleared this bit upon inserting the tuple
into the hashtable. Single batch parallel hash join and batch 0 of
unexpected multi-batch hash joins forgot to do this. Fix that. We didn't
notice before because parallel hash join wasn't using the match bits for
tuples in the hashtable since right and full parallel hash join were
unsupported.
Reported-by: Richard Guo
Discussion: https://postgr.es/m/flat/CAMbWs48Nde1Mv%3DBJv6_vXmRKHMuHZm2Q_g4F6Z3_pn%2B3EV6BGQ%40mail.gmail.com
---
src/backend/executor/nodeHash.c | 1 +
src/test/regress/expected/join_hash.out | 20 ++++++++++++++++++++
src/test/regress/sql/join_hash.sql | 17 ++++++++++++++++-
3 files changed, 37 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a45bd3a315..54c06c5eb3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1724,6 +1724,7 @@ retry:
/* Store the hash value in the HashJoinTuple header. */
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
+ HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
/* Push it onto the front of the bucket's list */
ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 09376514bb..c4f4e2ee54 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -955,6 +955,26 @@ $$);
(1 row)
rollback to settings;
+-- Ensure that hash join tuple match bits have been cleared before putting them
+-- into the hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
rollback;
-- Verify that hash key expressions reference the correct
-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 179e94941c..86917ee13b 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -506,8 +506,23 @@ $$
$$);
rollback to settings;
-rollback;
+-- Ensure that hash join tuple match bits have been cleared before putting them
+-- into the hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
-- Verify that hash key expressions reference the correct
-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
--
2.37.2
On Thu, Apr 13, 2023 at 9:48 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Attached patch includes the fix for ExecParallelHashTableInsert() as
well as a test. I toyed with adapting one of the existing parallel full
hash join tests to cover this case, however, I think Richard's repro is
much more clear. Maybe it is worth throwing in a few updates to the
tables in the existing queries to provide coverage for the other
HeapTupleHeaderClearMatch() calls in the code, though.
Oof. Analysis and code LGTM.
I thought about the way non-parallel HJ also clears the match bits
when re-using the hash table for rescans. PHJ doesn't keep hash
tables across rescans. (There's no fundamental reason why it
couldn't, but there was some complication and it seemed absurd to have
NestLoop over Gather over PHJ, forking a new set of workers for every
tuple, so I didn't implement that in the original PHJ.) But... there
is something a little odd about the code in
ExecHashTableResetMatchFlags(), or the fact that we appear to be
calling it: it's using the unshared union member unconditionally,
which wouldn't actually work for PHJ (there should be a variant of
that function with Parallel in its name if we ever want that to work).
That's not a bug AFAICT, as in fact we don't actually call it--it
should be unreachable because the hash table should be gone when we
rescan--but it's confusing. I'm wondering if we should put in
something explicit about that, maybe a comment and an assertion in
ExecReScanHashJoin().
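Something like this, perhaps (just a sketch, assuming the existing
hj_HashTable and parallel_state fields; not a tested patch):

	/*
	 * We only ever reuse the hash table across rescans in the
	 * non-parallel case, so ExecHashTableResetMatchFlags()'s use of the
	 * unshared bucket array is safe.
	 */
	if (node->hj_HashTable != NULL)
		Assert(node->hj_HashTable->parallel_state == NULL);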
+-- Ensure that hash join tuple match bits have been cleared before putting them
+-- into the hashtable.
Could you mention that the match flag steals a bit from the HOT flag,
ie *why* we're testing a join after an update? And if we're going to
exercise/test that case, should we do the non-parallel version too?
For the commit message, I think it's a good idea to use something like
"Fix ..." for the headline of bug fix commits to make that clearer,
and to add something like "oversight in commit XYZ" in the body, just
to help people connect the dots. (Yeah, I know I failed to reference
the delinquent commit in the recent assertion-removal commit, my bad.)
I think "Discussion:" footers are supposed to use
/messages/by-id/XXX shortened URLs.
On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Apr 13, 2023 at 9:48 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Attached patch includes the fix for ExecParallelHashTableInsert() as
well as a test. I toyed with adapting one of the existing parallel full
hash join tests to cover this case, however, I think Richard's repro is
much more clear. Maybe it is worth throwing in a few updates to the
tables in the existing queries to provide coverage for the other
HeapTupleHeaderClearMatch() calls in the code, though.
Oof. Analysis and code LGTM.
I thought about the way non-parallel HJ also clears the match bits
when re-using the hash table for rescans. PHJ doesn't keep hash
tables across rescans. (There's no fundamental reason why it
couldn't, but there was some complication and it seemed absurd to have
NestLoop over Gather over PHJ, forking a new set of workers for every
tuple, so I didn't implement that in the original PHJ.) But... there
is something a little odd about the code in
ExecHashTableResetMatchFlags(), or the fact that we appear to be
calling it: it's using the unshared union member unconditionally,
which wouldn't actually work for PHJ (there should be a variant of
that function with Parallel in its name if we ever want that to work).
That's not a bug AFAICT, as in fact we don't actually call it--it
should be unreachable because the hash table should be gone when we
rescan--but it's confusing. I'm wondering if we should put in
something explicit about that, maybe a comment and an assertion in
ExecReScanHashJoin().
An assert about it not being a parallel hash join? I support this.
+-- Ensure that hash join tuple match bits have been cleared before putting them
+-- into the hashtable.
Could you mention that the match flag steals a bit from the HOT flag,
ie *why* we're testing a join after an update?
v2 attached has some wordsmithing along these lines.
And if we're going to
exercise/test that case, should we do the non-parallel version too?
I've added this. I thought if we were adding the serial case, we might
as well add the multi-batch case as well. However, that proved a bit
more challenging. We can get a HOT tuple in one of the existing tables
with no issues. Doing this and then deleting the reset match bit code
doesn't cause any of the tests to fail, however, because we use this
expression as the join condition when we want to emit NULL-extended
unmatched tuples.
select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
I don't think we want to add yet another time-consuming test to this
test file. So, I was trying to decide if it was worth changing these
existing tests so that they would fail when the match bit wasn't reset.
I'm not sure.
For the commit message, I think it's a good idea to use something like
"Fix ..." for the headline of bug fix commits to make that clearer,
and to add something like "oversight in commit XYZ" in the body, just
to help people connect the dots. (Yeah, I know I failed to reference
the delinquent commit in the recent assertion-removal commit, my bad.)
I've made these edits and tried to improve the commit message clarity in
general.
I think "Discussion:" footers are supposed to use
/messages/by-id/XXX shortened URLs.
Hmm. Is the problem with mine that I included "flat"? Because I did use
postgr.es/m format. The message id is unfortunately long, but I believe
that is on google and not me.
- Melanie
Attachments:
v2-0001-Fix-PHJ-tuple-match-bit-management.patch (text/x-patch)
From ee24229a1eeca67dfd91e162502efe2abfead870 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 12 Apr 2023 17:16:28 -0400
Subject: [PATCH v2] Fix PHJ tuple match bit management
Hash join tuples reuse the HOT status bit to indicate match status
during hash join execution. Correct reuse requires clearing the bit in
all tuples. Serial hash join and parallel multi-batch hash join do so
upon inserting the tuple into the hashtable. Single batch parallel hash
join and batch 0 of unexpected multi-batch hash joins forgot to do this.
It hadn't come up before because hashtable tuple match bits are only
used for right and full outer joins and parallel ROJ and FOJ were
unsupported. 11c2d6fdf5 introduced support for parallel ROJ/FOJ but
neglected to ensure the match bits were reset.
Reported-by: Richard Guo
Discussion: https://postgr.es/m/flat/CAMbWs48Nde1Mv%3DBJv6_vXmRKHMuHZm2Q_g4F6Z3_pn%2B3EV6BGQ%40mail.gmail.com
---
src/backend/executor/nodeHash.c | 1 +
src/test/regress/expected/join_hash.out | 37 +++++++++++++++++++++++++
src/test/regress/sql/join_hash.sql | 28 ++++++++++++++++++-
3 files changed, 65 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a45bd3a315..54c06c5eb3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1724,6 +1724,7 @@ retry:
/* Store the hash value in the HashJoinTuple header. */
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
+ HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
/* Push it onto the front of the bucket's list */
ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 09376514bb..e892e7cccb 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -955,6 +955,43 @@ $$);
(1 row)
rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
rollback;
-- Verify that hash key expressions reference the correct
-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 179e94941c..06bab7a4c7 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -506,8 +506,34 @@ $$
$$);
rollback to settings;
-rollback;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
-- Verify that hash key expressions reference the correct
-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
--
2.37.2
On Thu, Apr 13, 2023 at 12:31 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I think "Discussion:" footers are supposed to use
/messages/by-id/XXX shortened URLs.
Hmm. Is the problem with mine that I included "flat"? Because I did use
postgr.es/m format. The message id is unfortunately long, but I believe
that is on google and not me.
For some reason I thought we weren't supposed to use the flat thing,
but it looks like I'm just wrong and people do that all the time so I
take that back.
Pushed. Thanks Richard and Melanie.
On Thu, Apr 13, 2023 at 7:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
For some reason I thought we weren't supposed to use the flat thing,
but it looks like I'm just wrong and people do that all the time so I
take that back.
Pushed. Thanks Richard and Melanie.
I tend to use http://postgr.es/m/ or https://postgr.es/m/ just to keep
the URL a bit shorter, and also because I like to point anyone reading
the commit log to the particular message that I think is most relevant
rather than to the thread as a whole. But I don't think there's any
hard-and-fast rule that committers have to do it one way rather than
another.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Apr 12, 2023 at 08:31:26PM -0400, Melanie Plageman wrote:
On Wed, Apr 12, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
And if we're going to
exercise/test that case, should we do the non-parallel version too?
I've added this. I thought if we were adding the serial case, we might
as well add the multi-batch case as well. However, that proved a bit
more challenging. We can get a HOT tuple in one of the existing tables
with no issues. Doing this and then deleting the reset match bit code
doesn't cause any of the tests to fail, however, because we use this
expression as the join condition when we want to emit NULL-extended
unmatched tuples.
select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
I don't think we want to add yet another time-consuming test to this
test file. So, I was trying to decide if it was worth changing these
existing tests so that they would fail when the match bit wasn't reset.
I'm not sure.
I couldn't stop thinking about how my explanation for why this test
didn't fail sounded wrong.
After some further investigation, I found that the real reason that the
HOT bit is already cleared in the tuples inserted into the hashtable for
this query is that the tuple descriptor for the relation "simple" and
the target list for the scan node are not identical (because we only
need to retain a single column from simple in order to eventually do
count(*)), so we make a new virtual tuple and build projection info for
the scan node. The virtual tuple doesn't have the HOT bit set anymore
(the buffer heap tuple would have had it). So we couldn't make a test of
the match-bit-clearing code fail this way.
Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want to
do this because of the excessive output this would produce. I wondered
if there was a pattern in the tests for getting around this. But,
perhaps we don't care enough to cover this code.
- Melanie
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want to
do this because of the excessive output this would produce. I wondered
if there was a pattern in the tests for getting around this.
You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).
So you'd have to filter its output with a function (like the functions
that exist in a few places for similar purpose).
--
Justin
Hi,
On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want to
do this because of the excessive output this would produce. I wondered
if there was a pattern in the tests for getting around this.
You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).
I think with sufficient options it's not machine specific. We have a bunch of
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
in our tests.
Greetings,
Andres Freund
On Wed, Apr 19, 2023 at 12:20:51PM -0700, Andres Freund wrote:
On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want to
do this because of the excessive output this would produce. I wondered
if there was a pattern in the tests for getting around this.
You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).
I think with sufficient options it's not machine specific.
It *can* be machine specific depending on the node type..
In particular, for parallel workers, it shows "Workers Launched: ..",
which can vary even across executions on the same machine. And don't
forget about "loops=".
Plus:
src/backend/commands/explain.c: "Buckets: %d Batches: %d Memory Usage: %ldkB\n",
We have a bunch of
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
in our tests.
There's 81 uses of "timing off", out of a total of ~1600 explains. Most
of them are in partition_prune.sql. explain analyze is barely used.
I sent a patch to elide the machine-specific parts, which would make it
easier to use. But there was no interest.
--
Justin
On Wed, Apr 19, 2023 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want to
do this because of the excessive output this would produce. I wondered
if there was a pattern in the tests for getting around this.
You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).
I think with sufficient options it's not machine specific. We have a bunch
of
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
in our tests.
Cool. Yea, so ultimately these options are almost enough but memory
usage changes from execution to execution. There are some tests which do
regexp_replace() on the memory usage part of the EXPLAIN ANALYZE output
to allow us to still compare the plans. However, I figured if I was
already going to go to the trouble of using regexp_replace(), I might as
well write a function that returns the "Actual Rows" field from the
EXPLAIN ANALYZE output.
The attached patch does that. I admittedly mostly copy-pasted the
plpgsql function from similar examples in other tests, and I suspect it
may be overkill and also poorly written.
The nice thing about this approach is that we can modify some of the
existing tests in join_hash.sql to use this function and cover the code
to reset the matchbit for serial hashjoin, single batch parallel
hashjoin, and all batches of parallel multi-batch hashjoin without any
additional queries. (I'll leave testing match bit resetting with the
skew hashtable and match bit resetting in case of a rescan for another
day.)
I was able to delete the tests added in 558c9d75fe, as they became
redundant.
I wonder if any other tests are in need of an EXPLAIN (ANALYZE,
MEMORY_USAGE OFF) option? Perhaps it is quite unusual to only require a
deterministic field like 'Actual Rows'. If we had that option we could
also remove the extra EXPLAIN invocations before the actual query
executions.
- Melanie
Attachments:
v1-0001-Test-multi-batch-PHJ-match-bit-initialization.patch (text/x-patch)
From a8b1d29546634afcf5c1cf63e889a2155ac2917b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 19 Apr 2023 20:09:37 -0400
Subject: [PATCH v1] Test multi-batch PHJ match bit initialization
558c9d75 fixed a bug with initialization of the hash join match status
bit and added test coverage for serial hash join and single batch
parallel hash join. Add a test for multi-batch parallel hash join.
To test this with the existing relations in the join_hash test file,
this commit modifies some of the test queries to use SELECT * instead of
SELECT COUNT(*), which was needed to ensure that the tuples being
inserted into the hashtable had not already been made into virtual
tuples and lost their HOT status bit.
These modifications made the original tests introduced in 558c9d75
redundant.
Discussion: https://postgr.es/m/ZEAh6CewmARbDVuN%40telsasoft.com
---
src/test/regress/expected/join_hash.out | 134 +++++++++++-------------
src/test/regress/sql/join_hash.sql | 72 +++++++------
2 files changed, 100 insertions(+), 106 deletions(-)
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index e892e7cccb..5be0a4274b 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -49,10 +49,35 @@ begin
end loop;
end;
$$;
+-- Return the number of actual rows returned by a query. This is useful when a
+-- count(*) would not execute the desired codepath but a SELECT * would produce
+-- an excessive number of rows.
+create or replace function hj_actual_rows(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+ elements jsonb;
+begin
+ execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+ if jsonb_array_length(elements) > 1 then
+ raise exception 'Cannot return actual rows from more than one plan';
+ end if;
+ return elements->0->'Plan'->'Actual Rows';
+end;
+$$;
-- Make a simple relation with well distributed keys and correctly
-- estimated size.
-create table simple as
- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+create table simple (id int, value text);
+insert into simple values (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable. To test this, update simple, creating a HOT tuple. If this status
+-- bit isn't cleared, we won't correctly emit the NULL-extended unmatching
+-- tuple in full hash join.
+update simple set id = 1;
+insert into simple select generate_series(2, 20000), 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
alter table simple set (parallel_workers = 2);
analyze simple;
-- Make a relation whose size we will under-estimate. We want stats
@@ -273,6 +298,7 @@ set local max_parallel_workers_per_gather = 2;
set local work_mem = '192kB';
set local hash_mem_multiplier = 1.0;
set local enable_parallel_hash = on;
+set local parallel_tuple_cost = 0;
explain (costs off)
select count(*) from simple r join simple s using (id);
QUERY PLAN
@@ -305,10 +331,11 @@ $$);
(1 row)
-- parallel full multi-batch hash join
-select count(*) from simple r full outer join simple s using (id);
- count
--------
- 20000
+select hj_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
+ hj_actual_rows
+----------------
+ 40000
(1 row)
rollback to settings;
@@ -844,20 +871,20 @@ rollback to settings;
savepoint settings;
set local max_parallel_workers_per_gather = 0;
explain (costs off)
- select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
- QUERY PLAN
-----------------------------------------
- Aggregate
- -> Hash Full Join
- Hash Cond: ((0 - s.id) = r.id)
- -> Seq Scan on simple s
- -> Hash
- -> Seq Scan on simple r
-(6 rows)
+ select * from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------
+ Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(5 rows)
-select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
- count
--------
+select hj_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
+ hj_actual_rows
+----------------
40000
(1 row)
@@ -888,24 +915,24 @@ rollback to settings;
-- parallelism is possible with parallel-aware full hash join
savepoint settings;
set local max_parallel_workers_per_gather = 2;
+set local parallel_tuple_cost = 0;
explain (costs off)
- select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
- QUERY PLAN
--------------------------------------------------------------
- Finalize Aggregate
- -> Gather
- Workers Planned: 2
- -> Partial Aggregate
- -> Parallel Hash Full Join
- Hash Cond: ((0 - s.id) = r.id)
- -> Parallel Seq Scan on simple s
- -> Parallel Hash
- -> Parallel Seq Scan on simple r
-(9 rows)
+ select * from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------
+ Gather
+ Workers Planned: 2
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(7 rows)
-select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
- count
--------
+select hj_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
+ hj_actual_rows
+----------------
40000
(1 row)
@@ -955,43 +982,6 @@ $$);
(1 row)
rollback to settings;
--- Hash join reuses the HOT status bit to indicate match status. This can only
--- be guaranteed to produce correct results if all the hash join tuple match
--- bits are reset before reuse. This is done upon loading them into the
--- hashtable.
-SAVEPOINT settings;
-SET enable_parallel_hash = on;
-SET min_parallel_table_scan_size = 0;
-SET parallel_setup_cost = 0;
-SET parallel_tuple_cost = 0;
-CREATE TABLE hjtest_matchbits_t1(id int);
-CREATE TABLE hjtest_matchbits_t2(id int);
-INSERT INTO hjtest_matchbits_t1 VALUES (1);
-INSERT INTO hjtest_matchbits_t2 VALUES (2);
--- Update should create a HOT tuple. If this status bit isn't cleared, we won't
--- correctly emit the NULL-extended unmatching tuple in full hash join.
-UPDATE hjtest_matchbits_t2 set id = 2;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
- id | id
-----+----
- 1 |
- | 2
-(2 rows)
-
--- Test serial full hash join.
--- Resetting parallel_setup_cost should force a serial plan.
--- Just to be safe, however, set enable_parallel_hash to off, as parallel full
--- hash joins are only supported with shared hashtables.
-RESET parallel_setup_cost;
-SET enable_parallel_hash = off;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
- id | id
-----+----
- 1 |
- | 2
-(2 rows)
-
-ROLLBACK TO settings;
rollback;
-- Verify that hash key expressions reference the correct
-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 06bab7a4c7..c9454c302c 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -53,10 +53,36 @@ begin
end;
$$;
+-- Return the number of actual rows returned by a query. This is useful when a
+-- count(*) would not execute the desired codepath but a SELECT * would produce
+-- an excessive number of rows.
+create or replace function hj_actual_rows(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+ elements jsonb;
+begin
+ execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+ if jsonb_array_length(elements) > 1 then
+ raise exception 'Cannot return actual rows from more than one plan';
+ end if;
+ return elements->0->'Plan'->'Actual Rows';
+end;
+$$;
+
-- Make a simple relation with well distributed keys and correctly
-- estimated size.
-create table simple as
- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+create table simple (id int, value text);
+insert into simple values (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable. To test this, update simple, creating a HOT tuple. If this status
+-- bit isn't cleared, we won't correctly emit the NULL-extended unmatching
+-- tuple in full hash join.
+update simple set id = 1;
+insert into simple select generate_series(2, 20000), 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
alter table simple set (parallel_workers = 2);
analyze simple;
@@ -179,6 +205,7 @@ set local max_parallel_workers_per_gather = 2;
set local work_mem = '192kB';
set local hash_mem_multiplier = 1.0;
set local enable_parallel_hash = on;
+set local parallel_tuple_cost = 0;
explain (costs off)
select count(*) from simple r join simple s using (id);
select count(*) from simple r join simple s using (id);
@@ -188,7 +215,8 @@ $$
select count(*) from simple r join simple s using (id);
$$);
-- parallel full multi-batch hash join
-select count(*) from simple r full outer join simple s using (id);
+select hj_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
rollback to settings;
-- The "bad" case: during execution we need to increase number of
@@ -460,8 +488,9 @@ rollback to settings;
savepoint settings;
set local max_parallel_workers_per_gather = 0;
explain (costs off)
- select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
-select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ select * from simple r full outer join simple s on (r.id = 0 - s.id);
+select hj_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
rollback to settings;
-- parallelism not possible with parallel-oblivious full hash join
@@ -476,9 +505,11 @@ rollback to settings;
-- parallelism is possible with parallel-aware full hash join
savepoint settings;
set local max_parallel_workers_per_gather = 2;
+set local parallel_tuple_cost = 0;
explain (costs off)
- select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
-select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ select * from simple r full outer join simple s on (r.id = 0 - s.id);
+select hj_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
rollback to settings;
@@ -506,33 +537,6 @@ $$
$$);
rollback to settings;
-
--- Hash join reuses the HOT status bit to indicate match status. This can only
--- be guaranteed to produce correct results if all the hash join tuple match
--- bits are reset before reuse. This is done upon loading them into the
--- hashtable.
-SAVEPOINT settings;
-SET enable_parallel_hash = on;
-SET min_parallel_table_scan_size = 0;
-SET parallel_setup_cost = 0;
-SET parallel_tuple_cost = 0;
-CREATE TABLE hjtest_matchbits_t1(id int);
-CREATE TABLE hjtest_matchbits_t2(id int);
-INSERT INTO hjtest_matchbits_t1 VALUES (1);
-INSERT INTO hjtest_matchbits_t2 VALUES (2);
--- Update should create a HOT tuple. If this status bit isn't cleared, we won't
--- correctly emit the NULL-extended unmatching tuple in full hash join.
-UPDATE hjtest_matchbits_t2 set id = 2;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
--- Test serial full hash join.
--- Resetting parallel_setup_cost should force a serial plan.
--- Just to be safe, however, set enable_parallel_hash to off, as parallel full
--- hash joins are only supported with shared hashtables.
-RESET parallel_setup_cost;
-SET enable_parallel_hash = off;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
-ROLLBACK TO settings;
-
rollback;
-- Verify that hash key expressions reference the correct
--
2.37.2
On Wed, Apr 19, 2023 at 8:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Apr 19, 2023 at 12:20:51PM -0700, Andres Freund wrote:
On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want to
do this because of the excessive output this would produce. I wondered
if there was a pattern in the tests for getting around this.
You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).
I think with sufficient options it's not machine specific.
It *can* be machine specific depending on the node type..
In particular, for parallel workers, it shows "Workers Launched: ..",
which can vary even across executions on the same machine. And don't
forget about "loops=".
Plus:
src/backend/commands/explain.c: "Buckets: %d Batches: %d Memory Usage: %ldkB\n",
We have a bunch of
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
in our tests.
There's 81 uses of "timing off", out of a total of ~1600 explains. Most
of them are in partition_prune.sql. explain analyze is barely used.
I sent a patch to elide the machine-specific parts, which would make it
easier to use. But there was no interest.
While I don't know about other use cases, I would have used that here.
Do you still have that patch laying around? I'd be interested to at
least review it.
- Melanie
On Wed, Apr 19, 2023 at 8:43 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Wed, Apr 19, 2023 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want to
do this because of the excessive output this would produce. I wondered
if there was a pattern in the tests for getting around this.
You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).
I think with sufficient options it's not machine specific. We have a bunch of
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
in our tests.
Cool. Yea, so ultimately these options are almost enough but memory
usage changes from execution to execution. There are some tests which do
regexp_replace() on the memory usage part of the EXPLAIN ANALYZE output
to allow us to still compare the plans. However, I figured if I was
already going to go to the trouble of using regexp_replace(), I might as
well write a function that returns the "Actual Rows" field from the
EXPLAIN ANALYZE output.
The attached patch does that. I admittedly mostly copy-pasted the
plpgsql function from similar examples in other tests, and I suspect it
may be overkill and also poorly written.
I renamed the function to join_hash_actual_rows to avoid potentially
affecting other tests. Nothing about the function is specific to a hash
join plan, so I think it is more clear to prefix the function with the
test file name. v2 attached.
- Melanie
Attachments:
v2-0001-Test-multi-batch-PHJ-match-bit-initialization.patch (text/x-patch)
From d8f6946cf127092913d8f1b7a9741b97f3215bbd Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 19 Apr 2023 20:09:37 -0400
Subject: [PATCH v2] Test multi-batch PHJ match bit initialization
558c9d75 fixed a bug with initialization of the hash join match status
bit and added test coverage for serial hash join and single batch
parallel hash join. Add a test for multi-batch parallel hash join.
To test this with the existing relations in the join_hash test file,
this commit modifies some of the test queries to use SELECT * instead of
SELECT COUNT(*), which was needed to ensure that the tuples being
inserted into the hashtable had not already been made into virtual
tuples and lost their HOT status bit.
These modifications made the original tests introduced in 558c9d75
redundant.
Discussion: https://postgr.es/m/ZEAh6CewmARbDVuN%40telsasoft.com
---
src/test/regress/expected/join_hash.out | 134 +++++++++++-------------
src/test/regress/sql/join_hash.sql | 72 +++++++------
2 files changed, 100 insertions(+), 106 deletions(-)
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index e892e7cccb..d4a70f6754 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -49,10 +49,35 @@ begin
end loop;
end;
$$;
+-- Return the number of actual rows returned by a query. This is useful when a
+-- count(*) would not execute the desired codepath but a SELECT * would produce
+-- an excessive number of rows.
+create or replace function join_hash_actual_rows(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+ elements jsonb;
+begin
+ execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+ if jsonb_array_length(elements) > 1 then
+ raise exception 'Cannot return actual rows from more than one plan';
+ end if;
+ return elements->0->'Plan'->'Actual Rows';
+end;
+$$;
-- Make a simple relation with well distributed keys and correctly
-- estimated size.
-create table simple as
- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+create table simple (id int, value text);
+insert into simple values (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable. To test this, update simple, creating a HOT tuple. If this status
+-- bit isn't cleared, we won't correctly emit the NULL-extended unmatching
+-- tuple in full hash join.
+update simple set id = 1;
+insert into simple select generate_series(2, 20000), 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
alter table simple set (parallel_workers = 2);
analyze simple;
-- Make a relation whose size we will under-estimate. We want stats
@@ -273,6 +298,7 @@ set local max_parallel_workers_per_gather = 2;
set local work_mem = '192kB';
set local hash_mem_multiplier = 1.0;
set local enable_parallel_hash = on;
+set local parallel_tuple_cost = 0;
explain (costs off)
select count(*) from simple r join simple s using (id);
QUERY PLAN
@@ -305,10 +331,11 @@ $$);
(1 row)
-- parallel full multi-batch hash join
-select count(*) from simple r full outer join simple s using (id);
- count
--------
- 20000
+select join_hash_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
+ join_hash_actual_rows
+-----------------------
+ 40000
(1 row)
rollback to settings;
@@ -844,20 +871,20 @@ rollback to settings;
savepoint settings;
set local max_parallel_workers_per_gather = 0;
explain (costs off)
- select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
- QUERY PLAN
-----------------------------------------
- Aggregate
- -> Hash Full Join
- Hash Cond: ((0 - s.id) = r.id)
- -> Seq Scan on simple s
- -> Hash
- -> Seq Scan on simple r
-(6 rows)
+ select * from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------
+ Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(5 rows)
-select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
- count
--------
+select join_hash_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
+ join_hash_actual_rows
+-----------------------
40000
(1 row)
@@ -888,24 +915,24 @@ rollback to settings;
-- parallelism is possible with parallel-aware full hash join
savepoint settings;
set local max_parallel_workers_per_gather = 2;
+set local parallel_tuple_cost = 0;
explain (costs off)
- select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
- QUERY PLAN
--------------------------------------------------------------
- Finalize Aggregate
- -> Gather
- Workers Planned: 2
- -> Partial Aggregate
- -> Parallel Hash Full Join
- Hash Cond: ((0 - s.id) = r.id)
- -> Parallel Seq Scan on simple s
- -> Parallel Hash
- -> Parallel Seq Scan on simple r
-(9 rows)
+ select * from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------
+ Gather
+ Workers Planned: 2
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(7 rows)
-select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
- count
--------
+select join_hash_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
+ join_hash_actual_rows
+-----------------------
40000
(1 row)
@@ -955,43 +982,6 @@ $$);
(1 row)
rollback to settings;
--- Hash join reuses the HOT status bit to indicate match status. This can only
--- be guaranteed to produce correct results if all the hash join tuple match
--- bits are reset before reuse. This is done upon loading them into the
--- hashtable.
-SAVEPOINT settings;
-SET enable_parallel_hash = on;
-SET min_parallel_table_scan_size = 0;
-SET parallel_setup_cost = 0;
-SET parallel_tuple_cost = 0;
-CREATE TABLE hjtest_matchbits_t1(id int);
-CREATE TABLE hjtest_matchbits_t2(id int);
-INSERT INTO hjtest_matchbits_t1 VALUES (1);
-INSERT INTO hjtest_matchbits_t2 VALUES (2);
--- Update should create a HOT tuple. If this status bit isn't cleared, we won't
--- correctly emit the NULL-extended unmatching tuple in full hash join.
-UPDATE hjtest_matchbits_t2 set id = 2;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
- id | id
-----+----
- 1 |
- | 2
-(2 rows)
-
--- Test serial full hash join.
--- Resetting parallel_setup_cost should force a serial plan.
--- Just to be safe, however, set enable_parallel_hash to off, as parallel full
--- hash joins are only supported with shared hashtables.
-RESET parallel_setup_cost;
-SET enable_parallel_hash = off;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
- id | id
-----+----
- 1 |
- | 2
-(2 rows)
-
-ROLLBACK TO settings;
rollback;
-- Verify that hash key expressions reference the correct
-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 06bab7a4c7..ef30afd747 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -53,10 +53,36 @@ begin
end;
$$;
+-- Return the number of actual rows returned by a query. This is useful when a
+-- count(*) would not execute the desired codepath but a SELECT * would produce
+-- an excessive number of rows.
+create or replace function join_hash_actual_rows(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+ elements jsonb;
+begin
+ execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+ if jsonb_array_length(elements) > 1 then
+ raise exception 'Cannot return actual rows from more than one plan';
+ end if;
+ return elements->0->'Plan'->'Actual Rows';
+end;
+$$;
+
-- Make a simple relation with well distributed keys and correctly
-- estimated size.
-create table simple as
- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+create table simple (id int, value text);
+insert into simple values (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable. To test this, update simple, creating a HOT tuple. If this status
+-- bit isn't cleared, we won't correctly emit the NULL-extended unmatching
+-- tuple in full hash join.
+update simple set id = 1;
+insert into simple select generate_series(2, 20000), 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
alter table simple set (parallel_workers = 2);
analyze simple;
@@ -179,6 +205,7 @@ set local max_parallel_workers_per_gather = 2;
set local work_mem = '192kB';
set local hash_mem_multiplier = 1.0;
set local enable_parallel_hash = on;
+set local parallel_tuple_cost = 0;
explain (costs off)
select count(*) from simple r join simple s using (id);
select count(*) from simple r join simple s using (id);
@@ -188,7 +215,8 @@ $$
select count(*) from simple r join simple s using (id);
$$);
-- parallel full multi-batch hash join
-select count(*) from simple r full outer join simple s using (id);
+select join_hash_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
rollback to settings;
-- The "bad" case: during execution we need to increase number of
@@ -460,8 +488,9 @@ rollback to settings;
savepoint settings;
set local max_parallel_workers_per_gather = 0;
explain (costs off)
- select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
-select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ select * from simple r full outer join simple s on (r.id = 0 - s.id);
+select join_hash_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
rollback to settings;
-- parallelism not possible with parallel-oblivious full hash join
@@ -476,9 +505,11 @@ rollback to settings;
-- parallelism is possible with parallel-aware full hash join
savepoint settings;
set local max_parallel_workers_per_gather = 2;
+set local parallel_tuple_cost = 0;
explain (costs off)
- select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
-select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ select * from simple r full outer join simple s on (r.id = 0 - s.id);
+select join_hash_actual_rows(
+ 'select * from simple r full outer join simple s on (r.id = 0 - s.id)');
rollback to settings;
@@ -506,33 +537,6 @@ $$
$$);
rollback to settings;
-
--- Hash join reuses the HOT status bit to indicate match status. This can only
--- be guaranteed to produce correct results if all the hash join tuple match
--- bits are reset before reuse. This is done upon loading them into the
--- hashtable.
-SAVEPOINT settings;
-SET enable_parallel_hash = on;
-SET min_parallel_table_scan_size = 0;
-SET parallel_setup_cost = 0;
-SET parallel_tuple_cost = 0;
-CREATE TABLE hjtest_matchbits_t1(id int);
-CREATE TABLE hjtest_matchbits_t2(id int);
-INSERT INTO hjtest_matchbits_t1 VALUES (1);
-INSERT INTO hjtest_matchbits_t2 VALUES (2);
--- Update should create a HOT tuple. If this status bit isn't cleared, we won't
--- correctly emit the NULL-extended unmatching tuple in full hash join.
-UPDATE hjtest_matchbits_t2 set id = 2;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
--- Test serial full hash join.
--- Resetting parallel_setup_cost should force a serial plan.
--- Just to be safe, however, set enable_parallel_hash to off, as parallel full
--- hash joins are only supported with shared hashtables.
-RESET parallel_setup_cost;
-SET enable_parallel_hash = off;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
-ROLLBACK TO settings;
-
rollback;
-- Verify that hash key expressions reference the correct
--
2.37.2
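
A side note on the new test setup: the comments added to join_hash.sql above rely on the UPDATE of simple producing a heap-only tuple, whose status bit the hash join match flag reuses. As an illustrative sketch, not part of the patch, one could confirm that by hand with the pageinspect extension:

-- Sketch only, assuming the pageinspect extension is available: show the raw
-- infomask flags for the tuples on the first page of "simple" right after the
-- UPDATE. The updated row should carry HEAP_ONLY_TUPLE, i.e. the bit that the
-- hash join match flag is overlaid on.
create extension if not exists pageinspect;
select items.lp, items.t_ctid, flags.raw_flags
  from heap_page_items(get_raw_page('simple', 0)) items,
       lateral heap_tuple_infomask_flags(items.t_infomask, items.t_infomask2) flags
 where items.t_infomask is not null;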
On Wed, Apr 19, 2023 at 08:47:07PM -0400, Melanie Plageman wrote:
On Wed, Apr 19, 2023 at 8:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Apr 19, 2023 at 12:20:51PM -0700, Andres Freund wrote:
On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
Ultimately this is probably fine. If we wanted to modify one of the
existing tests to cover the multi-batch case, changing the select
count(*) to a select * would do the trick. I imagine we wouldn't want to
do this because of the excessive output this would produce. I wondered
if there was a pattern in the tests for getting around this.

You could use explain (ANALYZE). But the output is machine-dependent in
various ways (which is why the tests use "explain analyze" so rarely).

I think with sufficient options it's not machine specific.

It *can* be machine specific depending on the node type..
In particular, for parallel workers, it shows "Workers Launched: ..",
which can vary even across executions on the same machine. And don't
forget about "loops=".

Plus:
src/backend/commands/explain.c: "Buckets: %d Batches: %d Memory Usage: %ldkB\n",

We have a bunch of
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
in our tests.

There's 81 uses of "timing off", out of a total of ~1600 explains. Most
of them are in partition_prune.sql. explain analyze is barely used.

I sent a patch to elide the machine-specific parts, which would make it
easier to use. But there was no interest.

While I don't know about other use cases, I would have used that here.
Do you still have that patch lying around? I'd be interested to at
least review it.
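
To make the trade-off above concrete, here is an illustrative invocation, not taken from the thread, using the regression test's simple table; even with all of these options, an ANALYZE of a parallel plan still prints run-dependent details:

-- Sketch only: the stabilizing EXPLAIN options discussed above. COSTS OFF,
-- SUMMARY OFF and TIMING OFF hide most variable output, but a parallel plan
-- still reports "Workers Launched: N", per-node "loops=N", and the Hash
-- node's "Buckets: ... Batches: ... Memory Usage: ..." line, any of which
-- can vary across machines or even across runs on the same machine.
explain (analyze, costs off, summary off, timing off)
  select count(*) from simple r join simple s using (id);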
I noticed that BF animal conchuela has several times fallen over on the
test case added by 558c9d75f:
diff -U3 /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/expected/join_hash.out /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/results/join_hash.out
--- /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/expected/join_hash.out 2023-04-19 10:20:26.159840000 +0200
+++ /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/results/join_hash.out 2023-04-19 10:21:47.971900000 +0200
@@ -974,8 +974,8 @@
SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
id | id
----+----
- 1 |
| 2
+ 1 |
(2 rows)
-- Test serial full hash join.
Considering that this is a parallel plan, I don't think there's any
mystery about why an ORDER-BY-less query might have unstable output
order; the only mystery is why more of the buildfarm hasn't failed.
Can we just add "ORDER BY t1.id" to this query? It looks like you
get the same PHJ plan, although now underneath Sort/Gather Merge.
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-04-19%2008%3A20%3A56
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-05-03%2006%3A21%3A03
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2023-05-19%2022%3A21%3A04
On Fri, May 19, 2023 at 8:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I noticed that BF animal conchuela has several times fallen over on the
test case added by 558c9d75f:

diff -U3 /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/expected/join_hash.out /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/results/join_hash.out
--- /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/expected/join_hash.out 2023-04-19 10:20:26.159840000 +0200
+++ /home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/results/join_hash.out 2023-04-19 10:21:47.971900000 +0200
@@ -974,8 +974,8 @@
SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
 id | id
----+----
- 1 |
| 2
+ 1 |
(2 rows)

-- Test serial full hash join.
Considering that this is a parallel plan, I don't think there's any
mystery about why an ORDER-BY-less query might have unstable output
order; the only mystery is why more of the buildfarm hasn't failed.
Can we just add "ORDER BY t1.id" to this query? It looks like you
get the same PHJ plan, although now underneath Sort/Gather Merge.
Yes, this was an oversight on my part. Attached is the patch that does
just what you suggested.
I can't help but take this opportunity to bump my un-reviewed patch
further upthread which adds additional test coverage for match bit
clearing for multi-batch hash joins [1]. It happens to also remove the
test that failed on the buildfarm, which is why I thought to bring it
up.
-- Melanie
[1]: https://postgr.es/m/CAAKRu_bdwDN_aHVctHcc9VoDP9av7LUMeuLbch1fHD2ESouw1g@mail.gmail.com
Attachments:
fix-matchbit-test-case.patch (text/x-patch)
From d7b357c0f2e75bdfee1a58f98fae09702d8533d2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 7 Jun 2023 16:42:44 -0400
Subject: [PATCH v1] Fix hash full join test case with ORDER BY
558c9d7 added test coverage for hash join match bit initialization but
neglected to include an ORDER BY in the parallel hash join test query
leading to sporadic buildfarm failures.
Reported by Tom Lane
Discussion: https://postgr.es/m/623596.1684541098%40sss.pgh.pa.us
---
src/test/regress/expected/join_hash.out | 3 ++-
src/test/regress/sql/join_hash.sql | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index e892e7cccb..4faf010f8c 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -971,7 +971,8 @@ INSERT INTO hjtest_matchbits_t2 VALUES (2);
-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
-- correctly emit the NULL-extended unmatching tuple in full hash join.
UPDATE hjtest_matchbits_t2 set id = 2;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
id | id
----+----
1 |
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 06bab7a4c7..e73f645e9e 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -523,7 +523,8 @@ INSERT INTO hjtest_matchbits_t2 VALUES (2);
-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
-- correctly emit the NULL-extended unmatching tuple in full hash join.
UPDATE hjtest_matchbits_t2 set id = 2;
-SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
-- Test serial full hash join.
-- Resetting parallel_setup_cost should force a serial plan.
-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
--
2.37.2
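
As a rough illustration, not taken from the thread, of how the reordered query should plan given the test's parallel cost settings: the ORDER BY leaves the Parallel Hash Full Join in place and just layers a Sort and Gather Merge on top of it, as confirmed downthread, so the match-bit coverage is retained.

-- Sketch only, assuming the parallel cost settings from the surrounding test
-- block are still in effect. The expected shape is roughly
--   Gather Merge -> Sort (Sort Key: t1.id) -> Parallel Hash Full Join,
-- i.e. the ordered query still exercises the parallel full hash join path.
EXPLAIN (COSTS OFF)
SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
  ORDER BY t1.id;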
On Wed, Jun 07, 2023 at 05:16:12PM -0400, Melanie Plageman wrote:
On Fri, May 19, 2023 at 8:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Considering that this is a parallel plan, I don't think there's any
mystery about why an ORDER-BY-less query might have unstable output
order; the only mystery is why more of the buildfarm hasn't failed.
Can we just add "ORDER BY t1.id" to this query? It looks like you
get the same PHJ plan, although now underneath Sort/Gather Merge.

Yes, this was an oversight on my part. Attached is the patch that does
just what you suggested.
Confirmed that adding an ORDER BY adds a Sort node between a Gather
Merge and a Parallel Hash Full Join, not removing coverage.
This has fallen through the cracks and conchuela has failed again
today, so I went ahead and applied the fix on HEAD. Thanks!
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
This has fallen through the cracks and conchuela has failed again
today, so I went ahead and applied the fix on HEAD. Thanks!
Thanks! I'd intended to push that but it didn't get to the
top of the to-do queue yet. (I'm still kind of wondering why
only conchuela has failed to date.)
regards, tom lane
On Sun, Jun 11, 2023 at 11:24 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 07, 2023 at 05:16:12PM -0400, Melanie Plageman wrote:
On Fri, May 19, 2023 at 8:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Considering that this is a parallel plan, I don't think there's any
mystery about why an ORDER-BY-less query might have unstable output
order; the only mystery is why more of the buildfarm hasn't failed.
Can we just add "ORDER BY t1.id" to this query? It looks like you
get the same PHJ plan, although now underneath Sort/Gather Merge.Yes, this was an oversight on my part. Attached is the patch that does
just what you suggested.Confirmed that adding an ORDER BY adds a Sort node between a Gather
Merge and a Parallel Hash Full Join, not removing coverage.This has fallen through the cracks and conchuela has failed again
today, so I went ahead and applied the fix on HEAD. Thanks!
Thanks!