Avoiding hash join batch explosions with extreme skew and weird stats

Started by Thomas Munroover 6 years ago62 messages

thomas.munro@gmail.com

over 6 years ago

1 attachment(s)

Hello,

As discussed elsewhere[1]/messages/by-id/CAG_=8kBoWY4AXwW=Cj44xe13VZnYohV9Yr-_hvZdx2xpiipr9w@mail.gmail.com[2]/messages/by-id/20190504003414.bulcbnge3rhwhcsh@development, our algorithm for deciding when to give
up on repartitioning (AKA increasing the number of batches) tends to
keep going until it has a number of batches that is a function of the
number of distinct well distributed keys. I wanted to move this minor
issue away from Tomas Vondra's thread[2]/messages/by-id/20190504003414.bulcbnge3rhwhcsh@development since it's a mostly
independent problem.

SET max_parallel_workers_per_gather = 0;
SET synchronize_seqscans = off;
SET work_mem = '4MB';

CREATE TABLE r AS SELECT generate_series(1, 10000000)::int i;
ANALYZE r;

-- 1k uniform keys + 1m duplicates
CREATE TABLE s1k (i int);
INSERT INTO s1k SELECT generate_series(1, 1000)::int i;
ALTER TABLE s1k SET (autovacuum_enabled = off);
ANALYZE s1k;
INSERT INTO s1k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s1k USING (i);

Buckets: 1048576 (originally 1048576)
Batches: 4096 (originally 16)
Memory Usage: 35157kB

-- 10k uniform keys + 1m duplicates
CREATE TABLE s10k (i int);
INSERT INTO s10k SELECT generate_series(1, 10000)::int i;
ALTER TABLE s10k SET (autovacuum_enabled = off);
ANALYZE s10k;
INSERT INTO s10k SELECT 42 FROM generate_series(1, 1000000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM r JOIN s10k USING (i);

Buckets: 131072 (originally 131072)
Batches: 32768 (originally 16)
Memory Usage: 35157kB

See how the number of batches is determined by the number of uniform
keys in r? That's because the explosion unfolds until there is
*nothing left* but keys that hash to the same value in the problem
batch, which means those uniform keys have to keep spreading out until
there is something on the order of two batches per key. The point is
that it's bounded only by input data (or eventually INT_MAX / 2 and
MaxAllocSize), and as Tomas has illuminated, batches eat unmetered
memory. Ouch.

Here's a quick hack to show that a 95% cut-off fixes those examples.
I don't really know how to choose the number, but I suspect it should
be much closer to 100 than 50. I think this is the easiest of three
fundamental problems that need to be solved in this area. The others
are: accounting for per-partition overheads as Tomas pointed out, and
providing an actual fallback strategy that respects work_mem when
extreme skew is detected OR per-partition overheads dominate. I plan
to experiment with nested loop hash join (or whatever you want to call
it: the thing where you join every arbitrary fragment of the hash
table against the outer batch, and somehow deal with outer match
flags) when time permits.

[1]: /messages/by-id/CAG_=8kBoWY4AXwW=Cj44xe13VZnYohV9Yr-_hvZdx2xpiipr9w@mail.gmail.com
[2]: /messages/by-id/20190504003414.bulcbnge3rhwhcsh@development

--
Thomas Munro
https://enterprisedb.com

Attachments:

fix.patchapplication/octet-stream; name=fix.patchDownload

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 64eec91f8b..04019fefbe 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -43,6 +43,13 @@
 #include "utils/syscache.h"
 
 
+/*
+ * If repartitioning a batch sends more than this fraction of the tuples
+ * to either child batch, then assume that further repartitioning is unlikely
+ * to be useful.
+ */
+#define EXTREME_SKEW_LIMIT ((double) 0.95)
+
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
 static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable);
@@ -1030,14 +1037,15 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 #endif
 
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
+	 * If we dumped out almost all or none of the tuples in the table, disable
 	 * further expansion of nbatch.  This situation implies that we have
 	 * enough tuples of identical hashvalues to overflow spaceAllowed.
 	 * Increasing nbatch will not fix it since there's no way to subdivide the
 	 * group any more finely. We have to just gut it out and hope the server
 	 * has enough RAM.
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
+	if (nfreed < (ninmemory * (1 - EXTREME_SKEW_LIMIT)) ||
+		(double) nfreed > (ninmemory * EXTREME_SKEW_LIMIT))
 	{
 		hashtable->growEnabled = false;
 #ifdef HJDEBUG

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Thomas Munro (#1)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, May 16, 2019 at 01:22:31PM +1200, Thomas Munro wrote:

...

Here's a quick hack to show that a 95% cut-off fixes those examples.
I don't really know how to choose the number, but I suspect it should
be much closer to 100 than 50. I think this is the easiest of three
fundamental problems that need to be solved in this area. The others
are: accounting for per-partition overheads as Tomas pointed out, and
providing an actual fallback strategy that respects work_mem when
extreme skew is detected OR per-partition overheads dominate. I plan
to experiment with nested loop hash join (or whatever you want to call
it: the thing where you join every arbitrary fragment of the hash
table against the outer batch, and somehow deal with outer match
flags) when time permits.

I think this is a step in the right direction, but as I said on the other
thread(s), I think we should not disable growth forever and recheck once
in a while. Otherwise we'll end up in sad situation with non-uniform data
sets, as poined out by Hubert Zhang in [1]/messages/by-id/CAB0yrekv=6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA@mail.gmail.com. It's probably even truer with
this less strict logic, using 95% as a threshold (instead of 100%).

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.

The problem however is that we only really look at a single bit - it may
be that doubling the batches would not help, but doing it twice would
actually reduce the memory usage. For example, assume there are 2 distinct
values in the batch, with hash values (in binary)

101010000
101010111

and assume we currently. Clearly, splitting batches is going to do nothing
until we get to the 000 vs. 111 parts.

At first I thought this is rather unlikely and we can ignore that, but I'm
not really sure about that - it may actually be pretty likely. We may get
to 101010 bucket with sufficiently large data set, and then it's ~50%
probability the next bit is the same (assuming two distinct values). So
this may be quite an issue, I think.

regards

[1]: /messages/by-id/CAB0yrekv=6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA@mail.gmail.com

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Tomas Vondra (#2)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think this is a step in the right direction, but as I said on the other
thread(s), I think we should not disable growth forever and recheck once
in a while. Otherwise we'll end up in sad situation with non-uniform data
sets, as poined out by Hubert Zhang in [1]. It's probably even truer with
this less strict logic, using 95% as a threshold (instead of 100%).

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.

But that's not OK, we need to fix THAT.

The problem however is that we only really look at a single bit - it may
be that doubling the batches would not help, but doing it twice would
actually reduce the memory usage. For example, assume there are 2 distinct
values in the batch, with hash values (in binary)

Yes, that's a good point, and not a case that we should ignore. But
if we had a decent fall-back strategy that respected work_mem, we
wouldn't care so much if we get it wrong in a corner case. I'm
arguing that we should use Grace partitioning as our primary
partitioning strategy, but fall back to looping (or possibly
sort-merging) for the current batch if Grace doesn't seem to be
working. You'll always be able to find cases where if you'd just
tried one more round, Grace would work, but that seems acceptable to
me, because getting it wrong doesn't melt your computer, it just
probably takes longer. Or maybe it doesn't. How much longer would it
take to loop twice? Erm, twice as long, and each loop makes actual
progress, unlike extra speculative Grace partition expansions which
apply not just to the current batch but all batches, might not
actually work, and you *have* to abandon at some point. The more I
think about it, the more I think that a loop-base escape valve, though
unpalatably quadratic, is probably OK because we're in a sink-or-swim
situation at this point, and our budget is work_mem, not work_time.

I'm concerned that we're trying to find ways to treat the symptoms,
allowing us to exceed work_mem but maybe not so much, instead of
focusing on the fundamental problem, which is that we don't yet have
an algorithm that is guaranteed to respect work_mem.

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

--
Thomas Munro
https://enterprisedb.com

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Thomas Munro (#3)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Fri, May 17, 2019 at 10:21:56AM +1200, Thomas Munro wrote:

On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think this is a step in the right direction, but as I said on the other
thread(s), I think we should not disable growth forever and recheck once
in a while. Otherwise we'll end up in sad situation with non-uniform data
sets, as poined out by Hubert Zhang in [1]. It's probably even truer with
this less strict logic, using 95% as a threshold (instead of 100%).

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.

But that's not OK, we need to fix THAT.

I agree increasing the budget is not ideal, althought at the moment it's
the only thing we can do. If we can improve that, great.

The problem however is that we only really look at a single bit - it may
be that doubling the batches would not help, but doing it twice would
actually reduce the memory usage. For example, assume there are 2 distinct
values in the batch, with hash values (in binary)

Yes, that's a good point, and not a case that we should ignore. But
if we had a decent fall-back strategy that respected work_mem, we
wouldn't care so much if we get it wrong in a corner case. I'm
arguing that we should use Grace partitioning as our primary
partitioning strategy, but fall back to looping (or possibly
sort-merging) for the current batch if Grace doesn't seem to be
working. You'll always be able to find cases where if you'd just
tried one more round, Grace would work, but that seems acceptable to
me, because getting it wrong doesn't melt your computer, it just
probably takes longer. Or maybe it doesn't. How much longer would it
take to loop twice? Erm, twice as long, and each loop makes actual
progress, unlike extra speculative Grace partition expansions which
apply not just to the current batch but all batches, might not
actually work, and you *have* to abandon at some point. The more I
think about it, the more I think that a loop-base escape valve, though
unpalatably quadratic, is probably OK because we're in a sink-or-swim
situation at this point, and our budget is work_mem, not work_time.

True.

I'm concerned that we're trying to find ways to treat the symptoms,
allowing us to exceed work_mem but maybe not so much, instead of
focusing on the fundamental problem, which is that we don't yet have
an algorithm that is guaranteed to respect work_mem.

Yes, that's a good point.

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

That pesky parallelism ;-)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Thomas Munro (#3)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.

But that's not OK, we need to fix THAT.

I don't think it's necessarily a good idea to suppose that we MUST
fit in work_mem come what may. It's likely impossible to guarantee
that in all cases. Even if we can, a query that runs for eons will
help nobody.

regards, tom lane

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Tom Lane (#5)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.

But that's not OK, we need to fix THAT.

I don't think it's necessarily a good idea to suppose that we MUST
fit in work_mem come what may. It's likely impossible to guarantee
that in all cases. Even if we can, a query that runs for eons will
help nobody.

I kinda agree with Thomas - arbitrarily increasing work_mem is something
we should not do unless abosolutely necessary. If the query is slow, it's
up to the user to bump the value up, if deemed appropriate.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Tomas Vondra (#6)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Fri, May 17, 2019 at 11:46 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, May 16, 2019 at 06:58:43PM -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I kinda like the idea with increasing the spaceAllowed value. Essentially,
if we decide adding batches would be pointless, increasing the memory
budget is the only thing we can do anyway.

But that's not OK, we need to fix THAT.

I don't think it's necessarily a good idea to suppose that we MUST
fit in work_mem come what may. It's likely impossible to guarantee
that in all cases. Even if we can, a query that runs for eons will
help nobody.

I kinda agree with Thomas - arbitrarily increasing work_mem is something
we should not do unless abosolutely necessary. If the query is slow, it's
up to the user to bump the value up, if deemed appropriate.

I think we can gaurantee that we can fit in work_mem with only one
exception: we have to allow work_mem to be exceeded when we otherwise
couldn't fit a single tuple.

Then the worst possible case with the looping algorithm is that we
degrade to loading just one inner tuple at a time into the hash table,
at which point we effectively have a nested loop join (except (1) it's
flipped around: for each tuple on the inner side, we scan the outer
side; and (2) we can handle full outer joins). In any reasonable case
you'll have a decent amount of tuples at a time, so you won't have to
loop too many times so it's not really quadratic in the number of
tuples. The realisation that it's a nested loop join in the extreme
case is probably why the MySQL people called it 'block nested loop
join' (and as far as I can tell from quick googling, it might be their
*primary* strategy for hash joins that don't fit in memory, not just a
secondary strategy after Grace fails, but I might be wrong about
that). Unlike plain old single-tuple nested loop join, it works in
arbitrary sized blocks (the hash table). What we would call a regular
hash join, they call a BNL that just happens to have only one loop. I
think Grace is probably a better primary strategy, but loops are a
good fallback.

The reason I kept mentioning sort-merge in earlier threads is because
it'd be better in the worst cases. Unfortunately it would be worse in
the best case (smallish numbers of loops) and I suspect many real
world cases. It's hard to decide, so perhaps we should be happy that
sort-merge can't be considered currently because the join conditions
may not be merge-joinable.

--
Thomas Munro
https://enterprisedb.com

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Thomas Munro (#3)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

Could you explain more about the implementation you are suggesting?

Specifically, what do you mean "BufFile full of match bits in sync with the
tuples for outer joins?"

Is the implementation you are thinking of one which falls back to NLJ on a
batch-by-batch basis decided during the build phase?
If so, why do you need to keep track of the outer tuples seen?
If you are going to loop through the whole outer side for each tuple on the
inner side, it seems like you wouldn't need to.

Could you make an outer "batch" which is the whole of the outer relation?
That
is, could you do something like: when hashing the inner side, if
re-partitioning
is resulting in batches that will overflow spaceAllowed, could you set a
flag on
that batch use_NLJ and when making batches for the outer side, make one
"batch"
that has all the tuples from the outer side which the inner side batch
which was
flagged will do NLJ with.

--
Melanie Plageman

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#8)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

Could you explain more about the implementation you are suggesting?

Specifically, what do you mean "BufFile full of match bits in sync with the
tuples for outer joins?"

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table. Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table. Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

We have a couple of relatively easy cases:

* Inner joins: for every outer tuple, we try to find a match in the
hash table, and if we find one we emit a tuple. To add looping
support, if we run out of memory when loading the hash table we can
just proceed to probe the fragment we've managed to load so far, and
then rewind the outer batch, clear the hash table and load in the next
work_mem-sized fragment and do it again... rinse and repeat until
we've eventually processed the whole inner batch. After we've
finished looping, we move on to the next batch.

* For right and full joins ("HJ_FILL_INNER"), we also need to emit an
inner tuple for every tuple that was loaded into the hash table but
never matched. That's done using a flag HEAP_TUPLE_HAS_MATCH in the
header of the tuples of the hash table, and a scan through the whole
hash table at the end of each batch to look for unmatched tuples
(ExecScanHashTableForUnmatched()). To add looping support, that just
has to be done at the end of every inner batch fragment, that is,
after every loop.

And now for the cases that need a new kind of match bit, as far as I can see:

* For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
outer tuple for every tuple that didn't find a match in the hash
table. Normally that is done while probing, without any need for
memory or match flags: if we don't find a match, we just spit out an
outer tuple immediately. But that simple strategy won't work if the
hash table holds only part of the inner batch. Since we'll be
rewinding and looping over the outer batch again for the next inner
batch fragment, we can't yet say if there will be a match in a later
loop. But the later loops don't know on their own either. So we need
some kind of cumulative memory between loops, and we only know which
outer tuples have a match after we've finished all loops. So there
would need to be a new function ExecScanOuterBatchForUnmatched().

* For semi joins, we need to emit exactly one outer tuple whenever
there is one or more match on the inner side. To add looping support,
we need to make sure that we don't emit an extra copy of the outer
tuple if there is a second match in another inner batch fragment.
Again, this implies some kind of memory between loops, so we can
suppress later matches.

* For anti joins, we need to emit an outer tuple whenever there is no
match. To add looping support, we need to wait until we've seen all
the inner batch fragments before we know that a given outer tuple has
no match, perhaps with the same new function
ExecScanOuterBatchForUnmatched().

So, we need some kind of inter-loop memory, but we obviously don't
want to create another source of unmetered RAM gobbling. So one idea
is a BufFile that has one bit per outer tuple in the batch. In the
first loop, we just stream out the match results as we go, and then
somehow we OR the bitmap with the match results in subsequent loops.
After the last loop, we have a list of unmatched tuples -- just scan
it in lock-step with the outer batch and look for 0 bits.

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic. So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop. That'd
keep the tuples and match bits together, but it seems like a lot of
IO... Note that parallel hash doesn't support right/full joins today,
because of some complications about waiting and deadlocks that might
turn out to be relevant here too, and might be solvable (I should
probably write about that in another email), but left joins *are*
supported today so would need to be desupported if we wanted to add
loop-based escape valve but not deal with with these problems. That
doesn't seem acceptable, which is why I'm a bit stuck on this point,
and unfortunately it may be a while before I have time to tackle any
of that personally.

Is the implementation you are thinking of one which falls back to NLJ on a
batch-by-batch basis decided during the build phase?

Yeah.

If so, why do you need to keep track of the outer tuples seen?
If you are going to loop through the whole outer side for each tuple on the
inner side, it seems like you wouldn't need to.

The idea is to loop through the whole outer batch for every
work_mem-sized inner batch fragment, not every tuple. Though in
theory it could be as small as a single tuple.

Could you make an outer "batch" which is the whole of the outer relation? That
is, could you do something like: when hashing the inner side, if re-partitioning
is resulting in batches that will overflow spaceAllowed, could you set a flag on
that batch use_NLJ and when making batches for the outer side, make one "batch"
that has all the tuples from the outer side which the inner side batch which was
flagged will do NLJ with.

I didn't understand this... you always need to make one outer batch
corresponding to every inner batch. The problem is the tricky
left/full/anti/semi join cases when joining against fragments holding
less that the full inner batch: we still need some way to implement
join logic that depends on knowing whether there is a match in *any*
of the inner fragments/loops.

About the question of when exactly to set the "use_NLJ" flag: I had
originally been thinking of this only as a way to deal with the
extreme skew problem. But in light of Tomas's complaints about
unmetered per-batch memory overheads, I had a new thought: it should
also be triggered whenever doubling the number of batches would halve
the amount of memory left for the hash table (after including the size
of all those BufFile objects in the computation as Tomas proposes). I
think that might be exactly the right right cut-off if you want to do
as much Grace partitioning as your work_mem can afford, and therefore
as little looping as possible to complete the join while respecting
work_mem.

--
Thomas Munro
https://enterprisedb.com

#10

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Thomas Munro (#9)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:

On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

Could you explain more about the implementation you are suggesting?

Specifically, what do you mean "BufFile full of match bits in sync with the
tuples for outer joins?"

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table. Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table. Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

I think you're conflating inner/outer side and left/right, or rather
assuming it's always left=inner and right=outer.

... snip ...

Could you make an outer "batch" which is the whole of the outer relation? That
is, could you do something like: when hashing the inner side, if re-partitioning
is resulting in batches that will overflow spaceAllowed, could you set a flag on
that batch use_NLJ and when making batches for the outer side, make one "batch"
that has all the tuples from the outer side which the inner side batch which was
flagged will do NLJ with.

I didn't understand this... you always need to make one outer batch
corresponding to every inner batch. The problem is the tricky
left/full/anti/semi join cases when joining against fragments holding
less that the full inner batch: we still need some way to implement
join logic that depends on knowing whether there is a match in *any*
of the inner fragments/loops.

About the question of when exactly to set the "use_NLJ" flag: I had
originally been thinking of this only as a way to deal with the
extreme skew problem. But in light of Tomas's complaints about
unmetered per-batch memory overheads, I had a new thought: it should
also be triggered whenever doubling the number of batches would halve
the amount of memory left for the hash table (after including the size
of all those BufFile objects in the computation as Tomas proposes). I
think that might be exactly the right right cut-off if you want to do
as much Grace partitioning as your work_mem can afford, and therefore
as little looping as possible to complete the join while respecting
work_mem.

Not sure what NLJ flag rule you propose, exactly.

Regarding the threshold value - once the space for BufFiles (and other
overhead) gets over work_mem/2, it does not make any sense to increase
the number of batches because then the work_mem would be entirely
occupied by BufFiles.

The WIP patches don't actually do exactly that though - they just check
if the incremented size would be over work_mem/2. I think we should
instead allow up to work_mem*2/3, i.e. stop adding batches after the
BufFiles start consuming more than work_mem/3 memory.

I think that's actually what you mean by "halving the amount of memory
left for the hash table" because that's what happens after reaching the
work_mem/3.

But I think that rule is irrelevant here, really, because this thread
was discussing cases where adding batches is futile due to skew, no? In
which case we should stop adding batches after reaching some % of tuples
not moving from the batch.

Or are you suggesting we should remove that rule, and instead realy on
this rule about halving the hash table space? That might work too, I
guess.

OTOH I'm not sure it's a good idea to handle both those cases the same
way - "overflow file" idea works pretty well for cases where the hash
table actually can be split into batches, and I'm afraid NLJ will be
much less efficient for those cases.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Tomas Vondra (#10)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Mon, May 20, 2019 at 12:22 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table. Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table. Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

I think you're conflating inner/outer side and left/right, or rather
assuming it's always left=inner and right=outer.

In PostgreSQL, it's always inner = right, outer = left. You can see
that reflected in plannodes.h and elsewhere:

/* ----------------
* these are defined to avoid confusion problems with "left"
* and "right" and "inner" and "outer". The convention is that
* the "left" plan is the "outer" plan and the "right" plan is
* the inner plan, but these make the code more readable.
* ----------------
*/
#define innerPlan(node) (((Plan *)(node))->righttree)
#define outerPlan(node) (((Plan *)(node))->lefttree)

I'm not sure you think it's not always like that: are you referring to
the fact that the planner can choose to reverse the join (compared to
the SQL LEFT|RIGHT JOIN that appeared in the query), creating an extra
layer of confusion? In my email I was talking only about left and
right as seen by the executor.

About the question of when exactly to set the "use_NLJ" flag: I had
originally been thinking of this only as a way to deal with the
extreme skew problem. But in light of Tomas's complaints about
unmetered per-batch memory overheads, I had a new thought: it should
also be triggered whenever doubling the number of batches would halve
the amount of memory left for the hash table (after including the size
of all those BufFile objects in the computation as Tomas proposes). I
think that might be exactly the right right cut-off if you want to do
as much Grace partitioning as your work_mem can afford, and therefore
as little looping as possible to complete the join while respecting
work_mem.

Not sure what NLJ flag rule you propose, exactly.

Regarding the threshold value - once the space for BufFiles (and other
overhead) gets over work_mem/2, it does not make any sense to increase
the number of batches because then the work_mem would be entirely
occupied by BufFiles.

The WIP patches don't actually do exactly that though - they just check
if the incremented size would be over work_mem/2. I think we should
instead allow up to work_mem*2/3, i.e. stop adding batches after the
BufFiles start consuming more than work_mem/3 memory.

I think that's actually what you mean by "halving the amount of memory
left for the hash table" because that's what happens after reaching the
work_mem/3.

Well, instead of an arbitrary number like work_mem/2 or work_mem *
2/3, I was trying to figure out the precise threshold beyond which it
doesn't make sense to expend more memory on BufFile objects, even if
the keys are uniformly distributed so that splitting batches halves
the expect tuple count per batch. Let work_mem_for_hash_table =
work_mem - nbatch * sizeof(BufFile). Whenever you increase nbatch,
work_mem_for_hash_table goes down, but it had better be more than half
what it was before, or we expect to run out of memory again (if the
batch didn't fit before, and we're now splitting it so that we'll try
to load only half of it, we'd better have more than half the budget
for the hash table than we had before). Otherwise you'd be making
matters worse, and this process probably won't terminate.

But I think that rule is irrelevant here, really, because this thread
was discussing cases where adding batches is futile due to skew, no? In
which case we should stop adding batches after reaching some % of tuples
not moving from the batch.

Yeah, this thread started off just about the 95% thing, but veered off
course since these topics are tangled up. Sorry.

Or are you suggesting we should remove that rule, and instead realy on
this rule about halving the hash table space? That might work too, I
guess.

No, I suspect you need both rules. We still want to detect extreme
skew soon as possible, even though the other rule will eventually
fire; might as well do it sooner in clear-cut cases.

OTOH I'm not sure it's a good idea to handle both those cases the same
way - "overflow file" idea works pretty well for cases where the hash
table actually can be split into batches, and I'm afraid NLJ will be
much less efficient for those cases.

Yeah, you might be right about that, and everything I'm describing is
pure vapourware anyway. But your overflow file scheme isn't exactly
free of IO-amplification and multiple-processing of input data
either... and I haven't yet grokked how it would work for parallel
hash. Parallel hash generally doesn't have the
'throw-the-tuples-forward' concept. which is inherently based on
sequential in-order processing of batches.

--
Thomas Munro
https://enterprisedb.com

#12

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Thomas Munro (#11)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Hi,

On 2019-05-20 13:25:52 +1200, Thomas Munro wrote:

In PostgreSQL, it's always inner = right, outer = left. You can see
that reflected in plannodes.h and elsewhere:

/* ----------------
* these are defined to avoid confusion problems with "left"
* and "right" and "inner" and "outer". The convention is that
* the "left" plan is the "outer" plan and the "right" plan is
* the inner plan, but these make the code more readable.
* ----------------
*/
#define innerPlan(node) (((Plan *)(node))->righttree)
#define outerPlan(node) (((Plan *)(node))->lefttree)

I really don't understand why we don't just rename those fields.

Greetings,

Andres Freund

#13

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Thomas Munro (#11)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Mon, May 20, 2019 at 01:25:52PM +1200, Thomas Munro wrote:

On Mon, May 20, 2019 at 12:22 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table. Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table. Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

I think you're conflating inner/outer side and left/right, or rather
assuming it's always left=inner and right=outer.

In PostgreSQL, it's always inner = right, outer = left. You can see
that reflected in plannodes.h and elsewhere:

/* ----------------
* these are defined to avoid confusion problems with "left"
* and "right" and "inner" and "outer". The convention is that
* the "left" plan is the "outer" plan and the "right" plan is
* the inner plan, but these make the code more readable.
* ----------------
*/
#define innerPlan(node) (((Plan *)(node))->righttree)
#define outerPlan(node) (((Plan *)(node))->lefttree)

I'm not sure you think it's not always like that: are you referring to
the fact that the planner can choose to reverse the join (compared to
the SQL LEFT|RIGHT JOIN that appeared in the query), creating an extra
layer of confusion? In my email I was talking only about left and
right as seen by the executor.

It might be my lack of understanding, but I'm not sure how we map
LEFT/RIGHT JOIN to left/righttree and inner/outer at plan level. My
assumption was that for "a LEFT JOIN b" then "a" and "b" can end up
both as inner and outer (sub)tree.

But I haven't checked so I may easily be wrong. Maybe the comment you
quoted clarifies that, not sure.

About the question of when exactly to set the "use_NLJ" flag: I had
originally been thinking of this only as a way to deal with the
extreme skew problem. But in light of Tomas's complaints about
unmetered per-batch memory overheads, I had a new thought: it should
also be triggered whenever doubling the number of batches would halve
the amount of memory left for the hash table (after including the size
of all those BufFile objects in the computation as Tomas proposes). I
think that might be exactly the right right cut-off if you want to do
as much Grace partitioning as your work_mem can afford, and therefore
as little looping as possible to complete the join while respecting
work_mem.

Not sure what NLJ flag rule you propose, exactly.

Regarding the threshold value - once the space for BufFiles (and other
overhead) gets over work_mem/2, it does not make any sense to increase
the number of batches because then the work_mem would be entirely
occupied by BufFiles.

The WIP patches don't actually do exactly that though - they just check
if the incremented size would be over work_mem/2. I think we should
instead allow up to work_mem*2/3, i.e. stop adding batches after the
BufFiles start consuming more than work_mem/3 memory.

I think that's actually what you mean by "halving the amount of memory
left for the hash table" because that's what happens after reaching the
work_mem/3.

Well, instead of an arbitrary number like work_mem/2 or work_mem *
2/3, I was trying to figure out the precise threshold beyond which it
doesn't make sense to expend more memory on BufFile objects, even if
the keys are uniformly distributed so that splitting batches halves
the expect tuple count per batch. Let work_mem_for_hash_table =
work_mem - nbatch * sizeof(BufFile). Whenever you increase nbatch,
work_mem_for_hash_table goes down, but it had better be more than half
what it was before, or we expect to run out of memory again (if the
batch didn't fit before, and we're now splitting it so that we'll try
to load only half of it, we'd better have more than half the budget
for the hash table than we had before). Otherwise you'd be making
matters worse, and this process probably won't terminate.

But the work_mem/3 does exactly that.

Let's say BufFiles need a bit less than work_mem/3. That means we have
a bit more than 2*work_mem/3 for the hash table. If you double the number
of batches, then you'll end up with a bit more than 2*work_mem/3. That is,
we've not halved the hash table size.

If BufFiles need a bit more memory than work_mem/3, then after doubling
the number of batches we'll end up with less than half the initial hash
table space.

So I think work_mem/3 is the threshold we're looking for.

But I think that rule is irrelevant here, really, because this thread
was discussing cases where adding batches is futile due to skew, no? In
which case we should stop adding batches after reaching some % of tuples
not moving from the batch.

Yeah, this thread started off just about the 95% thing, but veered off
course since these topics are tangled up. Sorry.

Or are you suggesting we should remove that rule, and instead realy on
this rule about halving the hash table space? That might work too, I
guess.

No, I suspect you need both rules. We still want to detect extreme
skew soon as possible, even though the other rule will eventually
fire; might as well do it sooner in clear-cut cases.

Right, I agree. I think we need the 95% rule (or whatever) to handle the
cases with skew / many duplicates, and then the overflow files to handle
underestimates with uniform distribution (or some other solution).

OTOH I'm not sure it's a good idea to handle both those cases the same
way - "overflow file" idea works pretty well for cases where the hash
table actually can be split into batches, and I'm afraid NLJ will be
much less efficient for those cases.

Yeah, you might be right about that, and everything I'm describing is
pure vapourware anyway. But your overflow file scheme isn't exactly
free of IO-amplification and multiple-processing of input data
either... and I haven't yet grokked how it would work for parallel
hash. Parallel hash generally doesn't have the
'throw-the-tuples-forward' concept. which is inherently based on
sequential in-order processing of batches.

Sure, let's do some math.

With the overflow scheme, the amplification is roughly ~2x (relative to
master), because we need to write data for most batches first into the
overflow file and then to the correct one. Master has wrte aplification
about ~1.25x (due to the gradual increase of batches), so the "total"
amplification is ~2.5x.

For the NLJ, the amplification fully depends on what fraction of the hash
table fits into work_mem. For example when it needs to be split into 32
fragments, we have ~32x amplification. It might affect just some batches,
of course.

So I still think those approaches are complementary and we need both.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Thomas Munro (#9)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Sun, May 19, 2019 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com>

wrote:

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

Could you explain more about the implementation you are suggesting?

Specifically, what do you mean "BufFile full of match bits in sync with

the

tuples for outer joins?"

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table. Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table. Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

We have a couple of relatively easy cases:

* Inner joins: for every outer tuple, we try to find a match in the
hash table, and if we find one we emit a tuple. To add looping
support, if we run out of memory when loading the hash table we can
just proceed to probe the fragment we've managed to load so far, and
then rewind the outer batch, clear the hash table and load in the next
work_mem-sized fragment and do it again... rinse and repeat until
we've eventually processed the whole inner batch. After we've
finished looping, we move on to the next batch.

* For right and full joins ("HJ_FILL_INNER"), we also need to emit an
inner tuple for every tuple that was loaded into the hash table but
never matched. That's done using a flag HEAP_TUPLE_HAS_MATCH in the
header of the tuples of the hash table, and a scan through the whole
hash table at the end of each batch to look for unmatched tuples
(ExecScanHashTableForUnmatched()). To add looping support, that just
has to be done at the end of every inner batch fragment, that is,
after every loop.

And now for the cases that need a new kind of match bit, as far as I can
see:

* For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
outer tuple for every tuple that didn't find a match in the hash
table. Normally that is done while probing, without any need for
memory or match flags: if we don't find a match, we just spit out an
outer tuple immediately. But that simple strategy won't work if the
hash table holds only part of the inner batch. Since we'll be
rewinding and looping over the outer batch again for the next inner
batch fragment, we can't yet say if there will be a match in a later
loop. But the later loops don't know on their own either. So we need
some kind of cumulative memory between loops, and we only know which
outer tuples have a match after we've finished all loops. So there
would need to be a new function ExecScanOuterBatchForUnmatched().

* For semi joins, we need to emit exactly one outer tuple whenever
there is one or more match on the inner side. To add looping support,
we need to make sure that we don't emit an extra copy of the outer
tuple if there is a second match in another inner batch fragment.
Again, this implies some kind of memory between loops, so we can
suppress later matches.

* For anti joins, we need to emit an outer tuple whenever there is no
match. To add looping support, we need to wait until we've seen all
the inner batch fragments before we know that a given outer tuple has
no match, perhaps with the same new function
ExecScanOuterBatchForUnmatched().

So, we need some kind of inter-loop memory, but we obviously don't
want to create another source of unmetered RAM gobbling. So one idea
is a BufFile that has one bit per outer tuple in the batch. In the
first loop, we just stream out the match results as we go, and then
somehow we OR the bitmap with the match results in subsequent loops.
After the last loop, we have a list of unmatched tuples -- just scan
it in lock-step with the outer batch and look for 0 bits.

That makes sense. Thanks for the detailed explanation.

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic. So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop. That'd
keep the tuples and match bits together, but it seems like a lot of
IO...

If you set the has_match flag in the tuple header itself, wouldn't you only
need to write the tuples from the outer batch back out that don't have
matches?

If so, why do you need to keep track of the outer tuples seen?
If you are going to loop through the whole outer side for each tuple on

the

inner side, it seems like you wouldn't need to.

The idea is to loop through the whole outer batch for every
work_mem-sized inner batch fragment, not every tuple. Though in
theory it could be as small as a single tuple.

Could you make an outer "batch" which is the whole of the outer

relation? That

is, could you do something like: when hashing the inner side, if

re-partitioning

is resulting in batches that will overflow spaceAllowed, could you set a

flag on

that batch use_NLJ and when making batches for the outer side, make one

"batch"

that has all the tuples from the outer side which the inner side batch

which was

flagged will do NLJ with.

I didn't understand this... you always need to make one outer batch
corresponding to every inner batch. The problem is the tricky
left/full/anti/semi join cases when joining against fragments holding
less that the full inner batch: we still need some way to implement
join logic that depends on knowing whether there is a match in *any*
of the inner fragments/loops.

Sorry, my suggestion was inaccurate and unclear: I was basically suggesting
that once you have all batches created for outer and inner sides, for a
given inner side batch that does not fit in memory, for each outer tuple in
the corresponding outer batch file, load and join all of the chunks of the
inner batch file. That way, before you emit that tuple, you have checked
all of the corresponding inner batch.

Thinking about it now, I realize that that would be worse in all cases than
what you are thinking of -- joining the outer side batch with the inner
side batch chunk that fits in memory and marking the BufFile bit
representing that outer side tuple as "matched" and only emitting it with a
NULL from the inner side after all chunks have been processed.

--
Melanie Plageman

#15

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Thomas Munro (#9)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Sun, May 19, 2019 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com>

wrote:

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

Could you explain more about the implementation you are suggesting?

Specifically, what do you mean "BufFile full of match bits in sync with

the

tuples for outer joins?"

First let me restate the PostgreSQL terminology for this stuff so I
don't get confused while talking about it:

* The inner side of the join = the right side = the side we use to
build a hash table. Right and full joins emit inner tuples when there
is no matching tuple on the outer side.

* The outer side of the join = the left side = the side we use to
probe the hash table. Left and full joins emit outer tuples when
there is no matching tuple on the inner side.

* Semi and anti joins emit exactly one instance of each outer tuple if
there is/isn't at least one match on the inner side.

We have a couple of relatively easy cases:

* Inner joins: for every outer tuple, we try to find a match in the
hash table, and if we find one we emit a tuple. To add looping
support, if we run out of memory when loading the hash table we can
just proceed to probe the fragment we've managed to load so far, and
then rewind the outer batch, clear the hash table and load in the next
work_mem-sized fragment and do it again... rinse and repeat until
we've eventually processed the whole inner batch. After we've
finished looping, we move on to the next batch.

* For right and full joins ("HJ_FILL_INNER"), we also need to emit an
inner tuple for every tuple that was loaded into the hash table but
never matched. That's done using a flag HEAP_TUPLE_HAS_MATCH in the
header of the tuples of the hash table, and a scan through the whole
hash table at the end of each batch to look for unmatched tuples
(ExecScanHashTableForUnmatched()). To add looping support, that just
has to be done at the end of every inner batch fragment, that is,
after every loop.

And now for the cases that need a new kind of match bit, as far as I can
see:

* For left and full joins ("HJ_FILL_OUTER"), we also need to emit an
outer tuple for every tuple that didn't find a match in the hash
table. Normally that is done while probing, without any need for
memory or match flags: if we don't find a match, we just spit out an
outer tuple immediately. But that simple strategy won't work if the
hash table holds only part of the inner batch. Since we'll be
rewinding and looping over the outer batch again for the next inner
batch fragment, we can't yet say if there will be a match in a later
loop. But the later loops don't know on their own either. So we need
some kind of cumulative memory between loops, and we only know which
outer tuples have a match after we've finished all loops. So there
would need to be a new function ExecScanOuterBatchForUnmatched().

* For semi joins, we need to emit exactly one outer tuple whenever
there is one or more match on the inner side. To add looping support,
we need to make sure that we don't emit an extra copy of the outer
tuple if there is a second match in another inner batch fragment.
Again, this implies some kind of memory between loops, so we can
suppress later matches.

* For anti joins, we need to emit an outer tuple whenever there is no
match. To add looping support, we need to wait until we've seen all
the inner batch fragments before we know that a given outer tuple has
no match, perhaps with the same new function
ExecScanOuterBatchForUnmatched().

So, we need some kind of inter-loop memory, but we obviously don't
want to create another source of unmetered RAM gobbling. So one idea
is a BufFile that has one bit per outer tuple in the batch. In the
first loop, we just stream out the match results as we go, and then
somehow we OR the bitmap with the match results in subsequent loops.
After the last loop, we have a list of unmatched tuples -- just scan
it in lock-step with the outer batch and look for 0 bits.

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic. So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop. That'd
keep the tuples and match bits together, but it seems like a lot of
IO... Note that parallel hash doesn't support right/full joins today,
because of some complications about waiting and deadlocks that might
turn out to be relevant here too, and might be solvable (I should
probably write about that in another email), but left joins *are*
supported today so would need to be desupported if we wanted to add
loop-based escape valve but not deal with with these problems. That
doesn't seem acceptable, which is why I'm a bit stuck on this point,
and unfortunately it may be a while before I have time to tackle any
of that personally.

There was an off-list discussion at PGCon last week about doing this
hash looping strategy using the bitmap with match bits and solving the
parallel hashjoin problem by having tuple-identifying information
encoded in the bitmap which allowed each worker to indicate that an
outer tuple had a match when processing that inner side chunk and
then, at the end of the scan of the outer side, the bitmaps would be
OR'd together to represent a single view of the unmatched tuples from
that iteration.

I was talking to Jeff Davis about this on Saturday, and, he felt that
there might be a way to solve the problem differently if we thought of
the left join case as performing an inner join and an antijoin
instead.

Riffing on this idea a bit, I started trying to write a patch that
would basically emit a tuple if it matches and write the tuple out to
a file if it does not match. Then, after iterating through the outer
batch the first time for the first inner chunk, any tuples which do
not yet have a match are the only ones which need to be joined against
the other inner chunks. Instead of iterating through the outer side
original batch file, use the unmatched outer tuples file to do the
join against the next chunk. Repeat this for all chunks.

Could we not do this and avoid using the match bit? In the worst case,
you would have to write out all the tuples on the outer side (if none
match) nchunks times (chunk is the work_mem sized chunk of inner
loaded into the hashtable).

--
Melanie Plageman

#16

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#15)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Mon, Jun 3, 2019 at 5:10 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I was talking to Jeff Davis about this on Saturday, and, he felt that
there might be a way to solve the problem differently if we thought of
the left join case as performing an inner join and an antijoin
instead.

Riffing on this idea a bit, I started trying to write a patch that
would basically emit a tuple if it matches and write the tuple out to
a file if it does not match. Then, after iterating through the outer
batch the first time for the first inner chunk, any tuples which do
not yet have a match are the only ones which need to be joined against
the other inner chunks. Instead of iterating through the outer side
original batch file, use the unmatched outer tuples file to do the
join against the next chunk. Repeat this for all chunks.

I'm not sure that I understanding this proposal correctly, but if I am
then I think it doesn't work in the case where a single outer row
matches rows in many different inner chunks. When you "use the
unmatched outer tuples file to do the join against the next chunk,"
you deny any rows that have already matched the chance to produce
additional matches.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Thomas Munro (#9)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Sun, May 19, 2019 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic. So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop. That'd
keep the tuples and match bits together, but it seems like a lot of
IO...

So, I think the case you're worried about here is something like:

Gather
-> Parallel Hash Left Join
-> Parallel Seq Scan on a
-> Parallel Hash
-> Parallel Seq Scan on b

If I understand ExecParallelHashJoinPartitionOuter correctly, we're
going to hash all of a and put it into a set of batch files before we
even get started, so it's possible to identify precisely which tuple
we're talking about by just giving the batch number and the position
of the tuple within that batch. So while it's true that the
individual workers can't use the number of tuples they've read to know
where they are in the SharedTuplestore, maybe the SharedTuplestore
could just tell them. Then they could maintain a paged bitmap of the
tuples that they've matched to something, indexed by
position-within-the-tuplestore, and those bitmaps could be OR'd
together at the end.

Crazy idea, or...?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Robert Haas (#16)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 4, 2019 at 5:43 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jun 3, 2019 at 5:10 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I was talking to Jeff Davis about this on Saturday, and, he felt that
there might be a way to solve the problem differently if we thought of
the left join case as performing an inner join and an antijoin
instead.

Riffing on this idea a bit, I started trying to write a patch that
would basically emit a tuple if it matches and write the tuple out to
a file if it does not match. Then, after iterating through the outer
batch the first time for the first inner chunk, any tuples which do
not yet have a match are the only ones which need to be joined against
the other inner chunks. Instead of iterating through the outer side
original batch file, use the unmatched outer tuples file to do the
join against the next chunk. Repeat this for all chunks.

I'm not sure that I understanding this proposal correctly, but if I am
then I think it doesn't work in the case where a single outer row
matches rows in many different inner chunks. When you "use the
unmatched outer tuples file to do the join against the next chunk,"
you deny any rows that have already matched the chance to produce
additional matches.

Oops! You are totally right.
I will amend the idea:
For each chunk on the inner side, loop through both the original batch
file and the unmatched outer tuples file created for the last chunk.
Emit any matches and write out any unmatched tuples to a new unmatched
outer tuples file.

I think, in the worst case, if no tuples from the outer have a match,
you end up writing out all of the outer tuples for each chunk on the
inner side. However, using the match bit in the tuple header solution
would require this much writing.
Probably the bigger problem is that in this worst case you would also
need to read double the number of outer tuples for each inner chunk.

However, in the best case it seems like it would be better than the
match bit/write everything from the outer side out solution.

--
Melanie Plageman

#19

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#18)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 4, 2019 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Oops! You are totally right.
I will amend the idea:
For each chunk on the inner side, loop through both the original batch
file and the unmatched outer tuples file created for the last chunk.
Emit any matches and write out any unmatched tuples to a new unmatched
outer tuples file.

I think, in the worst case, if no tuples from the outer have a match,
you end up writing out all of the outer tuples for each chunk on the
inner side. However, using the match bit in the tuple header solution
would require this much writing.
Probably the bigger problem is that in this worst case you would also
need to read double the number of outer tuples for each inner chunk.

However, in the best case it seems like it would be better than the
match bit/write everything from the outer side out solution.

I guess so, but the downside of needing to read twice as many outer
tuples for each inner chunk seems pretty large. It would be a lot
nicer if we could find a way to store the matched-bits someplace other
than where we are storing the tuples, what Thomas called a
bits-in-order scheme, because then the amount of additional read and
write I/O would be tiny -- one bit per tuple doesn't add up very fast.

In the scheme you propose here, I think that after you read the
original outer tuples for each chunk and the unmatched outer tuples
for each chunk, you'll have to match up the unmatched tuples to the
original tuples, probably by using memcmp() or something. Otherwise,
when a new match occurs, you won't know which tuple should now not be
emitted into the new unmatched outer tuples file that you're going to
produce. So I think what's going to happen is that you'll read the
original batch file, then read the unmatched tuples file and use that
to set or not set a bit on each tuple in memory, then do the real work
setting more bits, then write out a new unmatched-tuples file with the
tuples that still don't have the bit set. So your unmatched tuple
file is basically a list of tuple identifiers in the least compact
form imaginable: the tuple is identified by the entire tuple contents.
That doesn't seem very appealing, although I expect that it would
still win for some queries.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Robert Haas (#17)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 4, 2019 at 6:05 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, May 19, 2019 at 7:07 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

Unfortunately that bits-in-order scheme doesn't work for parallel
hash, where the SharedTuplestore tuples seen by each worker are
non-deterministic. So perhaps in that case we could use the
HEAP_TUPLE_HAS_MATCH bit in the outer tuple header itself, and write
the whole outer batch back out each time through the loop. That'd
keep the tuples and match bits together, but it seems like a lot of
IO...

So, I think the case you're worried about here is something like:

Gather
-> Parallel Hash Left Join
-> Parallel Seq Scan on a
-> Parallel Hash
-> Parallel Seq Scan on b

If I understand ExecParallelHashJoinPartitionOuter correctly, we're
going to hash all of a and put it into a set of batch files before we
even get started, so it's possible to identify precisely which tuple
we're talking about by just giving the batch number and the position
of the tuple within that batch. So while it's true that the
individual workers can't use the number of tuples they've read to know
where they are in the SharedTuplestore, maybe the SharedTuplestore
could just tell them. Then they could maintain a paged bitmap of the
tuples that they've matched to something, indexed by
position-within-the-tuplestore, and those bitmaps could be OR'd
together at the end.

Crazy idea, or...?

That idea does sound like it could work. Basically a worker is given a
tuple and a bit index (process this tuple and if it matches go flip
the bit at position 30) in its own bitmap, right?

I need to spend some time understanding how SharedTupleStore works and
how workers get tuples, so what I'm saying might not make sense.

One question I have is, how would the OR'd together bitmap be
propagated to workers after the first chunk? That is, when there are
no tuples left in the outer bunch, for a given inner chunk, would you
load the bitmaps from each worker into memory, OR them together, and
then write the updated bitmap back out so that each worker starts with
the updated bitmap?

--
Melanie Plageman

#21

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#20)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 4, 2019 at 3:09 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

One question I have is, how would the OR'd together bitmap be
propagated to workers after the first chunk? That is, when there are
no tuples left in the outer bunch, for a given inner chunk, would you
load the bitmaps from each worker into memory, OR them together, and
then write the updated bitmap back out so that each worker starts with
the updated bitmap?

I was assuming we'd elect one participant to go read all the bitmaps,
OR them together, and generate all the required null-extended tuples,
sort of like the PHJ_BUILD_ALLOCATING, PHJ_GROW_BATCHES_ALLOCATING,
PHJ_GROW_BUCKETS_ALLOCATING, and/or PHJ_BATCH_ALLOCATING states only
involve one participant being active at a time. Now you could hope for
something better -- why not parallelize that work? But on the other
hand, why not start simple and worry about that in some future patch
instead of right away? A committed patch that does something good is
better than an uncommitted patch that does something AWESOME.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Robert Haas (#19)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 04, 2019 at 03:08:24PM -0400, Robert Haas wrote:

On Tue, Jun 4, 2019 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Oops! You are totally right.
I will amend the idea:
For each chunk on the inner side, loop through both the original batch
file and the unmatched outer tuples file created for the last chunk.
Emit any matches and write out any unmatched tuples to a new unmatched
outer tuples file.

I think, in the worst case, if no tuples from the outer have a match,
you end up writing out all of the outer tuples for each chunk on the
inner side. However, using the match bit in the tuple header solution
would require this much writing.
Probably the bigger problem is that in this worst case you would also
need to read double the number of outer tuples for each inner chunk.

However, in the best case it seems like it would be better than the
match bit/write everything from the outer side out solution.

I guess so, but the downside of needing to read twice as many outer
tuples for each inner chunk seems pretty large. It would be a lot
nicer if we could find a way to store the matched-bits someplace other
than where we are storing the tuples, what Thomas called a
bits-in-order scheme, because then the amount of additional read and
write I/O would be tiny -- one bit per tuple doesn't add up very fast.

In the scheme you propose here, I think that after you read the
original outer tuples for each chunk and the unmatched outer tuples
for each chunk, you'll have to match up the unmatched tuples to the
original tuples, probably by using memcmp() or something. Otherwise,
when a new match occurs, you won't know which tuple should now not be
emitted into the new unmatched outer tuples file that you're going to
produce. So I think what's going to happen is that you'll read the
original batch file, then read the unmatched tuples file and use that
to set or not set a bit on each tuple in memory, then do the real work
setting more bits, then write out a new unmatched-tuples file with the
tuples that still don't have the bit set. So your unmatched tuple
file is basically a list of tuple identifiers in the least compact
form imaginable: the tuple is identified by the entire tuple contents.
That doesn't seem very appealing, although I expect that it would
still win for some queries.

I wonder how big of an issue that actually is in practice. If this is
meant for significantly skewed data sets, which may easily cause OOM
(e.g. per the recent report, which restarted this discussion). So if we
still only expect to use this for rare cases, which may easily end up
with an OOM at the moment, the extra cost might be acceptable.

But if we plan to use this more widely (say, allow hashjoins even for
cases that we know won't fit into work_mem), then the extra cost would
be an issue. But even then it should be included in the cost estimate,
and switch the plan to a merge join when appropriate.

Of course, maybe there are many data sets with enough skew to consume
explosive growth and consume a lot of memory, but not enough to trigger
OOM. Those cases may get slower, but I think that's OK. If appropriate,
the user can increase work_mem and get the "good" plan.

FWIW this is a challenge for all approaches discussed in this thread,
not just this particular one. We're restricting the resources available
to the query, switching to something (likely) slower.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#23

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Robert Haas (#19)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 4, 2019 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 4, 2019 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Oops! You are totally right.
I will amend the idea:
For each chunk on the inner side, loop through both the original batch
file and the unmatched outer tuples file created for the last chunk.
Emit any matches and write out any unmatched tuples to a new unmatched
outer tuples file.

I think, in the worst case, if no tuples from the outer have a match,
you end up writing out all of the outer tuples for each chunk on the
inner side. However, using the match bit in the tuple header solution
would require this much writing.
Probably the bigger problem is that in this worst case you would also
need to read double the number of outer tuples for each inner chunk.

However, in the best case it seems like it would be better than the
match bit/write everything from the outer side out solution.

I guess so, but the downside of needing to read twice as many outer
tuples for each inner chunk seems pretty large. It would be a lot
nicer if we could find a way to store the matched-bits someplace other
than where we are storing the tuples, what Thomas called a
bits-in-order scheme, because then the amount of additional read and
write I/O would be tiny -- one bit per tuple doesn't add up very fast.

In the scheme you propose here, I think that after you read the
original outer tuples for each chunk and the unmatched outer tuples
for each chunk, you'll have to match up the unmatched tuples to the
original tuples, probably by using memcmp() or something. Otherwise,
when a new match occurs, you won't know which tuple should now not be
emitted into the new unmatched outer tuples file that you're going to
produce. So I think what's going to happen is that you'll read the
original batch file, then read the unmatched tuples file and use that
to set or not set a bit on each tuple in memory, then do the real work
setting more bits, then write out a new unmatched-tuples file with the
tuples that still don't have the bit set. So your unmatched tuple
file is basically a list of tuple identifiers in the least compact
form imaginable: the tuple is identified by the entire tuple contents.
That doesn't seem very appealing, although I expect that it would
still win for some queries.

I'm not sure I understand why you would need to compare the original
tuples to the unmatched tuples file.

This is the example I used to try and reason through it.

let's say you have a batch (you are joining two single column tables)
and your outer side is:
5,7,9,11,10,11
and your inner is:
7,10,7,12,5,9
and for the inner, let's say that only two values can fit in memory,
so it is split into 3 chunks:
7,10 | 7,12 | 5,9
The first time you iterate through the outer side (joining it to the
first chunk), you emit as matched
7,7
10,10
and write to unmatched tuples file
5
9
11
11
The second time you iterate through the outer side (joining it to the
second chunk) you emit as matched
7,7
Then, you iterate again through the outer side a third time to join it
to the unmatched tuples in the unmatched tuples file (from the first
chunk) and write the following to a new unmatched tuples file:
5
9
11
11
The fourth time you iterate through the outer side (joining it to the
third chunk), you emit as matched
5,5
9,9
Then you iterate a fifth time through the outer side to join it to the
unmatched tuples in the unmatched tuples file (from the second chunk)
and write the following to a new unmatched tuples file:
11
11
Now that all chunks from the inner side have been processed, you can
loop through the final unmatched tuples file, NULL-extend, and emit
them

Wouldn't that work?

--
Melanie Plageman

#24

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Robert Haas (#21)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 4, 2019 at 12:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 4, 2019 at 3:09 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

One question I have is, how would the OR'd together bitmap be
propagated to workers after the first chunk? That is, when there are
no tuples left in the outer bunch, for a given inner chunk, would you
load the bitmaps from each worker into memory, OR them together, and
then write the updated bitmap back out so that each worker starts with
the updated bitmap?

I was assuming we'd elect one participant to go read all the bitmaps,
OR them together, and generate all the required null-extended tuples,
sort of like the PHJ_BUILD_ALLOCATING, PHJ_GROW_BATCHES_ALLOCATING,
PHJ_GROW_BUCKETS_ALLOCATING, and/or PHJ_BATCH_ALLOCATING states only
involve one participant being active at a time. Now you could hope for
something better -- why not parallelize that work? But on the other
hand, why not start simple and worry about that in some future patch
instead of right away? A committed patch that does something good is
better than an uncommitted patch that does something AWESOME.

What if you have a lot of tuples -- couldn't the bitmaps get pretty
big? And then you have to OR them all together and if you can't put
the whole bitmap from each worker into memory at once to do it, it
seems like it would be pretty slow. (I mean maybe not as slow as
reading the outer side 5 times when you only have 3 chunks on the
inner + all the extra writes from my unmatched tuple file idea, but
still...)

--
Melanie Plageman

#25

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Thomas Munro (#3)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

Going back to the idea of using the match bit in the HashJoinTuple header
and writing out all of the outer side for every chunk of the inner
side, I was wondering if there was something we could do that was kind
of like mmap'ing the outer side file to give the workers in parallel
hashjoin the ability to update a match bit in the tuple in place and
avoid writing the whole outer side out each time.

--
Melanie Plageman

#26

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Melanie Plageman (#25)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, Jun 06, 2019 at 04:37:19PM -0700, Melanie Plageman wrote:

On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Admittedly I don't have a patch, just a bunch of handwaving. One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant. You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do. I'll
probably start another thread about that.

Going back to the idea of using the match bit in the HashJoinTuple header
and writing out all of the outer side for every chunk of the inner
side, I was wondering if there was something we could do that was kind
of like mmap'ing the outer side file to give the workers in parallel
hashjoin the ability to update a match bit in the tuple in place and
avoid writing the whole outer side out each time.

I think this was one of the things we discussed in Ottawa - we could pass
index of the tuple (in the batch) along with the tuple, so that each
worker know which bit to set.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#27

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Melanie Plageman (#24)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, Jun 06, 2019 at 04:33:31PM -0700, Melanie Plageman wrote:

On Tue, Jun 4, 2019 at 12:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 4, 2019 at 3:09 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

One question I have is, how would the OR'd together bitmap be
propagated to workers after the first chunk? That is, when there are
no tuples left in the outer bunch, for a given inner chunk, would you
load the bitmaps from each worker into memory, OR them together, and
then write the updated bitmap back out so that each worker starts with
the updated bitmap?

I was assuming we'd elect one participant to go read all the bitmaps,
OR them together, and generate all the required null-extended tuples,
sort of like the PHJ_BUILD_ALLOCATING, PHJ_GROW_BATCHES_ALLOCATING,
PHJ_GROW_BUCKETS_ALLOCATING, and/or PHJ_BATCH_ALLOCATING states only
involve one participant being active at a time. Now you could hope for
something better -- why not parallelize that work? But on the other
hand, why not start simple and worry about that in some future patch
instead of right away? A committed patch that does something good is
better than an uncommitted patch that does something AWESOME.

What if you have a lot of tuples -- couldn't the bitmaps get pretty
big? And then you have to OR them all together and if you can't put
the whole bitmap from each worker into memory at once to do it, it
seems like it would be pretty slow. (I mean maybe not as slow as
reading the outer side 5 times when you only have 3 chunks on the
inner + all the extra writes from my unmatched tuple file idea, but
still...)

Yes, they could get quite big, and I think you're right we need to
keep that in mind, because it's on the outer (often quite large) side of
the join. And if we're aiming to restrict memory usage, it'd be weird to
just ignore this.

But I think Thomas Munro originally proposed to treat this as a separate
BufFile, so my assumption was each worker would simply rewrite the bitmap
repeatedly for each hash table fragment. That means a bit more I/O, but as
those files are buffered and written in 8kB pages, with just 1 bit per
tuple. I think that's pretty OK and way cheaper that rewriting the whole
batch, where each tuple can be hundreds of bytes.

Also, it does not require any concurrency control, which rewriting the
batches themselves probably does (because we'd be feeding the tuples into
some shared file, I suppose). Except for the final step when we need to
merge the bitmaps, of course.

So I think this would work, it does not have the issue with using too much
memory, and I don't think the overhead is too bad.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#28

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#23)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, Jun 6, 2019 at 7:31 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I'm not sure I understand why you would need to compare the original
tuples to the unmatched tuples file.

I think I was confused. Actually, I'm still not sure I understand this part:

Then, you iterate again through the outer side a third time to join it
to the unmatched tuples in the unmatched tuples file (from the first
chunk) and write the following to a new unmatched tuples file:
5
9
11
11

and likewise here

Then you iterate a fifth time through the outer side to join it to the
unmatched tuples in the unmatched tuples file (from the second chunk)
and write the following to a new unmatched tuples file:
11
11

So you refer to joining the outer side to the unmatched tuples file,
but how would that tell you which outer tuples had no matches on the
inner side? I think what you'd need to do is anti-join the unmatched
tuples file to the current inner batch. So the algorithm would be
something like:

for each inner batch:
for each outer tuple:
if tuple matches inner batch then emit match
if tuple does not match inner batch and this is the first inner batch:
write tuple to unmatched tuples file
if this is not the first inner batch:
for each tuple from the unmatched tuples file:
if tuple does not match inner batch:
write to new unmatched tuples file
discard previous unmatched tuples file and use the new one for the
next iteration

for each tuple in the final unmatched tuples file:
null-extend and emit

If that's not what you have in mind, maybe you could provide some
similar pseudocode? Or you can just ignore me. I'm not trying to
interfere with an otherwise-fruitful discussion by being the only one
in the room who is confused...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Tomas Vondra (#27)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Fri, Jun 7, 2019 at 10:17 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Yes, they could get quite big, and I think you're right we need to
keep that in mind, because it's on the outer (often quite large) side of
the join. And if we're aiming to restrict memory usage, it'd be weird to
just ignore this.

But I think Thomas Munro originally proposed to treat this as a separate
BufFile, so my assumption was each worker would simply rewrite the bitmap
repeatedly for each hash table fragment. That means a bit more I/O, but as
those files are buffered and written in 8kB pages, with just 1 bit per
tuple. I think that's pretty OK and way cheaper that rewriting the whole
batch, where each tuple can be hundreds of bytes.

Yes, this is also my thought. I'm not 100% sure I understand
Melanie's proposal, but I think that it involves writing every
still-unmatched outer tuple for every inner batch. This proposal --
assuming we can get the tuple numbering worked out -- involves writing
a bit for every outer tuple for every inner batch. So each time you
do an inner batch, you write either (a) one bit for EVERY outer tuple
or (b) the entirety of each unmatched tuple. It's possible for the
latter to be cheaper if the number of unmatched tuples is really,
really tiny, but it's not very likely.

For example, suppose that you've got 4 batches and each batch matches
99% of the tuples, which are each 50 bytes wide. After each batch,
approach A writes 1 bit per tuple, so a total of 4 bits per tuple
after 4 batches. Approach B writes a different amount of data after
each batch. After the first batch, it writes 1% of the tuples, and
for each one written it writes 50 bytes, so it writes 50 bytes * 0.01
= ~4 bits/tuple. That's already equal to what approach A wrote after
all 4 batches, and it's going to do a little more I/O over the course
of the remaining matches - although not much, because the unmatched
tuples file will be very very tiny after we eliminate 99% of the 1%
that survived the first batch. However, these are extremely favorable
assumptions for approach B. If the tuples are wider or the batches
match only say 20% of the tuples, approach B is going to be waaaay
more I/O.

Assuming I understand correctly, which I may not.

Also, it does not require any concurrency control, which rewriting the
batches themselves probably does (because we'd be feeding the tuples into
some shared file, I suppose). Except for the final step when we need to
merge the bitmaps, of course.

I suppose that rewriting the batches -- or really the unmatched tuples
file -- could just use a SharedTuplestore, so we probably wouldn't
need a lot of new code for this. I don't know whether contention
would be a problem or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Robert Haas (#28)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Fri, Jun 7, 2019 at 7:30 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 6, 2019 at 7:31 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I'm not sure I understand why you would need to compare the original
tuples to the unmatched tuples file.

I think I was confused. Actually, I'm still not sure I understand this
part:

Then, you iterate again through the outer side a third time to join it
to the unmatched tuples in the unmatched tuples file (from the first
chunk) and write the following to a new unmatched tuples file:
5
9
11
11

and likewise here

Then you iterate a fifth time through the outer side to join it to the
unmatched tuples in the unmatched tuples file (from the second chunk)
and write the following to a new unmatched tuples file:
11
11

So you refer to joining the outer side to the unmatched tuples file,
but how would that tell you which outer tuples had no matches on the
inner side? I think what you'd need to do is anti-join the unmatched
tuples file to the current inner batch. So the algorithm would be
something like:

for each inner batch:
for each outer tuple:
if tuple matches inner batch then emit match
if tuple does not match inner batch and this is the first inner batch:
write tuple to unmatched tuples file
if this is not the first inner batch:
for each tuple from the unmatched tuples file:
if tuple does not match inner batch:
write to new unmatched tuples file
discard previous unmatched tuples file and use the new one for the
next iteration

for each tuple in the final unmatched tuples file:
null-extend and emit

If that's not what you have in mind, maybe you could provide some
similar pseudocode? Or you can just ignore me. I'm not trying to
interfere with an otherwise-fruitful discussion by being the only one
in the room who is confused...

Yep, the pseudo-code you have above is exactly what I was thinking. I
have been hacking around on my fork implementing this for the
non-parallel hashjoin (my idea was to implement a parallel-friendly
design but for the non-parallel-aware case and then go back and
implement it for the parallel-aware hashjoin later) and have some
thoughts.

I'll call the whole adaptive hashjoin fallback strategy "chunked
hashloop join" for the purposes of this description.
I'll abbreviate the three approaches we've discussed like this:

Approach A is using a separate data structure (a bitmap was the
suggested pick) to track the match status of each outer tuple

Approach B is the inner-join + anti-join writing out unmatched tuples
to a new file for every iteration through the outer side batch (for
each chunk of inner)

Approach C is setting a match bit in the tuple and then writing all
outer side tuples out for every iteration through the outer side (for
each chunk of inner)

To get started with I implemented the inner side chunking logic which
is required for all of the approaches. I did a super basic version
which only allows nbatches to be increased during the initial
hashtable build--not during loading of subsequent batches--if a batch
after batch 0 runs out of work_mem, it just loads what will fit and
saves the inner page offset in the hashjoin state.

Part of the allure of approaches B and C for me was that they seemed
like they would require less code complexity and concurrency control
because you could just write out the unmatched tuples (to probably a
SharedTupleStore) without having to care about their original order or
page offset. It seemed like it didn't require treating a spill file
like it permits random access nor treating the tuples as ordered in a
SharedTupleStore.

The benefit I saw of approach B over approach C was that, in the case
where more tuples are matches, it requires fewer writes than approach
C--at the cost of additional reads. It would require at most the same
number of writes as approach C.

Approach B turned out to be problematic for many reasons. First of
all, with approach B, you end up having to keep track of an additional
new spill file for unmatched outer tuples for every chunk of the inner
side. Each spill file could have a different number of tuples, so, any
reuse of the file seems difficult to get right. For approach C (which
I did not try to implement), it seems like you could get away with
only maintaining two spill files for the outer side--one to be read
from and one to write to. I'm sure it is more complicated than this.
However, it seemed like, for approach B you would need to create and
destroy entirely new unmatched tuple spill files for every chunk.

Approach B was not simpler when it came to the code complexity of the
state machine either -- you have to do something different for the
first chunk than the other chunks (write to the unmatched tups file
but read from the original spill file, whereas other chunks require
writing to the unmatched tups file and reading from the unmatched tups
file), which requires complexity in the state machine (and, I imagine,
worker orchestration in the parallel implementation). And, you still
have to process all of the unmatched tups, null-extend them, and emit
them before advancing the batch.

So, I decided to try out approach A. The crux of the idea (my
understanding of it, at least) is to keep a separate data structure
which has the match status of each outer tuple in the batch. The
discussion was to do this with a bitmap in a file, but, I started with
doing it with a list in memory.

What I have so far is a list of structs--one for each outer
tuple--where each struct has a match flag and the page offset of that
tuple in the outer spill file. I add each struct to the list when I am
getting each tuple from a spill file in HJ_NEED_NEW_OUTER state to
join to the first chunk of the inner, and, since I only do this when I
am getting an outer tuple from the spill file, I also grab the page
offset and set it in the struct in the list.

As I am creating the list, and, while processing each subsequent chunk
of the inner, if the tuple is a match, I set the match flag to true in
that outer tuple's member of the list.

Then, after finishing the whole inner batch, I loop through the list,
and, for each unmatched tuple, I go to that offset in the spill file
and get that tuple and NULL-extend and emit it.

(Currently, I have a problem with the list and it doesn't produce
correct results yet.)

Thinking about how to move from my list of offsets to using a bitmap,
I got confused.

Let me try to articulate what I think the bitmap implementation would look
like:

Before doing chunked hashloop join for any batch, we would need to
know how many tuples are in the outer batch to make the bitmap the
correct size.

We could do this either with one loop through the whole outer batch
file right before joining it to the inner batch (an extra loop).

Or we could try and do it during the first read of the outer relation
when processing batch 0 and keep a data structure with each batch
number mapped to the number of outer tuples spilled to that batch.

Then, once we have this number, before joining the outer to the first
chunk of the inner, we would generate a bitmap with ntuples in outer
batch number of bits and save it somewhere (eventually in a file,
initially in the hjstate).

Now, I am back to the original problem--how do you know which bit to
set without somehow numbering the tuples with a unique identifier? Is
there anything that uniquely identifies a spill file tuple except its
offset?

--
Melanie Plageman

#31

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#30)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 11, 2019 at 2:35 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Let me try to articulate what I think the bitmap implementation would look
like:

Before doing chunked hashloop join for any batch, we would need to
know how many tuples are in the outer batch to make the bitmap the
correct size.

I was thinking that we wouldn't need to know this, because if the
bitmap is in a file, we can always extend it. To imagine a needlessly
dumb implementation, consider:

set-bit(i):
let b = i / 8
while (b <= length of file in bytes)
append '\0' to file
read byte b from the file
modify the byte you read by setting bit i % 8
write the modified byte back to the file

In reality, we'd have some kind of buffer. I imagine locality of
reference would be pretty good, because the outer tuples are coming to
us in increasing-tuple-number order.

If you want to prototype with an in-memory implementation, I'd suggest
just pallocing 8kB initially and repallocing when the tuple number
gets too big. It'll be sorta inefficient, but who cares? It's
certainly way cheaper than an extra pass over the data, and for a POC
it should be fine.

Now, I am back to the original problem--how do you know which bit to
set without somehow numbering the tuples with a unique identifier? Is
there anything that uniquely identifies a spill file tuple except its
offset?

I don't think so. Approach A hinges on being able to get the tuple
number reliably and without contortions, and I have not tried to make
that work. So maybe it's really hard or not possible or something.
My intuition is that it ought to work, but that and a dollar will get
you cup of coffee, so...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Robert Haas (#31)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, Jun 13, 2019 at 7:10 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 11, 2019 at 2:35 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Let me try to articulate what I think the bitmap implementation would

look

like:

Before doing chunked hashloop join for any batch, we would need to
know how many tuples are in the outer batch to make the bitmap the
correct size.

I was thinking that we wouldn't need to know this, because if the
bitmap is in a file, we can always extend it. To imagine a needlessly
dumb implementation, consider:

set-bit(i):
let b = i / 8
while (b <= length of file in bytes)
append '\0' to file
read byte b from the file
modify the byte you read by setting bit i % 8
write the modified byte back to the file

In reality, we'd have some kind of buffer. I imagine locality of
reference would be pretty good, because the outer tuples are coming to
us in increasing-tuple-number order.

If you want to prototype with an in-memory implementation, I'd suggest
just pallocing 8kB initially and repallocing when the tuple number
gets too big. It'll be sorta inefficient, but who cares? It's
certainly way cheaper than an extra pass over the data, and for a POC
it should be fine.

That approach makes sense. I have attached the first draft of a patch
I wrote to do parallel-oblivious hashjoin fallback. I haven't switched
to using the approach with a bitmap (or bytemap :) yet because I found
that using a linked list was easier to debug for now.

(Also, I did things like include the value of the outer tuple
attribute in the linked list nodes and assumed it was an int because
that is what I have been testing with--this would definitely be blown
away with everything else that is just there to help me with debugging
right now).

I am refactoring it now to change the state machine to make more sense
before changing the representation of the match statuses.

So, specifically, I am interested in high-level gut checks on the
state machine I am currently implementing (not reflected in this
patch).

This patch adds only one state -- HJ_ADAPTIVE_EMIT_UNMATCHED-- which
duplicates the logic of HJ_FILL_OUTER_TUPLE. Also, in this patch, the
existing HJ_NEED_NEW_BATCH state is used for new chunks. After
separating the logic that advanced the batches from that which loaded
a batch, it felt like NEED_NEW_CHUNK did not need to be its own state.
When a new chunk is required, if more exist, then the next one should
be loaded and outer should be rewound. Rewinding of outer was already
being done (seek to the beginning of the outer spill file is the
equivalent of "loading" it).

Currently, I am tracking a lot of state in the HashJoinState, which is
fiddly and error-prone.

New state machine (questions posed below):
To refactor the state machine, I am thinking of adding a new state
HJ_NEED_NEW_INNER_CHUNK which we would transition to when outer batch
is over. We would load the new chunk, rewind the outer, and transition
to HJ_NEED_NEW_OUTER. However, we would have to emit unmatched inner
tuples for that chunk (in case of ROJ) before that transition to
HJ_NEED_NEW_OUTER. This feels a little less clean because the
HJ_FILL_INNER_TUPLES state is transitioned into when the inner batch
is over as well. And, in the current flow I am sketching out, if the
inner batch is exhausted, we check if we should emit NULL-extended
inner tuples and then check if we should emit NULL-extended outer
tuples (since both batches are exhausted), whereas when a single inner
chunk is done being processed, we only want to emit NULL-extended
tuples for the inner side. Not to mention HJ_NEED_NEW_INNER_CHUNK
would transition to HJ_NEED_NEW_OUTER directly instead of first
advancing the batches. This can all be hacked around with if
statements, but, my point here is that if I am refactoring the state
machine to be more clear, ideally, it would be more clear.

A similar problem happens with HJ_FILL_OUTER_TUPLE and the
non-fallback case. For the fallback case, with this implementation,
you must wait until after exhausting the inner side to emit
NULL-extended outer tuples. In the non-fallback case -- a batch which
can fit in memory or, always, for batch 0 -- the unmatched outer
tuples are emitted as they are encountered.

It makes most sense in the context of the state machine, as far as I
can tell, after exhausting both outer and inner batch, to emit
NULL-extended inner tuples for that chunk and then emit NULL-extended
outer tuples for that batch.

So, requiring an additional read of the outer side to emit
NULL-extended tuples at the end of the inner batch would slow things
down for the non-fallback case, however, it seems like special casing
the fallback case would make the state machine much more confusing --
basically like mashing two totally different state machines together.

These questions will probably make a lot more sense with corresponding
code, so I will follow up with the second version of the state machine
patch once I finish it.

--
Melanie Plageman

Attachments:

v1-0001-hashloop-fallback.patchtext/x-patch; charset=US-ASCII; name=v1-0001-hashloop-fallback.patchDownload

From 51d23d2a38a58f154958602e471458bbae5c38f7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 10 Jun 2019 10:54:42 -0700
Subject: [PATCH v1] hashloop fallback

First part is to "chunk" the inner file into arbitrary partitions of
work_mem size

This chunks inner file and makes it so that the offset is along tuple
bounds.

Note that this makes it impossible to increase nbatches during the
loading of batches after initial hashtable creation

In preparation for doing this chunking, separate advance batch and load
batch. advance batch only if page offset is reset to 0, then load that
part of the batch

Second part was to: implement outer tuple batch rewinding per chunk of
inner batch

Would be a simple rewind and replay of outer side for each chunk of
inner if it weren't for LOJ.
Because we need to wait to emit NULL-extended tuples for LOJ until after
all chunks of inner have been processed.
To do this, make a list with an entry for each outer tuple and keep
track of its match status. Also, keep track of its offset so that we can
access the file at that offset in case the tuples are not processed in
order (like in parallel case--not handled here but in anticipation of
such cases)
---
 src/backend/executor/nodeHashjoin.c       | 212 ++++++++++++++++++++--
 src/include/executor/hashjoin.h           |  10 +
 src/include/nodes/execnodes.h             |  10 +
 src/test/regress/expected/adaptive_hj.out | 209 +++++++++++++++++++++
 src/test/regress/sql/adaptive_hj.sql      |  31 ++++
 5 files changed, 456 insertions(+), 16 deletions(-)
 create mode 100644 src/test/regress/expected/adaptive_hj.out
 create mode 100644 src/test/regress/sql/adaptive_hj.sql

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 8484a287e7..7207ab1e57 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -127,6 +127,7 @@
 #define HJ_FILL_OUTER_TUPLE		4
 #define HJ_FILL_INNER_TUPLES	5
 #define HJ_NEED_NEW_BATCH		6
+#define HJ_ADAPTIVE_EMIT_UNMATCHED 7
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +144,13 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
-static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
+static bool ExecHashJoinAdvanceBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool LoadInnerBatch(HashJoinState *hjstate);
+static TupleTableSlot *ExecHashJoinGetOuterTupleAtOffset(HashJoinState *hjstate, off_t offset);
 
+static	OuterOffsetMatchStatus *cursor = NULL;
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -176,6 +180,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	int			batchno;
 	ParallelHashJoinState *parallel_state;
 
+
 	/*
 	 * get information from HashJoin node
 	 */
@@ -198,6 +203,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	 */
 	for (;;)
 	{
+
 		/*
 		 * It's possible to iterate this loop many times before returning a
 		 * tuple, in some pathological cases such as needing to move much of
@@ -368,9 +374,13 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						node->hj_JoinState = HJ_NEED_NEW_BATCH;
 					continue;
 				}
-
+				/*
+				 * only initialize this to false during the first chunk --
+				 * otherwise, we will be resetting a tuple that had a match to false
+				 */
+				if (node->first_chunk || hashtable->curbatch == 0)
+					node->hj_MatchedOuter = false;
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -410,6 +420,47 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				/*
+				 * We need to construct the linked list of match statuses on the first chunk.
+				 * Note that node->first_chunk isn't true until HJ_NEED_NEW_BATCH
+				 * so this means that we don't construct this list on batch 0.
+				 */
+				if (node->first_chunk)
+				{
+					BufFile *outerFile = hashtable->outerBatchFile[batchno];
+
+					if (outerFile != NULL)
+					{
+						OuterOffsetMatchStatus *outerOffsetMatchStatus = NULL;
+
+						outerOffsetMatchStatus = palloc(sizeof(struct OuterOffsetMatchStatus));
+						outerOffsetMatchStatus->match_status = false;
+						outerOffsetMatchStatus->outer_tuple_start_offset = 0L;
+						outerOffsetMatchStatus->next = NULL;
+
+						if (node->first_outer_offset_match_status != NULL)
+						{
+							node->current_outer_offset_match_status->next = outerOffsetMatchStatus;
+							node->current_outer_offset_match_status = outerOffsetMatchStatus;
+						}
+						else
+						{
+							node->first_outer_offset_match_status = outerOffsetMatchStatus;
+							node->current_outer_offset_match_status = node->first_outer_offset_match_status;
+						}
+
+						outerOffsetMatchStatus->outer_tuple_val = DatumGetInt32(outerTupleSlot->tts_values[0]);
+						outerOffsetMatchStatus->outer_tuple_start_offset = node->HJ_NEED_NEW_OUTER_tup_start;
+					}
+				}
+				else if (node->hj_HashTable->curbatch > 0)
+				{
+					if (node->current_outer_offset_match_status == NULL)
+						node->current_outer_offset_match_status = node->first_outer_offset_match_status;
+					else
+						node->current_outer_offset_match_status = node->current_outer_offset_match_status->next;
+				}
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -455,6 +506,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				{
 					node->hj_MatchedOuter = true;
 					HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+					if (node->current_outer_offset_match_status)
+						node->current_outer_offset_match_status->match_status = true;
 
 					/* In an antijoin, we never return a matched tuple */
 					if (node->js.jointype == JOIN_ANTI)
@@ -492,6 +545,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!node->hj_MatchedOuter &&
 					HJ_FILL_OUTER(node))
 				{
+					if (node->current_outer_offset_match_status)
+						break;
 					/*
 					 * Generate a fake join tuple with nulls for the inner
 					 * tuple, and return it if it passes the non-join quals.
@@ -543,12 +598,56 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-oblivious join */
+					if (node->inner_page_offset == 0L)
+					{
+						/*
+						 * This case is entered on two separate conditions:
+						 * when we need to load the first batch ever in this hash join;
+						 * or when we've exhausted the outer side of the current batch.
+						 */
+						if (node->first_outer_offset_match_status && HJ_FILL_OUTER(node))
+						{
+							node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED;
+							cursor = node->first_outer_offset_match_status;
+							break;
+						}
+
+						if (!ExecHashJoinAdvanceBatch(node))
+							return NULL;    /* end of parallel-oblivious join */
+					}
+					LoadInnerBatch(node);
+
+					if (node->first_chunk)
+						node->first_outer_offset_match_status = NULL;
+					node->current_outer_offset_match_status = NULL;
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
+			case HJ_ADAPTIVE_EMIT_UNMATCHED:
+				while (cursor)
+				{
+					TupleTableSlot *outer_unmatched_tup;
+					if (cursor->match_status == true)
+					{
+						cursor = cursor->next;
+						continue;
+					}
+					/*
+					 * if it is not a match, go to the offset in the page that it specifies
+					 * and emit it NULL-extended
+					 */
+					outer_unmatched_tup = ExecHashJoinGetOuterTupleAtOffset(node, cursor->outer_tuple_start_offset);
+					econtext->ecxt_outertuple = outer_unmatched_tup;
+					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					cursor = cursor->next;
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				}
+
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				node->first_outer_offset_match_status = NULL;
+				break;
+
 			default:
 				elog(ERROR, "unrecognized hashjoin state: %d",
 					 (int) node->hj_JoinState);
@@ -628,6 +727,11 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->js.ps.ExecProcNode = ExecHashJoin;
 	hjstate->js.jointype = node->join.jointype;
 
+	hjstate->inner_page_offset = 0L;
+	hjstate->HJ_NEED_NEW_OUTER_tup_start = 0L;
+	hjstate->HJ_NEED_NEW_OUTER_tup_end = 0L;
+	hjstate->current_outer_offset_match_status = NULL;
+	hjstate->first_outer_offset_match_status = NULL;
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -805,6 +909,28 @@ ExecEndHashJoin(HashJoinState *node)
 	ExecEndNode(innerPlanState(node));
 }
 
+static TupleTableSlot *
+ExecHashJoinGetOuterTupleAtOffset(HashJoinState *hjstate, off_t offset)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32 hashvalue;
+
+	BufFile    *file = hashtable->outerBatchFile[curbatch];
+	/* ? should fileno always be 0? */
+	if (BufFileSeek(file, 0, offset, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+						errmsg("could not rewind hash-join temporary file: %m")));
+
+	slot = ExecHashJoinGetSavedTuple(hjstate,
+									 file,
+									 &hashvalue,
+									 hjstate->hj_OuterTupleSlot);
+	return slot;
+}
+
 /*
  * ExecHashJoinOuterGetTuple
  *
@@ -951,20 +1077,17 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 }
 
 /*
- * ExecHashJoinNewBatch
+ * ExecHashJoinAdvanceBatch
  *		switch to a new hashjoin batch
  *
  * Returns true if successful, false if there are no more batches.
  */
 static bool
-ExecHashJoinNewBatch(HashJoinState *hjstate)
+ExecHashJoinAdvanceBatch(HashJoinState *hjstate)
 {
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
@@ -1039,10 +1162,31 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		curbatch++;
 	}
 
+	hjstate->inner_page_offset = 0L;
+	hjstate->first_chunk = true;
 	if (curbatch >= nbatch)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	return true;
+}
+
+/*
+ * Returns true if there are more chunks left, false otherwise
+ */
+static bool LoadInnerBatch(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int curbatch = hashtable->curbatch;
+	BufFile    *innerFile;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+
+	off_t tup_start_offset;
+	off_t chunk_start_offset;
+	off_t tup_end_offset;
+	int64 current_saved_size;
+	int current_fileno;
 
 	/*
 	 * Reload the hash table with the new inner batch (which could be empty)
@@ -1051,27 +1195,60 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 
 	innerFile = hashtable->innerBatchFile[curbatch];
 
+	/*
+	 * Reset this even if the innerfile is not null
+	 */
+	hjstate->first_chunk = hjstate->inner_page_offset == 0L;
+
 	if (innerFile != NULL)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		/* should fileno always be 0? */
+		if (BufFileSeek(innerFile, 0, hjstate->inner_page_offset, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not rewind hash-join temporary file: %m")));
 
+		chunk_start_offset = hjstate->inner_page_offset;
+		tup_end_offset = hjstate->inner_page_offset;
 		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
 												 innerFile,
 												 &hashvalue,
 												 hjstate->hj_HashTupleSlot)))
 		{
+			/* next tuple's start is last tuple's end */
+			tup_start_offset = tup_end_offset;
+			/* after we got the tuple, figure out what the offset is */
+			BufFileTell(innerFile, &current_fileno, &tup_end_offset);
+			current_saved_size = tup_end_offset - chunk_start_offset;
+			if (current_saved_size > work_mem)
+			{
+				hjstate->inner_page_offset = tup_start_offset;
+				/*
+				 * Rewind outer batch file (if present), so that we can start reading it.
+				 */
+				if (hashtable->outerBatchFile[curbatch] != NULL)
+				{
+					if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
+						ereport(ERROR,
+								(errcode_for_file_access(),
+										errmsg("could not rewind hash-join temporary file: %m")));
+				}
+				return true;
+			}
+			hjstate->inner_page_offset = tup_end_offset;
 			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
+			 * NOTE: some tuples may be sent to future batches.
+			 * With current hashloop patch, however, it is not possible
+			 * for hashtable->nbatch to be increased here
 			 */
 			ExecHashTableInsert(hashtable, slot, hashvalue);
 		}
 
+		// this is the end of the file
+		hjstate->inner_page_offset = 0L;
+
 		/*
-		 * after we build the hash table, the inner batch file is no longer
+		 * after we processed all chunks, the inner batch file is no longer
 		 * needed
 		 */
 		BufFileClose(innerFile);
@@ -1088,8 +1265,7 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 					(errcode_for_file_access(),
 					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
-
-	return true;
+	return false;
 }
 
 /*
@@ -1270,6 +1446,8 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 	uint32		header[2];
 	size_t		nread;
 	MinimalTuple tuple;
+	int dummy_fileno;
+
 
 	/*
 	 * We check for interrupts here because this is typically taken as an
@@ -1278,6 +1456,7 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 	 */
 	CHECK_FOR_INTERRUPTS();
 
+	BufFileTell(file, &dummy_fileno, &hjstate->HJ_NEED_NEW_OUTER_tup_start);
 	/*
 	 * Since both the hash value and the MinimalTuple length word are uint32,
 	 * we can read them both in one BufFileRead() call without any type
@@ -1304,6 +1483,7 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 				(errcode_for_file_access(),
 				 errmsg("could not read from hash-join temporary file: %m")));
 	ExecForceStoreMinimalTuple(tuple, tupleSlot, true);
+	BufFileTell(file, &dummy_fileno, &hjstate->HJ_NEED_NEW_OUTER_tup_end);
 	return tupleSlot;
 }
 
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 2c94b926d3..bd5aeba74c 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -59,6 +59,16 @@
  * if so, we just dump them out to the correct batch file.
  * ----------------------------------------------------------------
  */
+struct OuterOffsetMatchStatus;
+typedef struct OuterOffsetMatchStatus OuterOffsetMatchStatus;
+
+struct OuterOffsetMatchStatus
+{
+	bool match_status;
+	off_t outer_tuple_start_offset;
+	int32 outer_tuple_val;
+	struct OuterOffsetMatchStatus *next;
+};
 
 /* these are in nodes/execnodes.h: */
 /* typedef struct HashJoinTupleData *HashJoinTuple; */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..874fc47ffe 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -42,6 +42,8 @@ struct RangeTblEntry;			/* avoid including parsenodes.h here */
 struct ExprEvalStep;			/* avoid including execExpr.h everywhere */
 struct CopyMultiInsertBuffer;
 
+struct OuterOffsetMatchStatus;
+
 
 /* ----------------
  *		ExprState node
@@ -1899,6 +1901,14 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+
+	off_t inner_page_offset;
+	bool first_chunk;
+	struct OuterOffsetMatchStatus *first_outer_offset_match_status;
+	struct OuterOffsetMatchStatus *current_outer_offset_match_status;
+
+	off_t HJ_NEED_NEW_OUTER_tup_start;
+	off_t HJ_NEED_NEW_OUTER_tup_end;
 } HashJoinState;
 
 
diff --git a/src/test/regress/expected/adaptive_hj.out b/src/test/regress/expected/adaptive_hj.out
new file mode 100644
index 0000000000..4bdc681994
--- /dev/null
+++ b/src/test/regress/expected/adaptive_hj.out
@@ -0,0 +1,209 @@
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t1(a int);
+create table t2(b int);
+insert into t1 values(1),(2);
+insert into t2 values(2),(3);
+insert into t1 select i from generate_series(1,10)i;
+insert into t2 select i from generate_series(2,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+insert into t2 select 2 from generate_series(2,7)i;
+set work_mem=64;
+set enable_mergejoin to off;
+select * from t1 left outer join t2 on a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+  1 |   
+  1 |   
+(67 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+    67
+(1 row)
+
+select * from t1, t2 where a = b;
+ a  | b  
+----+----
+  5 |  5
+  3 |  3
+  3 |  3
+  4 |  4
+  7 |  7
+  6 |  6
+  9 |  9
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  8 |  8
+ 10 | 10
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+(65 rows)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+    65
+(1 row)
+
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+set work_mem=64;
+set enable_mergejoin to off;
+select * from t1 left outer join t2 on a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 1 |  
+(7 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+     7
+(1 row)
+
+select * from t1, t2 where a = b;
+ a | b 
+---+---
+ 3 | 3
+ 3 | 3
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+(6 rows)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+     6
+(1 row)
+
diff --git a/src/test/regress/sql/adaptive_hj.sql b/src/test/regress/sql/adaptive_hj.sql
new file mode 100644
index 0000000000..6b7c4d9eff
--- /dev/null
+++ b/src/test/regress/sql/adaptive_hj.sql
@@ -0,0 +1,31 @@
+drop table if exists t1;
+drop table if exists t2;
+create table t1(a int);
+create table t2(b int);
+
+insert into t1 values(1),(2);
+insert into t2 values(2),(3);
+insert into t1 select i from generate_series(1,10)i;
+insert into t2 select i from generate_series(2,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+insert into t2 select 2 from generate_series(2,7)i;
+set work_mem=64;
+set enable_mergejoin to off;
+
+select * from t1 left outer join t2 on a = b order by b;
+select count(*) from t1 left outer join t2 on a = b;
+select * from t1, t2 where a = b;
+select count(*) from t1, t2 where a = b;
+
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+
+set work_mem=64;
+set enable_mergejoin to off;
+
+select * from t1 left outer join t2 on a = b order by b;
+select count(*) from t1 left outer join t2 on a = b;
+select * from t1, t2 where a = b;
+select count(*) from t1, t2 where a = b;
-- 
2.21.0

#33

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#32)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 18, 2019 at 3:24 PM Melanie Plageman <melanieplageman@gmail.com>
wrote:

These questions will probably make a lot more sense with corresponding
code, so I will follow up with the second version of the state machine
patch once I finish it.

I have changed the state machine and resolved the questions I had
raised in the previous email. This seems to work for the parallel and
non-parallel cases. I have not yet rewritten the unmatched outer tuple
status as a bitmap in a spill file (for ease of debugging).

Before doing that, I wanted to ask what a desirable fallback condition
would be. In this patch, fallback to hashloop join happens only when
inserting tuples into the hashtable after batch 0 when inserting
another tuple from the batch file would exceed work_mem. This means
you can't increase nbatches, which, I would think is undesirable.

I thought a bit about when fallback should happen. So, let's say that
we would like to fallback to hashloop join when we have increased
nbatches X times. At that point, since we do not want to fall back to
hashloop join for all batches, we have to make a decision. After
increasing nbatches the Xth time, do we then fall back for all batches
for which inserting inner tuples exceeds work_mem? Do we use this
strategy but work_mem + some fudge factor?

Or, do we instead try to determine if data skew led us to increase
nbatches both times and then determine which batch, given new
nbatches, contains that data, set fallback to true only for that
batch, and let all other batches use the existing logic (with no
fallback option) unless they contain a value which leads to increasing
nbatches X number of times?

--
Melanie Plageman

Attachments:

v2-0001-hashloop-fallback.patchtext/x-patch; charset=US-ASCII; name=v2-0001-hashloop-fallback.patchDownload

From 2d6fec7d2bac90a41d4ec88ad5ac2011562a14a1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 10 Jun 2019 10:54:42 -0700
Subject: [PATCH v2] hashloop fallback

First part is to "chunk" the inner file into arbitrary partitions of
work_mem size

This chunks inner file and makes it so that the offset is along tuple
bounds.

Note that this makes it impossible to increase nbatches during the
loading of batches after initial hashtable creation

In preparation for doing this chunking, separate advance batch and load
batch. advance batch only if page offset is reset to 0, then load that
part of the batch

Second part was to: implement outer tuple batch rewinding per chunk of
inner batch

Would be a simple rewind and replay of outer side for each chunk of
inner if it weren't for LOJ.
Because we need to wait to emit NULL-extended tuples for LOJ until after
all chunks of inner have been processed.
To do this, make a list with an entry for each outer tuple and keep
track of its match status. Also, keep track of its offset so that we can
access the file at that offset in case the tuples are not processed in
order (like in parallel case)

For non-hashloop fallback scenario, this list should not be constructed
and unmatched outer tuples should be emitted as they are encountered.
---
 src/backend/executor/nodeHashjoin.c       | 379 ++++++++++++++++----
 src/include/executor/hashjoin.h           |  10 +
 src/include/nodes/execnodes.h             |  12 +
 src/test/regress/expected/adaptive_hj.out | 402 ++++++++++++++++++++++
 src/test/regress/parallel_schedule        |   2 +-
 src/test/regress/serial_schedule          |   1 +
 src/test/regress/sql/adaptive_hj.sql      |  39 +++
 7 files changed, 770 insertions(+), 75 deletions(-)
 create mode 100644 src/test/regress/expected/adaptive_hj.out
 create mode 100644 src/test/regress/sql/adaptive_hj.sql

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 8484a287e7..e46b453a9b 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -124,9 +124,11 @@
 #define HJ_BUILD_HASHTABLE		1
 #define HJ_NEED_NEW_OUTER		2
 #define HJ_SCAN_BUCKET			3
-#define HJ_FILL_OUTER_TUPLE		4
-#define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_FILL_INNER_TUPLES    4
+#define HJ_NEED_NEW_BATCH		5
+#define HJ_NEED_NEW_INNER_CHUNK 6
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT 7
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER 8
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +145,15 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
-static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
+static bool ExecHashJoinAdvanceBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool LoadInner(HashJoinState *hjstate);
+static TupleTableSlot *ExecHashJoinGetOuterTupleAtOffset(HashJoinState *hjstate, off_t offset);
+static void rewindOuter(BufFile *bufFile);
 
+static TupleTableSlot *
+emitUnmatchedOuterTuple(ExprState *otherqual, ExprContext *econtext, HashJoinState *hjstate);
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -198,6 +205,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	 */
 	for (;;)
 	{
+		bool done = false;
+
 		/*
 		 * It's possible to iterate this loop many times before returning a
 		 * tuple, in some pathological cases such as needing to move much of
@@ -209,7 +218,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 		switch (node->hj_JoinState)
 		{
 			case HJ_BUILD_HASHTABLE:
-
+				elog(DEBUG1, "HJ_BUILD_HASHTABLE");
 				/*
 				 * First time through: build hash table for inner relation.
 				 */
@@ -343,7 +352,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/* FALL THRU */
 
 			case HJ_NEED_NEW_OUTER:
-
+				elog(DEBUG1, "HJ_NEED_NEW_OUTER");
 				/*
 				 * We don't have an outer tuple, try to get the next one
 				 */
@@ -357,20 +366,29 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 				if (TupIsNull(outerTupleSlot))
 				{
-					/* end of batch, or maybe whole join */
+					/*
+					 * end of batch, or maybe whole join
+					 * for hashloop fallback, all we know is outer batch is exhausted
+					 * inner could have more chunks
+					 */
 					if (HJ_FILL_INNER(node))
 					{
 						/* set up to scan for unmatched inner tuples */
 						ExecPrepHashTableForUnmatched(node);
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+						break;
 					}
-					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					continue;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					break;
 				}
-
+				/*
+				 * only initialize this to false during the first chunk --
+				 * otherwise, we will be resetting hj_MatchedOuter
+				 * to false for an outer tuple that has already matched an inner tuple
+				 */
+				if (node->first_chunk || hashtable->curbatch == 0)
+					node->hj_MatchedOuter = false;
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -410,6 +428,48 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				/*
+				 * We need to construct the linked list of match statuses on the first chunk.
+				 * Note that node->first_chunk isn't true until HJ_NEED_NEW_BATCH
+				 * so this means that we don't construct this list on batch 0.
+				 * This list should also only be constructed for hashloop fallback
+				 */
+				if (node->first_chunk && hashtable->outerBatchFile && node->hashloop_fallback == true)
+				{
+					BufFile *outerFile = hashtable->outerBatchFile[batchno];
+
+					if (outerFile != NULL)
+					{
+						OuterOffsetMatchStatus *outerOffsetMatchStatus = NULL;
+
+						outerOffsetMatchStatus = palloc(sizeof(struct OuterOffsetMatchStatus));
+						outerOffsetMatchStatus->match_status = false;
+						outerOffsetMatchStatus->outer_tuple_start_offset = 0L;
+						outerOffsetMatchStatus->next = NULL;
+
+						if (node->first_outer_offset_match_status != NULL)
+						{
+							node->current_outer_offset_match_status->next = outerOffsetMatchStatus;
+							node->current_outer_offset_match_status = outerOffsetMatchStatus;
+						}
+						else /* node->first_outer_offset_match_status == NULL */
+						{
+							node->first_outer_offset_match_status = outerOffsetMatchStatus;
+							node->current_outer_offset_match_status = node->first_outer_offset_match_status;
+						}
+
+						outerOffsetMatchStatus->outer_tuple_val = DatumGetInt32(outerTupleSlot->tts_values[0]);
+						outerOffsetMatchStatus->outer_tuple_start_offset = node->HJ_NEED_NEW_OUTER_tup_start;
+					}
+				}
+				else if (node->hj_HashTable->curbatch > 0)
+				{
+					if (node->current_outer_offset_match_status == NULL)
+						node->current_outer_offset_match_status = node->first_outer_offset_match_status;
+					else
+						node->current_outer_offset_match_status = node->current_outer_offset_match_status->next;
+				}
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -417,28 +477,32 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 			case HJ_SCAN_BUCKET:
 
+				elog(DEBUG1, "HJ_SCAN_BUCKET");
 				/*
 				 * Scan the selected hash bucket for matches to current outer
 				 */
 				if (parallel)
-				{
-					if (!ExecParallelScanHashBucket(node, econtext))
-					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
-					}
-				}
+					done = !ExecParallelScanHashBucket(node, econtext);
 				else
+					done = !ExecScanHashBucket(node, econtext);
+
+				if (done)
 				{
-					if (!ExecScanHashBucket(node, econtext))
+					/*
+					 * The current outer tuple has run out of matches, so check
+					 * whether to emit a dummy outer-join tuple.  Whether we emit
+					 * one or not, the next state is NEED_NEW_OUTER.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+					if (node->hj_HashTable->curbatch == 0 || node->hashloop_fallback == false)
 					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
+						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+						if (slot != NULL)
+							return slot;
 					}
+					continue;
 				}
-
 				/*
 				 * We've got a match, but still need to test non-hashed quals.
 				 * ExecScanHashBucket already set up all the state needed to
@@ -455,6 +519,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				{
 					node->hj_MatchedOuter = true;
 					HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+					if (node->current_outer_offset_match_status)
+						node->current_outer_offset_match_status->match_status = true;
 
 					/* In an antijoin, we never return a matched tuple */
 					if (node->js.jointype == JOIN_ANTI)
@@ -480,33 +546,9 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					InstrCountFiltered1(node, 1);
 				break;
 
-			case HJ_FILL_OUTER_TUPLE:
-
-				/*
-				 * The current outer tuple has run out of matches, so check
-				 * whether to emit a dummy outer-join tuple.  Whether we emit
-				 * one or not, the next state is NEED_NEW_OUTER.
-				 */
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
-
-				if (!node->hj_MatchedOuter &&
-					HJ_FILL_OUTER(node))
-				{
-					/*
-					 * Generate a fake join tuple with nulls for the inner
-					 * tuple, and return it if it passes the non-join quals.
-					 */
-					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
-
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
-				}
-				break;
-
 			case HJ_FILL_INNER_TUPLES:
 
+				elog(DEBUG1, "HJ_FILL_INNER_TUPLES");
 				/*
 				 * We have finished a batch, but we are doing right/full join,
 				 * so any unmatched inner tuples in the hashtable have to be
@@ -515,7 +557,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
 					continue;
 				}
 
@@ -533,6 +575,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 			case HJ_NEED_NEW_BATCH:
 
+				elog(DEBUG1, "HJ_NEED_NEW_BATCH");
 				/*
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
@@ -543,10 +586,100 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-oblivious join */
+					if (node->first_outer_offset_match_status && HJ_FILL_OUTER(node) && node->hashloop_fallback == true)
+					{
+						/*
+						 * For hashloop fallback, outer tuples are not emitted
+						 * until directly before advancing the batch (after all inner
+						 * chunks have been processed). node->hashloop_fallback should be
+						 * true because it is not reset to false until advancing the batches
+						 */
+						node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT;
+						break;
+					}
+
+					if (!ExecHashJoinAdvanceBatch(node))
+						return NULL;    /* end of parallel-oblivious join */
+
+					rewindOuter(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]);
+					LoadInner(node);
+
+					/*
+					 * If we just loaded the first chunk of a new inner batch,
+					 * we should reset head of the list of outer tuple match statuses
+					 * so we can construct a new list for the new corresponding outer batch file
+					 * Doing it here works because we have not created any of the structs
+					 * of match statuses for the outer tuples until HJ_NEED_NEW_OUTER
+					 *
+					 * Even if we are not at the beginning of a new inner batch, we need
+					 * to reset the pointer to the current match status object for the current
+					 * outer tuple before transitioning to HJ_NEED_NEW_OUTER as a way of
+					 * rewinding the list.
+					 * we use the status of current -- NULL or non-NULL to determine in
+					 * HJ_NEED_NEW_OUTER if we should advance to the next item in the list or
+					 * "rewind" by setting head to current.
+					 */
+					if (node->first_chunk)
+						node->first_outer_offset_match_status = NULL;
+					node->current_outer_offset_match_status = NULL;
+				}
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				break;
+
+			case HJ_NEED_NEW_INNER_CHUNK:
+
+				elog(DEBUG1, "HJ_NEED_NEW_INNER_CHUNK");
+
+				// ?? Will inner_page_offset != 0 ever when curbatch == 0 ?
+				if (node->inner_page_offset == 0L || node->hj_HashTable->curbatch == 0) // inner batch is exhausted
+				{
+					/*
+					 * either it is the fallback case and there are no more chunks
+					 * or, there were never chunks because this is the non-fallback case
+					 * or this is batch 0
+					 * in any of these cases, load next batch
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				/*
+				 * Rewind outer batch file (if present), so that we can start reading it.
+				 */
+				rewindOuter(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]);
+				LoadInner(node);
+				node->current_outer_offset_match_status = NULL;
+				break;
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
+
+				node->cursor = node->first_outer_offset_match_status;
+				node->first_outer_offset_match_status = NULL;
+				node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER;
+				/* fall through */
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+				while (node->cursor)
+				{
+					if (node->cursor->match_status == true)
+					{
+						node->cursor = node->cursor->next;
+						continue;
+					}
+					/*
+					 * if it is not a match, go to the offset in the page that it specifies
+					 * and emit it NULL-extended
+					 */
+					econtext->ecxt_outertuple = ExecHashJoinGetOuterTupleAtOffset(node, node->cursor->outer_tuple_start_offset);
+					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					node->cursor = node->cursor->next;
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				}
+				node->cursor = NULL;
+				/*
+				 * came here from HJ_NEED_NEW_BATCH, so go back there
+				 */
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
 				break;
 
 			default:
@@ -628,6 +761,13 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->js.ps.ExecProcNode = ExecHashJoin;
 	hjstate->js.jointype = node->join.jointype;
 
+	hjstate->hashloop_fallback = false;
+	hjstate->inner_page_offset = 0L;
+	hjstate->first_chunk = false;
+	hjstate->HJ_NEED_NEW_OUTER_tup_start = 0L;
+	hjstate->HJ_NEED_NEW_OUTER_tup_end = 0L;
+	hjstate->current_outer_offset_match_status = NULL;
+	hjstate->first_outer_offset_match_status = NULL;
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -765,6 +905,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_MatchedOuter = false;
 	hjstate->hj_OuterNotEmpty = false;
 
+	hjstate->cursor = NULL;
+
 	return hjstate;
 }
 
@@ -805,6 +947,59 @@ ExecEndHashJoin(HashJoinState *node)
 	ExecEndNode(innerPlanState(node));
 }
 
+static TupleTableSlot *
+ExecHashJoinGetOuterTupleAtOffset(HashJoinState *hjstate, off_t offset)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32 hashvalue;
+
+	BufFile    *file = hashtable->outerBatchFile[curbatch];
+	/* ? should fileno always be 0? */
+	if (BufFileSeek(file, 0, offset, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+						errmsg("could not rewind hash-join temporary file: %m")));
+
+	slot = ExecHashJoinGetSavedTuple(hjstate,
+									 file,
+									 &hashvalue,
+									 hjstate->hj_OuterTupleSlot);
+	return slot;
+}
+
+static void rewindOuter(BufFile *bufFile)
+{
+	if (bufFile != NULL)
+	{
+		if (BufFileSeek(bufFile, 0, 0L, SEEK_SET))
+			ereport(ERROR,
+				(errcode_for_file_access(),
+					errmsg("could not rewind hash-join temporary file: %m")));
+	}
+}
+
+static TupleTableSlot *
+emitUnmatchedOuterTuple(ExprState *otherqual, ExprContext *econtext, HashJoinState *hjstate)
+{
+	if (hjstate->hj_MatchedOuter)
+		return NULL;
+
+	if (!HJ_FILL_OUTER(hjstate))
+		return NULL;
+
+	econtext->ecxt_innertuple = hjstate->hj_NullInnerTupleSlot;
+	/*
+	 * Generate a fake join tuple with nulls for the inner
+	 * tuple, and return it if it passes the non-join quals.
+	 */
+	if (otherqual == NULL || ExecQual(otherqual, econtext))
+		return ExecProject(hjstate->js.ps.ps_ProjInfo);
+
+	InstrCountFiltered2(hjstate, 1);
+	return NULL;
+}
 /*
  * ExecHashJoinOuterGetTuple
  *
@@ -951,20 +1146,17 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 }
 
 /*
- * ExecHashJoinNewBatch
+ * ExecHashJoinAdvanceBatch
  *		switch to a new hashjoin batch
  *
  * Returns true if successful, false if there are no more batches.
  */
 static bool
-ExecHashJoinNewBatch(HashJoinState *hjstate)
+ExecHashJoinAdvanceBatch(HashJoinState *hjstate)
 {
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
@@ -1039,10 +1231,32 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		curbatch++;
 	}
 
+	hjstate->inner_page_offset = 0L;
+	hjstate->first_chunk = true;
+	hjstate->hashloop_fallback = false; /* new batch, so start it off false */
 	if (curbatch >= nbatch)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	return true;
+}
+
+/*
+ * Returns true if there are more chunks left, false otherwise
+ */
+static bool LoadInner(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int curbatch = hashtable->curbatch;
+	BufFile    *innerFile;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+
+	off_t tup_start_offset;
+	off_t chunk_start_offset;
+	off_t tup_end_offset;
+	int64 current_saved_size;
+	int current_fileno;
 
 	/*
 	 * Reload the hash table with the new inner batch (which could be empty)
@@ -1051,45 +1265,58 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 
 	innerFile = hashtable->innerBatchFile[curbatch];
 
+	/*
+	 * Reset this even if the innerfile is not null
+	 */
+	hjstate->first_chunk = hjstate->inner_page_offset == 0L;
+
 	if (innerFile != NULL)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		/* should fileno always be 0? */
+		if (BufFileSeek(innerFile, 0, hjstate->inner_page_offset, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not rewind hash-join temporary file: %m")));
 
+		chunk_start_offset = hjstate->inner_page_offset;
+		tup_end_offset = hjstate->inner_page_offset;
 		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
 												 innerFile,
 												 &hashvalue,
 												 hjstate->hj_HashTupleSlot)))
 		{
+			/* next tuple's start is last tuple's end */
+			tup_start_offset = tup_end_offset;
+			/* after we got the tuple, figure out what the offset is */
+			BufFileTell(innerFile, &current_fileno, &tup_end_offset);
+			current_saved_size = tup_end_offset - chunk_start_offset;
+			if (current_saved_size > work_mem)
+			{
+				hjstate->inner_page_offset = tup_start_offset;
+				hjstate->hashloop_fallback = true;
+				return true;
+			}
+			hjstate->inner_page_offset = tup_end_offset;
 			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
+			 * NOTE: some tuples may be sent to future batches.
+			 * With current hashloop patch, however, it is not possible
+			 * for hashtable->nbatch to be increased here
 			 */
 			ExecHashTableInsert(hashtable, slot, hashvalue);
 		}
 
+		/* this is the end of the file */
+		hjstate->inner_page_offset = 0L;
+
 		/*
-		 * after we build the hash table, the inner batch file is no longer
+		 * after we processed all chunks, the inner batch file is no longer
 		 * needed
 		 */
 		BufFileClose(innerFile);
 		hashtable->innerBatchFile[curbatch] = NULL;
 	}
 
-	/*
-	 * Rewind outer batch file (if present), so that we can start reading it.
-	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
-	{
-		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
-	}
-
-	return true;
+	return false;
 }
 
 /*
@@ -1270,6 +1497,8 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 	uint32		header[2];
 	size_t		nread;
 	MinimalTuple tuple;
+	int dummy_fileno;
+
 
 	/*
 	 * We check for interrupts here because this is typically taken as an
@@ -1278,6 +1507,7 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 	 */
 	CHECK_FOR_INTERRUPTS();
 
+	BufFileTell(file, &dummy_fileno, &hjstate->HJ_NEED_NEW_OUTER_tup_start);
 	/*
 	 * Since both the hash value and the MinimalTuple length word are uint32,
 	 * we can read them both in one BufFileRead() call without any type
@@ -1304,6 +1534,7 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 				(errcode_for_file_access(),
 				 errmsg("could not read from hash-join temporary file: %m")));
 	ExecForceStoreMinimalTuple(tuple, tupleSlot, true);
+	BufFileTell(file, &dummy_fileno, &hjstate->HJ_NEED_NEW_OUTER_tup_end);
 	return tupleSlot;
 }
 
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 2c94b926d3..bd5aeba74c 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -59,6 +59,16 @@
  * if so, we just dump them out to the correct batch file.
  * ----------------------------------------------------------------
  */
+struct OuterOffsetMatchStatus;
+typedef struct OuterOffsetMatchStatus OuterOffsetMatchStatus;
+
+struct OuterOffsetMatchStatus
+{
+	bool match_status;
+	off_t outer_tuple_start_offset;
+	int32 outer_tuple_val;
+	struct OuterOffsetMatchStatus *next;
+};
 
 /* these are in nodes/execnodes.h: */
 /* typedef struct HashJoinTupleData *HashJoinTuple; */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..5edf48ae67 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -42,6 +42,8 @@ struct RangeTblEntry;			/* avoid including parsenodes.h here */
 struct ExprEvalStep;			/* avoid including execExpr.h everywhere */
 struct CopyMultiInsertBuffer;
 
+struct OuterOffsetMatchStatus;
+
 
 /* ----------------
  *		ExprState node
@@ -1899,6 +1901,16 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+
+	bool hashloop_fallback;
+	off_t inner_page_offset;
+	bool first_chunk;
+	struct OuterOffsetMatchStatus *first_outer_offset_match_status;
+	struct OuterOffsetMatchStatus *current_outer_offset_match_status;
+	struct OuterOffsetMatchStatus *cursor;
+
+	off_t HJ_NEED_NEW_OUTER_tup_start;
+	off_t HJ_NEED_NEW_OUTER_tup_end;
 } HashJoinState;
 
 
diff --git a/src/test/regress/expected/adaptive_hj.out b/src/test/regress/expected/adaptive_hj.out
new file mode 100644
index 0000000000..a687ecf759
--- /dev/null
+++ b/src/test/regress/expected/adaptive_hj.out
@@ -0,0 +1,402 @@
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t1(a int);
+create table t2(b int);
+insert into t1 values(1),(2);
+insert into t2 values(2),(3),(11);
+insert into t1 select i from generate_series(1,10)i;
+insert into t2 select i from generate_series(2,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+insert into t2 select 2 from generate_series(2,7)i;
+set work_mem=64;
+set enable_mergejoin to off;
+select * from t1 left outer join t2 on a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+  1 |   
+  1 |   
+(67 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+    67
+(1 row)
+
+select * from t1, t2 where a = b;
+ a  | b  
+----+----
+  5 |  5
+  3 |  3
+  3 |  3
+  4 |  4
+  7 |  7
+  6 |  6
+  9 |  9
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  8 |  8
+ 10 | 10
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+(65 rows)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+    65
+(1 row)
+
+select * from t1 right outer join t2 on a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+(66 rows)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+    66
+(1 row)
+
+select * from t1 full outer join t2 on a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+  1 |   
+  1 |   
+(68 rows)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+    68
+(1 row)
+
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+set work_mem=64;
+set enable_mergejoin to off;
+select * from t1 left outer join t2 on a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 1 |  
+(7 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+     7
+(1 row)
+
+select * from t1, t2 where a = b;
+ a | b 
+---+---
+ 3 | 3
+ 3 | 3
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+(6 rows)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+     6
+(1 row)
+
+select * from t1 right outer join t2 on a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+(7 rows)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+     7
+(1 row)
+
+select * from t1 full outer join t2 on a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+ 1 |  
+(8 rows)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+     8
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8fb55f045e..7492c2c45b 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan adaptive_hj
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a39ca1012a..17099bf604 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -91,6 +91,7 @@ test: subselect
 test: union
 test: case
 test: join
+test: adaptive_hj
 test: aggregates
 test: transactions
 ignore: random
diff --git a/src/test/regress/sql/adaptive_hj.sql b/src/test/regress/sql/adaptive_hj.sql
new file mode 100644
index 0000000000..76c041e6f1
--- /dev/null
+++ b/src/test/regress/sql/adaptive_hj.sql
@@ -0,0 +1,39 @@
+drop table if exists t1;
+drop table if exists t2;
+create table t1(a int);
+create table t2(b int);
+
+insert into t1 values(1),(2);
+insert into t2 values(2),(3),(11);
+insert into t1 select i from generate_series(1,10)i;
+insert into t2 select i from generate_series(2,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+insert into t2 select 2 from generate_series(2,7)i;
+set work_mem=64;
+set enable_mergejoin to off;
+
+select * from t1 left outer join t2 on a = b order by b;
+select count(*) from t1 left outer join t2 on a = b;
+select * from t1, t2 where a = b;
+select count(*) from t1, t2 where a = b;
+select * from t1 right outer join t2 on a = b order by b;
+select count(*) from t1 right outer join t2 on a = b;
+select * from t1 full outer join t2 on a = b order by b;
+select count(*) from t1 full outer join t2 on a = b;
+
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+
+set work_mem=64;
+set enable_mergejoin to off;
+
+select * from t1 left outer join t2 on a = b order by b;
+select count(*) from t1 left outer join t2 on a = b;
+select * from t1, t2 where a = b;
+select count(*) from t1, t2 where a = b;
+select * from t1 right outer join t2 on a = b order by b;
+select count(*) from t1 right outer join t2 on a = b;
+select * from t1 full outer join t2 on a = b order by b;
+select count(*) from t1 full outer join t2 on a = b;
-- 
2.22.0

#34

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Melanie Plageman (#33)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Wed, Jul 03, 2019 at 02:22:09PM -0700, Melanie Plageman wrote:

On Tue, Jun 18, 2019 at 3:24 PM Melanie Plageman <melanieplageman@gmail.com>
wrote:

These questions will probably make a lot more sense with corresponding
code, so I will follow up with the second version of the state machine
patch once I finish it.

I have changed the state machine and resolved the questions I had
raised in the previous email. This seems to work for the parallel and
non-parallel cases. I have not yet rewritten the unmatched outer tuple
status as a bitmap in a spill file (for ease of debugging).

Before doing that, I wanted to ask what a desirable fallback condition
would be. In this patch, fallback to hashloop join happens only when
inserting tuples into the hashtable after batch 0 when inserting
another tuple from the batch file would exceed work_mem. This means
you can't increase nbatches, which, I would think is undesirable.

Yes, I think that's undesirable.

I thought a bit about when fallback should happen. So, let's say that
we would like to fallback to hashloop join when we have increased
nbatches X times. At that point, since we do not want to fall back to
hashloop join for all batches, we have to make a decision. After
increasing nbatches the Xth time, do we then fall back for all batches
for which inserting inner tuples exceeds work_mem? Do we use this
strategy but work_mem + some fudge factor?

Or, do we instead try to determine if data skew led us to increase
nbatches both times and then determine which batch, given new
nbatches, contains that data, set fallback to true only for that
batch, and let all other batches use the existing logic (with no
fallback option) unless they contain a value which leads to increasing
nbatches X number of times?

I think we should try to detect the skew and use this hashloop logic
only for the one batch. That's based on the assumption that the hashloop
is less efficient than the regular hashjoin.

We may need to apply it even for some non-skewed (but misestimated)
cases, though. At some point we'd need more than work_mem for BufFiles,
at which point we ought to use this hashloop.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#35

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Tomas Vondra (#34)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

So, I've rewritten the patch to use a BufFile for the outer table
batch file tuples' match statuses and write bytes to and from the file
which start as 0 and, upon encountering a match for a tuple, I set its
bit in the file to 1 (also rebased with current master).

It, of course, only works for parallel-oblivious hashjoin -- it relies
on deterministic order of tuples encountered in the outer side batch
file to set the right match bit and uses a counter to decide which bit
to set.

I did the "needlessly dumb implementation" Robert mentioned, though,
I thought about it and couldn't come up with a much smarter way to
write match bits to a file. I think there might be an optimization
opportunity in not writing the current_byte to the file each time that
the outer tuple matches and only doing this once we have advanced to a
tuple number that wouldn't have its match bit in the current_byte. I
didn't do that to keep it simple, and, I suspect there might be a bit
of gymnastics needed to make sure that that byte is actually written
to the file in case we exit from some other state before we encounter
the tuple represented in the last bit in that byte.

I plan to work on a separate implementation for parallel hashjoin
next--to understand what is required. I believe the logic to decide
when to fall back should be fairly easy to slot in at the end once
we've decided what that logic is.

On Sat, Jul 13, 2019 at 4:44 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Wed, Jul 03, 2019 at 02:22:09PM -0700, Melanie Plageman wrote:

On Tue, Jun 18, 2019 at 3:24 PM Melanie Plageman <

melanieplageman@gmail.com>

Before doing that, I wanted to ask what a desirable fallback condition
would be. In this patch, fallback to hashloop join happens only when
inserting tuples into the hashtable after batch 0 when inserting
another tuple from the batch file would exceed work_mem. This means
you can't increase nbatches, which, I would think is undesirable.

Yes, I think that's undesirable.

I thought a bit about when fallback should happen. So, let's say that
we would like to fallback to hashloop join when we have increased
nbatches X times. At that point, since we do not want to fall back to
hashloop join for all batches, we have to make a decision. After
increasing nbatches the Xth time, do we then fall back for all batches
for which inserting inner tuples exceeds work_mem? Do we use this
strategy but work_mem + some fudge factor?

Or, do we instead try to determine if data skew led us to increase
nbatches both times and then determine which batch, given new
nbatches, contains that data, set fallback to true only for that
batch, and let all other batches use the existing logic (with no
fallback option) unless they contain a value which leads to increasing
nbatches X number of times?

I think we should try to detect the skew and use this hashloop logic
only for the one batch. That's based on the assumption that the hashloop
is less efficient than the regular hashjoin.

We may need to apply it even for some non-skewed (but misestimated)
cases, though. At some point we'd need more than work_mem for BufFiles,
at which point we ought to use this hashloop.

I have not yet changed the logic for deciding to fall back from
my original design. It will still only fall back for a given batch if
that batch's inner batch file doesn't fit in memory. I haven't,
however, changed the logic to allow it to increase the number of
batches some number of times or according to some criteria before
falling back for that batch.

--
Melanie Plageman

Attachments:

v3-0001-hashloop-fallback.patchapplication/octet-stream; name=v3-0001-hashloop-fallback.patchDownload

From 0808af61a2db37ea46d2cabef944bca48e1bc443 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 10 Jun 2019 10:54:42 -0700
Subject: [PATCH v3] hashloop fallback

First part is to "chunk" the inner file into arbitrary partitions of
work_mem size

This chunks inner file and makes it so that the offset is along tuple
bounds.

Note that this makes it impossible to increase nbatches during the
loading of batches after initial hashtable creation

In preparation for doing this chunking, separate advance batch and load
batch. advance batch only if page offset is reset to 0, then load that
part of the batch

Second part was to: implement outer tuple batch rewinding per chunk of
inner batch

Would be a simple rewind and replay of outer side for each chunk of
inner if it weren't for LOJ.
Because we need to wait to emit NULL-extended tuples for LOJ until after
all chunks of inner have been processed.

To do this without incurring additional memory pressure, use a temporary
Buffile to capture the match status of each outer side tuple. Use one
bit per tuple to represent the match status, and, since for
parallel-oblivious hashjoin the outer side tuples are encountered in a
deterministic order, synchronizing the outer tuples match status file
with the outer tuples in the batch file to decide which ones to emit
NULL-extended is easy and can be done with a simple counter.

For non-hashloop fallback scenario (including batch 0), this file is not
created and unmatched outer tuples should be emitted as they are
encountered.

OuterTupleMatchStatuses are in a file as a bitmap instead of in memory
---
 src/backend/executor/nodeHashjoin.c       | 445 ++++++++--
 src/backend/storage/file/buffile.c        |  25 +
 src/include/nodes/execnodes.h             |  13 +
 src/include/storage/buffile.h             |   3 +
 src/test/regress/expected/adaptive_hj.out | 960 ++++++++++++++++++++++
 src/test/regress/parallel_schedule        |   2 +-
 src/test/regress/serial_schedule          |   1 +
 src/test/regress/sql/adaptive_hj.sql      |  64 ++
 8 files changed, 1442 insertions(+), 71 deletions(-)
 create mode 100644 src/test/regress/expected/adaptive_hj.out
 create mode 100644 src/test/regress/sql/adaptive_hj.sql

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 8484a287e7..73cc6685e9 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -124,9 +124,11 @@
 #define HJ_BUILD_HASHTABLE		1
 #define HJ_NEED_NEW_OUTER		2
 #define HJ_SCAN_BUCKET			3
-#define HJ_FILL_OUTER_TUPLE		4
-#define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_FILL_INNER_TUPLES    4
+#define HJ_NEED_NEW_BATCH		5
+#define HJ_NEED_NEW_INNER_CHUNK 6
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT 7
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER 8
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +145,16 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
-static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
+
+static bool ExecHashJoinAdvanceBatch(HashJoinState *hjstate);
+static bool ExecHashJoinLoadInnerBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
 
+static BufFile *rewindOuterBatch(BufFile *bufFile);
+static TupleTableSlot *emitUnmatchedOuterTuple(ExprState *otherqual,
+											   ExprContext *econtext,
+											   HashJoinState *hjstate);
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -176,6 +184,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	int			batchno;
 	ParallelHashJoinState *parallel_state;
 
+	BufFile    *outerFileForAdaptiveRead;
+
 	/*
 	 * get information from HashJoin node
 	 */
@@ -198,6 +208,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	 */
 	for (;;)
 	{
+		bool outerTupleMatchesExhausted = false;
+
 		/*
 		 * It's possible to iterate this loop many times before returning a
 		 * tuple, in some pathological cases such as needing to move much of
@@ -210,6 +222,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 		{
 			case HJ_BUILD_HASHTABLE:
 
+				elog(DEBUG1, "HJ_BUILD_HASHTABLE");
 				/*
 				 * First time through: build hash table for inner relation.
 				 */
@@ -344,6 +357,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 			case HJ_NEED_NEW_OUTER:
 
+				elog(DEBUG1, "HJ_NEED_NEW_OUTER");
 				/*
 				 * We don't have an outer tuple, try to get the next one
 				 */
@@ -357,20 +371,34 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 				if (TupIsNull(outerTupleSlot))
 				{
-					/* end of batch, or maybe whole join */
+					/*
+					 * end of batch, or maybe whole join.
+					 * for hashloop fallback, all we know is outer batch is
+					 * exhausted. inner could have more chunks
+					 */
 					if (HJ_FILL_INNER(node))
 					{
 						/* set up to scan for unmatched inner tuples */
 						ExecPrepHashTableForUnmatched(node);
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+						break;
 					}
-					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					continue;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					break;
 				}
-
+				/*
+				 * for the hashloop fallback case,
+				 * only initialize hj_MatchedOuter to false during the first chunk.
+				 * otherwise, we will be resetting hj_MatchedOuter to false for
+				 * an outer tuple that has already matched an inner tuple.
+				 * also, hj_MatchedOuter should be set to false for batch 0.
+				 * there are no chunks for batch 0, and node->hj_InnerFirstChunk isn't
+				 * set to true until HJ_NEED_NEW_BATCH,
+				 * so need to handle batch 0 explicitly
+				 */
+				if (node->hashloop_fallback == false || node->hj_InnerFirstChunk || hashtable->curbatch == 0)
+					node->hj_MatchedOuter = false;
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -410,6 +438,57 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				if (hashtable->outerBatchFile == NULL)
+				{
+					node->hj_JoinState = HJ_SCAN_BUCKET;
+					break;
+				}
+
+				BufFile *outerFile = hashtable->outerBatchFile[batchno];
+				if (outerFile == NULL)
+				{
+					node->hj_JoinState = HJ_SCAN_BUCKET;
+					break;
+				}
+
+				if (node->hashloop_fallback == true)
+				{
+					/* first tuple of new batch */
+					if (node->hj_OuterMatchStatusesFile == NULL)
+					{
+						node->hj_OuterTupleCount = 0;
+						node->hj_OuterMatchStatusesFile = BufFileCreateTemp(false);
+					}
+
+					/* for fallback case, always increment tuple count */
+					node->hj_OuterTupleCount++;
+
+					/* Use the next byte on every 8th tuple */
+					if ((node->hj_OuterTupleCount - 1) % 8 == 0)
+					{
+						/*
+						 * first chunk of new batch, so write and initialize
+						 * enough bytes in the outer tuple match status file to
+						 * capture all tuples' match statuses
+						 */
+						if (node->hj_InnerFirstChunk)
+						{
+							node->hj_OuterCurrentByte = 0;
+							BufFileWrite(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+						}
+						/* otherwise, just read the next byte */
+						else
+							BufFileRead(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+					}
+
+					elog(DEBUG1,
+						 "in HJ_NEED_NEW_OUTER. batchno %i. val %i. read  byte %hhu. cur tup %li.",
+						 batchno,
+						 DatumGetInt32(outerTupleSlot->tts_values[0]),
+						 node->hj_OuterCurrentByte,
+						 node->hj_OuterTupleCount);
+				}
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -417,28 +496,32 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 			case HJ_SCAN_BUCKET:
 
+				elog(DEBUG1, "HJ_SCAN_BUCKET");
 				/*
 				 * Scan the selected hash bucket for matches to current outer
 				 */
 				if (parallel)
-				{
-					if (!ExecParallelScanHashBucket(node, econtext))
-					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
-					}
-				}
+					outerTupleMatchesExhausted = !ExecParallelScanHashBucket(node, econtext);
 				else
+					outerTupleMatchesExhausted = !ExecScanHashBucket(node, econtext);
+
+				if (outerTupleMatchesExhausted)
 				{
-					if (!ExecScanHashBucket(node, econtext))
+					/*
+					 * The current outer tuple has run out of matches, so check
+					 * whether to emit a dummy outer-join tuple.  Whether we emit
+					 * one or not, the next state is NEED_NEW_OUTER.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+					if (node->hj_HashTable->curbatch == 0 || node->hashloop_fallback == false)
 					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
+						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+						if (slot != NULL)
+							return slot;
 					}
+					continue;
 				}
-
 				/*
 				 * We've got a match, but still need to test non-hashed quals.
 				 * ExecScanHashBucket already set up all the state needed to
@@ -471,42 +554,44 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					if (node->js.single_match)
 						node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
-				}
-				else
-					InstrCountFiltered1(node, 1);
-				break;
+					/*
+					 * Set the match bit for this outer tuple in the match
+					 * status file
+					 */
+					if (node->hj_OuterMatchStatusesFile != NULL)
+					{
+						Assert(node->hashloop_fallback == true);
+						int byte_to_set = (node->hj_OuterTupleCount - 1) / 8;
+						int bit_to_set_in_byte = (node->hj_OuterTupleCount - 1) % 8;
 
-			case HJ_FILL_OUTER_TUPLE:
+						if (BufFileSeek(node->hj_OuterMatchStatusesFile, 0, byte_to_set, SEEK_SET) != 0)
+							elog(DEBUG1, "at beginning of file");
 
-				/*
-				 * The current outer tuple has run out of matches, so check
-				 * whether to emit a dummy outer-join tuple.  Whether we emit
-				 * one or not, the next state is NEED_NEW_OUTER.
-				 */
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+						node->hj_OuterCurrentByte = node->hj_OuterCurrentByte | (1 << bit_to_set_in_byte);
 
-				if (!node->hj_MatchedOuter &&
-					HJ_FILL_OUTER(node))
-				{
-					/*
-					 * Generate a fake join tuple with nulls for the inner
-					 * tuple, and return it if it passes the non-join quals.
-					 */
-					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+						elog(DEBUG1,
+								"in HJ_SCAN_BUCKET.    batchno %i. val %i. write byte %hhu. cur tup %li. bitnum %i. bytenum %i.",
+								node->hj_HashTable->curbatch,
+								DatumGetInt32(econtext->ecxt_outertuple->tts_values[0]),
+								node->hj_OuterCurrentByte,
+								node->hj_OuterTupleCount,
+								bit_to_set_in_byte,
+								byte_to_set);
 
+						BufFileWrite(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+					}
 					if (otherqual == NULL || ExecQual(otherqual, econtext))
 						return ExecProject(node->js.ps.ps_ProjInfo);
 					else
 						InstrCountFiltered2(node, 1);
 				}
+				else
+					InstrCountFiltered1(node, 1);
 				break;
 
 			case HJ_FILL_INNER_TUPLES:
 
+				elog(DEBUG1, "HJ_FILL_INNER_TUPLES");
 				/*
 				 * We have finished a batch, but we are doing right/full join,
 				 * so any unmatched inner tuples in the hashtable have to be
@@ -515,7 +600,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
 					continue;
 				}
 
@@ -533,6 +618,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 			case HJ_NEED_NEW_BATCH:
 
+				elog(DEBUG1, "HJ_NEED_NEW_BATCH");
 				/*
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
@@ -543,12 +629,156 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-oblivious join */
+					/*
+					 * for batches after batch 0 for which hashloop_fallback is
+					 * true, if inner is exhausted, need to consider emitting
+					 * unmatched tuples we should never get here when
+					 * hashloop_fallback is false but hj_InnerExhausted is true,
+					 * however, it felt more clear to check for
+					 * hashloop_fallback explicitly
+					 */
+					if (node->hashloop_fallback == true && HJ_FILL_OUTER(node) && node->hj_InnerExhausted == true)
+					{
+						/*
+						 * For hashloop fallback, outer tuples are not emitted
+						 * until directly before advancing the batch (after all
+						 * inner chunks have been processed).
+						 * node->hashloop_fallback should be true because it is
+						 * not reset to false until advancing the batches
+						 */
+						node->hj_InnerExhausted = false;
+						node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT;
+						break;
+					}
+
+					if (!ExecHashJoinAdvanceBatch(node))
+						return NULL;    /* end of parallel-oblivious join */
+
+					if (rewindOuterBatch(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]) != NULL)
+						ExecHashJoinLoadInnerBatch(node); /* TODO: should I ever load inner when outer file is not present? */
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
+			case HJ_NEED_NEW_INNER_CHUNK:
+
+				elog(DEBUG1, "HJ_NEED_NEW_INNER_CHUNK");
+
+				/*
+				 * there were never chunks because this is the normal case (not
+				 * hashloop fallback) or this is batch 0. batch 0 cannot have
+				 * chunks. hashloop_fallback should always be false when
+				 * curbatch is 0 here. proceed to HJ_NEED_NEW_BATCH to either
+				 * advance to the next batch or complete the join
+				 */
+				if (node->hj_HashTable->curbatch == 0)
+				{
+					Assert(node->hashloop_fallback == false);
+					if(node->hj_InnerPageOffset != 0L)
+						elog(NOTICE, "hj_InnerPageOffset is not reset to 0 on batch 0");
+				}
+
+				if (node->hashloop_fallback == false)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/*
+				 * it is the hashloop fallback case and there are no more chunks
+				 * inner is exhausted, so we must advance the batches
+				 */
+				if (node->hj_InnerPageOffset == 0L)
+				{
+					node->hj_InnerExhausted = true;
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/*
+				 * This is the hashloop fallback case and we have more chunks in
+				 * inner. curbatch > 0. Rewind outer batch file (if present) so
+				 * that we can start reading it. Rewind outer match statuses
+				 * file if present so that we can set match bits as needed Reset
+				 * the tuple count and load the next chunk of inner. Then
+				 * proceed to get a new outer tuple from our rewound outer batch
+				 * file
+				 */
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				if (rewindOuterBatch(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]) == NULL)
+					break; /* TODO: Is breaking here the right thing to do when outer file is not present? */
+				rewindOuterBatch(node->hj_OuterMatchStatusesFile);
+				node->hj_OuterTupleCount = 0;
+				ExecHashJoinLoadInnerBatch(node);
+				break;
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
+
+				elog(DEBUG1, "HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT");
+
+				node->hj_OuterTupleCount = 0;
+				rewindOuterBatch(node->hj_OuterMatchStatusesFile);
+
+				/* TODO: is it okay to use the hashtable to get the outer batch file here? */
+				outerFileForAdaptiveRead = hashtable->outerBatchFile[hashtable->curbatch];
+				if (outerFileForAdaptiveRead == NULL) /* TODO: could this happen */
+				{
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+				rewindOuterBatch(outerFileForAdaptiveRead);
+
+				node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER;
+				/* fall through */
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+
+				elog(DEBUG1, "HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER");
+
+				outerFileForAdaptiveRead = hashtable->outerBatchFile[hashtable->curbatch];
+
+				while (true)
+				{
+					uint32 unmatchedOuterHashvalue;
+					TupleTableSlot *temp = ExecHashJoinGetSavedTuple(node, outerFileForAdaptiveRead, &unmatchedOuterHashvalue, node->hj_OuterTupleSlot);
+					node->hj_OuterTupleCount++;
+
+					if (temp == NULL)
+					{
+						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						break;
+					}
+
+					unsigned char bit = (node->hj_OuterTupleCount - 1) % 8;
+
+					/* need to read the next byte */
+					if (bit == 0)
+						BufFileRead(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+
+					elog(DEBUG1, "in HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER. batchno %i. val %i. num %li. bitnum %hhu. current byte %hhu.",
+						 node->hj_HashTable->curbatch,
+						 DatumGetInt32(temp->tts_values[0]),
+						 node->hj_OuterTupleCount,
+						 bit,
+						 node->hj_OuterCurrentByte);
+
+					/* if the match bit is set for this tuple, continue */
+					if ((node->hj_OuterCurrentByte >> bit) & 1)
+						continue;
+					/*
+					 * if it is not a match
+					 * emit it NULL-extended
+					 */
+					econtext->ecxt_outertuple = temp;
+					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				}
+
+				/* came here from HJ_NEED_NEW_BATCH, so go back there */
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				break;
+
 			default:
 				elog(ERROR, "unrecognized hashjoin state: %d",
 					 (int) node->hj_JoinState);
@@ -628,6 +858,14 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->js.ps.ExecProcNode = ExecHashJoin;
 	hjstate->js.jointype = node->join.jointype;
 
+	hjstate->hashloop_fallback = false;
+	hjstate->hj_InnerPageOffset = 0L;
+	hjstate->hj_InnerFirstChunk = false;
+	hjstate->hj_OuterCurrentByte = 0;
+
+	hjstate->hj_OuterMatchStatusesFile = NULL;
+	hjstate->hj_OuterTupleCount  = 0;
+	hjstate->hj_InnerExhausted = false;
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -805,6 +1043,40 @@ ExecEndHashJoin(HashJoinState *node)
 	ExecEndNode(innerPlanState(node));
 }
 
+static BufFile *rewindOuterBatch(BufFile *bufFile)
+{
+	if (bufFile != NULL)
+	{
+		if (BufFileSeek(bufFile, 0, 0L, SEEK_SET))
+			ereport(ERROR,
+				(errcode_for_file_access(),
+					errmsg("could not rewind hash-join temporary file: %m")));
+		return bufFile;
+	}
+	return NULL;
+}
+
+static TupleTableSlot *
+emitUnmatchedOuterTuple(ExprState *otherqual, ExprContext *econtext, HashJoinState *hjstate)
+{
+	if (hjstate->hj_MatchedOuter)
+		return NULL;
+
+	if (!HJ_FILL_OUTER(hjstate))
+		return NULL;
+
+	econtext->ecxt_innertuple = hjstate->hj_NullInnerTupleSlot;
+	/*
+	 * Generate a fake join tuple with nulls for the inner
+	 * tuple, and return it if it passes the non-join quals.
+	 */
+	if (otherqual == NULL || ExecQual(otherqual, econtext))
+		return ExecProject(hjstate->js.ps.ps_ProjInfo);
+
+	InstrCountFiltered2(hjstate, 1);
+	return NULL;
+}
+
 /*
  * ExecHashJoinOuterGetTuple
  *
@@ -951,20 +1223,17 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 }
 
 /*
- * ExecHashJoinNewBatch
+ * ExecHashJoinAdvanceBatch
  *		switch to a new hashjoin batch
  *
  * Returns true if successful, false if there are no more batches.
  */
 static bool
-ExecHashJoinNewBatch(HashJoinState *hjstate)
+ExecHashJoinAdvanceBatch(HashJoinState *hjstate)
 {
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
@@ -1039,10 +1308,35 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		curbatch++;
 	}
 
+	hjstate->hj_InnerPageOffset = 0L;
+	hjstate->hj_InnerFirstChunk = true;
+	hjstate->hashloop_fallback = false; /* new batch, so start it off false */
+	if (hjstate->hj_OuterMatchStatusesFile != NULL)
+		BufFileClose(hjstate->hj_OuterMatchStatusesFile);
+	hjstate->hj_OuterMatchStatusesFile = NULL;
 	if (curbatch >= nbatch)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	return true;
+}
+
+/*
+ * Returns true if there are more chunks left, false otherwise
+ */
+static bool ExecHashJoinLoadInnerBatch(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int curbatch = hashtable->curbatch;
+	BufFile    *innerFile;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+
+	off_t tup_start_offset;
+	off_t chunk_start_offset;
+	off_t tup_end_offset;
+	int64 current_saved_size;
+	int current_fileno;
 
 	/*
 	 * Reload the hash table with the new inner batch (which could be empty)
@@ -1051,45 +1345,56 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 
 	innerFile = hashtable->innerBatchFile[curbatch];
 
+	/* Reset this even if the innerfile is not null */
+	hjstate->hj_InnerFirstChunk = hjstate->hj_InnerPageOffset == 0L;
+
 	if (innerFile != NULL)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		/* TODO: should fileno always be 0? */
+		if (BufFileSeek(innerFile, 0, hjstate->hj_InnerPageOffset, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not rewind hash-join temporary file: %m")));
 
+		chunk_start_offset = hjstate->hj_InnerPageOffset;
+		tup_end_offset = hjstate->hj_InnerPageOffset;
 		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
 												 innerFile,
 												 &hashvalue,
 												 hjstate->hj_HashTupleSlot)))
 		{
+			/* next tuple's start is last tuple's end */
+			tup_start_offset = tup_end_offset;
+			/* after we got the tuple, figure out what the offset is */
+			BufFileTell(innerFile, &current_fileno, &tup_end_offset);
+			current_saved_size = tup_end_offset - chunk_start_offset;
+			if (current_saved_size > work_mem)
+			{
+				hjstate->hj_InnerPageOffset = tup_start_offset;
+				hjstate->hashloop_fallback = true;
+				return true;
+			}
+			hjstate->hj_InnerPageOffset = tup_end_offset;
 			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
+			 * NOTE: some tuples may be sent to future batches.
+			 * With current hashloop patch, however, it is not possible
+			 * for hashtable->nbatch to be increased here
 			 */
 			ExecHashTableInsert(hashtable, slot, hashvalue);
 		}
 
+		/* this is the end of the file */
+		hjstate->hj_InnerPageOffset = 0L;
+
 		/*
-		 * after we build the hash table, the inner batch file is no longer
+		 * after we processed all chunks, the inner batch file is no longer
 		 * needed
 		 */
 		BufFileClose(innerFile);
 		hashtable->innerBatchFile[curbatch] = NULL;
 	}
 
-	/*
-	 * Rewind outer batch file (if present), so that we can start reading it.
-	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
-	{
-		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
-	}
-
-	return true;
+	return false;
 }
 
 /*
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index b40e6f3fde..ed5d663b17 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -203,6 +203,9 @@ BufFileCreateTemp(bool interXact)
 	file = makeBufFile(pfile);
 	file->isInterXact = interXact;
 
+	if (file->files[0] == 0)
+		elog(NOTICE, "file is 0");
+
 	return file;
 }
 
@@ -737,6 +740,18 @@ BufFileTell(BufFile *file, int *fileno, off_t *offset)
 	*offset = file->curOffset + file->pos;
 }
 
+int
+BufFileTellPos(BufFile *file)
+{
+	return file->pos;
+}
+
+off_t
+BufFileTellOffset(BufFile *file)
+{
+	return file->curOffset;
+}
+
 /*
  * BufFileSeekBlock --- block-oriented seek
  *
@@ -801,6 +816,16 @@ BufFileSize(BufFile *file)
 		lastFileSize;
 }
 
+int64
+BufFileBytesUsed(BufFile *file)
+{
+	int64 lastFileSize = FileSize(file->files[file->numFiles - 1]);
+	if (lastFileSize >= 0)
+		return lastFileSize;
+	else
+		return 0;
+}
+
 /*
  * Append the contents of source file (managed within shared fileset) to
  * end of target file (managed within same shared fileset).
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98bdcbcef5..efac63ca2e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -14,6 +14,7 @@
 #ifndef EXECNODES_H
 #define EXECNODES_H
 
+#include <storage/buffile.h>
 #include "access/tupconvert.h"
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
@@ -1899,6 +1900,18 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+
+	/* hashloop fallback */
+	bool hashloop_fallback;
+	/* hashloop fallback inner side */
+	bool hj_InnerFirstChunk;
+	bool hj_InnerExhausted;
+	off_t hj_InnerPageOffset;
+
+	/* hashloop fallback outer side */
+	unsigned char hj_OuterCurrentByte;
+	BufFile *hj_OuterMatchStatusesFile;
+	int64 hj_OuterTupleCount;
 } HashJoinState;
 
 
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 1fba404fe2..74ee0f292d 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -42,8 +42,11 @@ extern size_t BufFileRead(BufFile *file, void *ptr, size_t size);
 extern size_t BufFileWrite(BufFile *file, void *ptr, size_t size);
 extern int	BufFileSeek(BufFile *file, int fileno, off_t offset, int whence);
 extern void BufFileTell(BufFile *file, int *fileno, off_t *offset);
+extern int BufFileTellPos(BufFile *file);
+extern off_t BufFileTellOffset(BufFile *file);
 extern int	BufFileSeekBlock(BufFile *file, long blknum);
 extern int64 BufFileSize(BufFile *file);
+int64 BufFileBytesUsed(BufFile *file);
 extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
diff --git a/src/test/regress/expected/adaptive_hj.out b/src/test/regress/expected/adaptive_hj.out
new file mode 100644
index 0000000000..7a33316bfe
--- /dev/null
+++ b/src/test/regress/expected/adaptive_hj.out
@@ -0,0 +1,960 @@
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t1(a int);
+create table t2(b int);
+insert into t1 values(1),(2);
+insert into t2 values(2),(3),(11);
+insert into t1 select i from generate_series(1,10)i;
+insert into t2 select i from generate_series(2,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+insert into t2 select 2 from generate_series(2,7)i;
+set work_mem=64;
+set enable_mergejoin to off;
+select * from t1 left outer join t2 on a = b order by a;
+ a  | b  
+----+----
+  1 |   
+  1 |   
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+(67 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+    67
+(1 row)
+
+select * from t1, t2 where a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+(65 rows)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+    65
+(1 row)
+
+select * from t1 right outer join t2 on a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+(66 rows)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+    66
+(1 row)
+
+select * from t1 full outer join t2 on a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+  1 |   
+  1 |   
+(68 rows)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+    68
+(1 row)
+
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+select * from t1 left outer join t2 on a = b order by a;
+ a | b 
+---+---
+ 1 |  
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+(7 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+     7
+(1 row)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+(6 rows)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+     6
+(1 row)
+
+select * from t1 right outer join t2 on a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+(7 rows)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+     7
+(1 row)
+
+select * from t1 full outer join t2 on a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+ 1 |  
+(8 rows)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+     8
+(1 row)
+
+truncate table t1;
+insert into t1 values(1),(1);
+insert into t1 select 2 from generate_series(1,7)i;
+insert into t1 select i from generate_series(3,10)i;
+truncate table t2;
+insert into t2 select 2 from generate_series(1,7)i;
+insert into t2 values(3),(3);
+insert into t2 select i from generate_series(5,9)i;
+select * from t1 left outer join t2 on a = b order by a;
+ a  | b 
+----+---
+  1 |  
+  1 |  
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  3 | 3
+  3 | 3
+  4 |  
+  5 | 5
+  6 | 6
+  7 | 7
+  8 | 8
+  9 | 9
+ 10 |  
+(60 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+    60
+(1 row)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+    56
+(1 row)
+
+select * from t1 right outer join t2 on a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+    56
+(1 row)
+
+select * from t1 full outer join t2 on a = b order by b;
+ a  | b 
+----+---
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  3 | 3
+  3 | 3
+  5 | 5
+  6 | 6
+  7 | 7
+  8 | 8
+  9 | 9
+ 10 |  
+  4 |  
+  1 |  
+  1 |  
+(60 rows)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+    60
+(1 row)
+
+select * from t2 left outer join t1 on a = b order by a;
+ b | a 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select count(*) from t2 left outer join t1 on a = b;
+ count 
+-------
+    56
+(1 row)
+
+select * from t2, t1 where a = b order by b;
+ b | a 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select count(*) from t2, t1 where a = b;
+ count 
+-------
+    56
+(1 row)
+
+select * from t2 right outer join t1 on a = b order by b;
+ b | a  
+---+----
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 3 |  3
+ 3 |  3
+ 5 |  5
+ 6 |  6
+ 7 |  7
+ 8 |  8
+ 9 |  9
+   | 10
+   |  4
+   |  1
+   |  1
+(60 rows)
+
+select count(*) from t2 right outer join t1 on a = b;
+ count 
+-------
+    60
+(1 row)
+
+select * from t2 full outer join t1 on a = b order by b;
+ b | a  
+---+----
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 3 |  3
+ 3 |  3
+ 5 |  5
+ 6 |  6
+ 7 |  7
+ 8 |  8
+ 9 |  9
+   | 10
+   |  4
+   |  1
+   |  1
+(60 rows)
+
+select count(*) from t2 full outer join t1 on a = b;
+ count 
+-------
+    60
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8fb55f045e..7492c2c45b 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan adaptive_hj
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a39ca1012a..17099bf604 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -91,6 +91,7 @@ test: subselect
 test: union
 test: case
 test: join
+test: adaptive_hj
 test: aggregates
 test: transactions
 ignore: random
diff --git a/src/test/regress/sql/adaptive_hj.sql b/src/test/regress/sql/adaptive_hj.sql
new file mode 100644
index 0000000000..7e74aac603
--- /dev/null
+++ b/src/test/regress/sql/adaptive_hj.sql
@@ -0,0 +1,64 @@
+drop table if exists t1;
+drop table if exists t2;
+create table t1(a int);
+create table t2(b int);
+
+insert into t1 values(1),(2);
+insert into t2 values(2),(3),(11);
+insert into t1 select i from generate_series(1,10)i;
+insert into t2 select i from generate_series(2,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+insert into t2 select 2 from generate_series(2,7)i;
+
+set work_mem=64;
+set enable_mergejoin to off;
+
+select * from t1 left outer join t2 on a = b order by a;
+select count(*) from t1 left outer join t2 on a = b;
+select * from t1, t2 where a = b order by b;
+select count(*) from t1, t2 where a = b;
+select * from t1 right outer join t2 on a = b order by b;
+select count(*) from t1 right outer join t2 on a = b;
+select * from t1 full outer join t2 on a = b order by b;
+select count(*) from t1 full outer join t2 on a = b;
+
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+
+select * from t1 left outer join t2 on a = b order by a;
+select count(*) from t1 left outer join t2 on a = b;
+select * from t1, t2 where a = b order by b;
+select count(*) from t1, t2 where a = b;
+select * from t1 right outer join t2 on a = b order by b;
+select count(*) from t1 right outer join t2 on a = b;
+select * from t1 full outer join t2 on a = b order by b;
+select count(*) from t1 full outer join t2 on a = b;
+
+truncate table t1;
+insert into t1 values(1),(1);
+insert into t1 select 2 from generate_series(1,7)i;
+insert into t1 select i from generate_series(3,10)i;
+truncate table t2;
+insert into t2 select 2 from generate_series(1,7)i;
+insert into t2 values(3),(3);
+insert into t2 select i from generate_series(5,9)i;
+
+select * from t1 left outer join t2 on a = b order by a;
+select count(*) from t1 left outer join t2 on a = b;
+select * from t1, t2 where a = b order by b;
+select count(*) from t1, t2 where a = b;
+select * from t1 right outer join t2 on a = b order by b;
+select count(*) from t1 right outer join t2 on a = b;
+select * from t1 full outer join t2 on a = b order by b;
+select count(*) from t1 full outer join t2 on a = b;
+
+select * from t2 left outer join t1 on a = b order by a;
+select count(*) from t2 left outer join t1 on a = b;
+select * from t2, t1 where a = b order by b;
+select count(*) from t2, t1 where a = b;
+select * from t2 right outer join t1 on a = b order by b;
+select count(*) from t2 right outer join t1 on a = b;
+select * from t2 full outer join t1 on a = b order by b;
+select count(*) from t2 full outer join t1 on a = b;
-- 
2.22.0

#36

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#35)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jul 30, 2019 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I did the "needlessly dumb implementation" Robert mentioned, though,
I thought about it and couldn't come up with a much smarter way to
write match bits to a file. I think there might be an optimization
opportunity in not writing the current_byte to the file each time that
the outer tuple matches and only doing this once we have advanced to a
tuple number that wouldn't have its match bit in the current_byte. I
didn't do that to keep it simple, and, I suspect there might be a bit
of gymnastics needed to make sure that that byte is actually written
to the file in case we exit from some other state before we encounter
the tuple represented in the last bit in that byte.

I mean, I was assuming we'd write in like 8kB blocks or something.
Doing it a byte at a time seems like it'd produce way too many
syscals.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Robert Haas (#36)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jul 30, 2019 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 30, 2019 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I did the "needlessly dumb implementation" Robert mentioned, though,
I thought about it and couldn't come up with a much smarter way to
write match bits to a file. I think there might be an optimization
opportunity in not writing the current_byte to the file each time that
the outer tuple matches and only doing this once we have advanced to a
tuple number that wouldn't have its match bit in the current_byte. I
didn't do that to keep it simple, and, I suspect there might be a bit
of gymnastics needed to make sure that that byte is actually written
to the file in case we exit from some other state before we encounter
the tuple represented in the last bit in that byte.

I mean, I was assuming we'd write in like 8kB blocks or something.
Doing it a byte at a time seems like it'd produce way too many
syscals.

For the actual write to disk, I'm pretty sure I get that for free from
the BufFile API, no?
I was more thinking about optimizing when I call BufFileWrite at all.

--
Melanie Plageman

#38

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Melanie Plageman (#37)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jul 30, 2019 at 8:07 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

For the actual write to disk, I'm pretty sure I get that for free from
the BufFile API, no?
I was more thinking about optimizing when I call BufFileWrite at all.

Right. Clearly several existing buffile.c users regularly have very
small BufFileWrite() size arguments. tuplestore.c, for one.

--
Peter Geoghegan

#39

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Melanie Plageman (#35)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Wed, Jul 31, 2019 at 6:47 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

So, I've rewritten the patch to use a BufFile for the outer table
batch file tuples' match statuses and write bytes to and from the file
which start as 0 and, upon encountering a match for a tuple, I set its
bit in the file to 1 (also rebased with current master).

It, of course, only works for parallel-oblivious hashjoin -- it relies
on deterministic order of tuples encountered in the outer side batch
file to set the right match bit and uses a counter to decide which bit
to set.

I did the "needlessly dumb implementation" Robert mentioned, though,
I thought about it and couldn't come up with a much smarter way to
write match bits to a file. I think there might be an optimization
opportunity in not writing the current_byte to the file each time that
the outer tuple matches and only doing this once we have advanced to a
tuple number that wouldn't have its match bit in the current_byte. I
didn't do that to keep it simple, and, I suspect there might be a bit
of gymnastics needed to make sure that that byte is actually written
to the file in case we exit from some other state before we encounter
the tuple represented in the last bit in that byte.

Thanks for working on this! I plan to poke at it a bit in the next few weeks.

I plan to work on a separate implementation for parallel hashjoin
next--to understand what is required. I believe the logic to decide
when to fall back should be fairly easy to slot in at the end once
we've decided what that logic is.

Seems like a good time for me to try to summarise what I think the
main problems are here:

1. The match-bit storage problem already discussed. The tuples that
each process receives while reading from SharedTupleStore are
non-deterministic (like other parallel scans). To use a bitmap-based
approach, I guess we'd need to invent some way to give the tuples a
stable identifier within some kind of densely packed number space that
we could use to address the bitmap, or take the IO hit and write all
the tuples back. That might involve changing the way SharedTupleStore
holds data.

2. Tricky problems relating to barriers and flow control. First, let
me explain why PHJ doesn't support full/right outer joins yet. At
first I thought it was going to be easy, because, although the shared
memory hash table is read-only after it has been built, it seems safe
to weaken that only slightly and let the match flag be set by any
process during probing: it's OK if two processes clobber each other's
writes, as the only transition is a single bit going strictly from 0
to 1, and there will certainly be a full memory barrier before anyone
tries to read those match bits. Then during the scan for unmatched,
you just have to somehow dole out hash table buckets or ranges of
buckets to processes on a first-come-first-served basis. But.... then
I crashed into the following problem:

* You can't begin the scan for unmatched tuples until every process
has finished probing (ie until you have the final set of match bits).
* You can't wait for every process to finish probing, because any
process that has emitted a tuple might never come back if there is
another node that is also waiting for all processes (ie deadlock
against another PHJ doing the same thing), and probing is a phase that
emits tuples.

Generally, it's not safe to emit tuples while you are attached to a
Barrier, unless you're only going to detach from it, not wait at it,
because emitting tuples lets the program counter escape your control.
Generally, it's not safe to detach from a Barrier while accessing
resources whose lifetime it controls, such as a hash table, because
then it might go away underneath you.

The PHJ plans that are supported currently adhere to that programming
rule and so don't have a problem: after the Barrier reaches the
probing phase, processes never wait for each other again so they're
free to begin emitting tuples. They just detach when they're done
probing, and the last to detach cleans up (frees the hash table etc).
If there is more than one batch, they detach from one batch and attach
to another when they're ready (each batch has its own Barrier), so we
can consider the batches to be entirely independent.

There is probably a way to make a scan-for-unmatched-inner phase work,
possibly involving another Barrier or something like that, but I ran
out of time trying to figure it out and wanted to ship a working PHJ
for the more common plan types. I suppose PHLJ will face two variants
of this problem: (1) you need to synchronise the loops (you can't dump
the hash table in preparation for the next loop until all have
finished probing for the current loop), and yet you've already emitted
tuples, so you're not allowed to wait for other processes and they're
not allowed to wait for you, and (2) you can't start the
scan-for-unmatched-outer until all the probe loops belonging to one
batch are done. The first problem is sort of analogous to a problem I
faced with batches in the first place, which Robert and I found a
solution to by processing the batches in parallel, and could perhaps
be solved in the same way: run the loops in parallel (if that sounds
crazy, recall that every worker has its own quota of work_mem and the
data is entirely prepartitioned up front, which is why we are able to
run the batches in parallel; in constrast, single-batch mode makes a
hash table with a quota of nparticipants * work_mem). The second
problem is sort of analogous to the existing scan-for-unmatched-inner
problem that I haven't solved.

I think there may be ways to make that general class of deadlock
problem go away in a future asynchronous executor model where N
streams conceptually run concurrently in event-driven nodes so that
control never gets stuck in a node, but that seems quite far off and I
haven't worked out the details. The same problem comes up in a
hypothetical Parallel Repartition node: you're not done with your
partition until all processes have run out of input tuples, so you
have to wait for all of them to send an EOF, so you risk deadlock if
they are waiting for you elsewhere in the tree. A stupid version of
the idea is to break the node up into a consumer part and a producer
part, and put the producer into a subprocess so that its program
counter can never escape and deadlock somewhere in the consumer part
of the plan. Obviously we don't want to have loads of extra OS
processes all over the place, but I think you can get the same effect
using a form of asynchronous execution where the program counter jumps
between nodes and streams based on readiness, and yields control
instead of blocking. Similar ideas have been proposed to deal with
asynchronous IO.

--
Thomas Munro
https://enterprisedb.com

#40

Melanie Plageman

melanieplageman@gmail.com

over 6 years ago

In reply to: Thomas Munro (#39)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, Sep 5, 2019 at 10:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Seems like a good time for me to try to summarise what I think the
main problems are here:

1. The match-bit storage problem already discussed. The tuples that
each process receives while reading from SharedTupleStore are
non-deterministic (like other parallel scans). To use a bitmap-based
approach, I guess we'd need to invent some way to give the tuples a
stable identifier within some kind of densely packed number space that
we could use to address the bitmap, or take the IO hit and write all
the tuples back. That might involve changing the way SharedTupleStore
holds data.

This I've dealt with by adding a tuplenum to the SharedTupleStore
itself which I atomically increment in sts_puttuple().
In ExecParallelHashJoinPartitionOuter(), as each worker writes tuples
to the batch files, they call sts_puttuple() and this increments the
number so each tuple has a unique number.
For persisting this number, I added the tuplenum to the meta data
section of the MinimalTuple (along with the hashvalue -- there was a
comment about this meta data that said it could be used for other
things in the future, so this seemed like a good place to put it) and
write that out to the batch file.

At the end of ExecParallelHashJoinPartitionOuter(), I make the outer
match status bitmap file. I use the final tuplenum count to determine
the number of bytes to write to it. Each worker has a file with a
bitmap which has the number of bytes required to represent the number
of tuples in that batch.

Because one worker may beat the other(s) and build the whole batch
file for a batch before the others have a chance, I also make the
outer match status bitmap file for workers who missed out in
ExecParallelHashJoinOuterGetTuple() using the final tuplenum as well.

2. Tricky problems relating to barriers and flow control. First, let
me explain why PHJ doesn't support full/right outer joins yet. At
first I thought it was going to be easy, because, although the shared
memory hash table is read-only after it has been built, it seems safe
to weaken that only slightly and let the match flag be set by any
process during probing: it's OK if two processes clobber each other's
writes, as the only transition is a single bit going strictly from 0
to 1, and there will certainly be a full memory barrier before anyone
tries to read those match bits. Then during the scan for unmatched,
you just have to somehow dole out hash table buckets or ranges of
buckets to processes on a first-come-first-served basis. But.... then
I crashed into the following problem:

* You can't begin the scan for unmatched tuples until every process
has finished probing (ie until you have the final set of match bits).
* You can't wait for every process to finish probing, because any
process that has emitted a tuple might never come back if there is
another node that is also waiting for all processes (ie deadlock
against another PHJ doing the same thing), and probing is a phase that
emits tuples.

Generally, it's not safe to emit tuples while you are attached to a
Barrier, unless you're only going to detach from it, not wait at it,
because emitting tuples lets the program counter escape your control.
Generally, it's not safe to detach from a Barrier while accessing
resources whose lifetime it controls, such as a hash table, because
then it might go away underneath you.

The PHJ plans that are supported currently adhere to that programming
rule and so don't have a problem: after the Barrier reaches the
probing phase, processes never wait for each other again so they're
free to begin emitting tuples. They just detach when they're done
probing, and the last to detach cleans up (frees the hash table etc).
If there is more than one batch, they detach from one batch and attach
to another when they're ready (each batch has its own Barrier), so we
can consider the batches to be entirely independent.

There is probably a way to make a scan-for-unmatched-inner phase work,
possibly involving another Barrier or something like that, but I ran
out of time trying to figure it out and wanted to ship a working PHJ
for the more common plan types. I suppose PHLJ will face two variants
of this problem: (1) you need to synchronise the loops (you can't dump
the hash table in preparation for the next loop until all have
finished probing for the current loop), and yet you've already emitted
tuples, so you're not allowed to wait for other processes and they're
not allowed to wait for you, and (2) you can't start the
scan-for-unmatched-outer until all the probe loops belonging to one
batch are done. The first problem is sort of analogous to a problem I
faced with batches in the first place, which Robert and I found a
solution to by processing the batches in parallel, and could perhaps
be solved in the same way: run the loops in parallel (if that sounds
crazy, recall that every worker has its own quota of work_mem and the
data is entirely prepartitioned up front, which is why we are able to
run the batches in parallel; in constrast, single-batch mode makes a
hash table with a quota of nparticipants * work_mem). The second
problem is sort of analogous to the existing scan-for-unmatched-inner
problem that I haven't solved.

I "solved" these problem for now by having all workers except for one
detach from the outer batch file after finishing probing. The last
worker to arrive does not detach from the batch and instead iterates
through all of the workers' outer match status files per participant
shared mem SharedTuplestoreParticipant) and create a single unified
bitmap. All the other workers continue to wait at the barrier until
the sole remaining worker has finished with iterating through the
outer match status bitmap files.

Admittedly, I'm still fighting with this step a bit, but, my intent is
to have all the backends wait until the lone remaining worker has
created the unified bitmap, then, that worker, which is still attached
to the outer batch will scan the outer batch file and the unified
outer match status bitmap and emit unmatched tuples.

I thought that the other workers can move on and stop waiting at the
barrier once the lone remaining worker has scanned their outer match
status files. All the probe loops would be done, and the worker that
is emitting tuples is not referencing the inner side hashtable at all
and only the outer batch file and the combined bitmap.

--
Melanie Plageman

#41

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 6 years ago

In reply to: Melanie Plageman (#40)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Fri, Sep 06, 2019 at 10:54:13AM -0700, Melanie Plageman wrote:

On Thu, Sep 5, 2019 at 10:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Seems like a good time for me to try to summarise what I think the
main problems are here:

1. The match-bit storage problem already discussed. The tuples that
each process receives while reading from SharedTupleStore are
non-deterministic (like other parallel scans). To use a bitmap-based
approach, I guess we'd need to invent some way to give the tuples a
stable identifier within some kind of densely packed number space that
we could use to address the bitmap, or take the IO hit and write all
the tuples back. That might involve changing the way SharedTupleStore
holds data.

This I've dealt with by adding a tuplenum to the SharedTupleStore
itself which I atomically increment in sts_puttuple().
In ExecParallelHashJoinPartitionOuter(), as each worker writes tuples
to the batch files, they call sts_puttuple() and this increments the
number so each tuple has a unique number.
For persisting this number, I added the tuplenum to the meta data
section of the MinimalTuple (along with the hashvalue -- there was a
comment about this meta data that said it could be used for other
things in the future, so this seemed like a good place to put it) and
write that out to the batch file.

At the end of ExecParallelHashJoinPartitionOuter(), I make the outer
match status bitmap file. I use the final tuplenum count to determine
the number of bytes to write to it. Each worker has a file with a
bitmap which has the number of bytes required to represent the number
of tuples in that batch.

Because one worker may beat the other(s) and build the whole batch
file for a batch before the others have a chance, I also make the
outer match status bitmap file for workers who missed out in
ExecParallelHashJoinOuterGetTuple() using the final tuplenum as well.

That seems like a perfectly sensible solution to me. I'm sure there are
ways to optimize it (say, having a bitmap optimized for sparse data, or
bitmap shared by all the workers or something like that), but that's
definitely not needed for v1.

Even having a bitmap per worker is pretty cheap. Assume we have 1B rows,
the bitmap is 1B/8 bytes = ~120MB per worker. So with 16 workers that's
~2GB, give or take. But with 100B rows, the original data is ~100GB. So
the bitmaps are not free, but it's not terrible either.

2. Tricky problems relating to barriers and flow control. First, let
me explain why PHJ doesn't support full/right outer joins yet. At
first I thought it was going to be easy, because, although the shared
memory hash table is read-only after it has been built, it seems safe
to weaken that only slightly and let the match flag be set by any
process during probing: it's OK if two processes clobber each other's
writes, as the only transition is a single bit going strictly from 0
to 1, and there will certainly be a full memory barrier before anyone
tries to read those match bits. Then during the scan for unmatched,
you just have to somehow dole out hash table buckets or ranges of
buckets to processes on a first-come-first-served basis. But.... then
I crashed into the following problem:

* You can't begin the scan for unmatched tuples until every process
has finished probing (ie until you have the final set of match bits).
* You can't wait for every process to finish probing, because any
process that has emitted a tuple might never come back if there is
another node that is also waiting for all processes (ie deadlock
against another PHJ doing the same thing), and probing is a phase that
emits tuples.

Generally, it's not safe to emit tuples while you are attached to a
Barrier, unless you're only going to detach from it, not wait at it,
because emitting tuples lets the program counter escape your control.
Generally, it's not safe to detach from a Barrier while accessing
resources whose lifetime it controls, such as a hash table, because
then it might go away underneath you.

The PHJ plans that are supported currently adhere to that programming
rule and so don't have a problem: after the Barrier reaches the
probing phase, processes never wait for each other again so they're
free to begin emitting tuples. They just detach when they're done
probing, and the last to detach cleans up (frees the hash table etc).
If there is more than one batch, they detach from one batch and attach
to another when they're ready (each batch has its own Barrier), so we
can consider the batches to be entirely independent.

There is probably a way to make a scan-for-unmatched-inner phase work,
possibly involving another Barrier or something like that, but I ran
out of time trying to figure it out and wanted to ship a working PHJ
for the more common plan types. I suppose PHLJ will face two variants
of this problem: (1) you need to synchronise the loops (you can't dump
the hash table in preparation for the next loop until all have
finished probing for the current loop), and yet you've already emitted
tuples, so you're not allowed to wait for other processes and they're
not allowed to wait for you, and (2) you can't start the
scan-for-unmatched-outer until all the probe loops belonging to one
batch are done. The first problem is sort of analogous to a problem I
faced with batches in the first place, which Robert and I found a
solution to by processing the batches in parallel, and could perhaps
be solved in the same way: run the loops in parallel (if that sounds
crazy, recall that every worker has its own quota of work_mem and the
data is entirely prepartitioned up front, which is why we are able to
run the batches in parallel; in constrast, single-batch mode makes a
hash table with a quota of nparticipants * work_mem). The second
problem is sort of analogous to the existing scan-for-unmatched-inner
problem that I haven't solved.

I "solved" these problem for now by having all workers except for one
detach from the outer batch file after finishing probing. The last
worker to arrive does not detach from the batch and instead iterates
through all of the workers' outer match status files per participant
shared mem SharedTuplestoreParticipant) and create a single unified
bitmap. All the other workers continue to wait at the barrier until
the sole remaining worker has finished with iterating through the
outer match status bitmap files.

Why did you put solved in quotation marks? This seems like a reasonable
solution to me, at least for now, but the quotation marks kinda suggest
you think it's either not correct or not good enough. Or did I miss some
flaw that makes this unacceptable?

Admittedly, I'm still fighting with this step a bit, but, my intent is
to have all the backends wait until the lone remaining worker has
created the unified bitmap, then, that worker, which is still attached
to the outer batch will scan the outer batch file and the unified
outer match status bitmap and emit unmatched tuples.

Makes sense, I think.

The one "issue" this probably has is that it serializes the last step,
i.e. the search for unmatched tuples is done in a single process, instead
of parallelized over multiple workers. That's certainly unfortunate, but
is that really an issue in practice? Probably not for queries with just a
small number of unmatched tuples. And for cases with many unmatched rows
it's probably going to degrade to non-parallel case.

I thought that the other workers can move on and stop waiting at the
barrier once the lone remaining worker has scanned their outer match
status files. All the probe loops would be done, and the worker that
is emitting tuples is not referencing the inner side hashtable at all
and only the outer batch file and the combined bitmap.

Why would the workers need to wait for the lone worker to scan their
bitmap file? Or do the files disappear with the workers, or something
like that?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#42

Melanie Plageman

melanieplageman@gmail.com

about 6 years ago

In reply to: Melanie Plageman (#40)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

So, I finally have a prototype to share of parallel hashloop fallback.

See the commit message for a full description of the functionality of the
patch.

This patch does contain refactoring of nodeHashjoin.

I have split the Parallel HashJoin and Serial HashJoin state machines
up, as they were diverging in my patch to a point that made for a
really cluttered ExecHashJoinImpl() (ExecHashJoinImpl() is now gone).

The reason I didn't do this refactoring in one patch and then put the
adaptive hashjoin code on top of it is that I might like to make
Parallel HashJoin and Serial HashJoin different nodes.

I think that has been discussed elsewhere and was looking to
understand the rationale for keeping them in the same node.

The patch is a rough prototype. Below are some of the high-level
pieces of work that I plan to do next. (there are many TODOs in the
code as well).

Some of the major outstanding work:

- correctness:
- haven't tried it with anti-joins and don't think it works
- number of batches is not deterministic from run-to-run

- performance:
- join_hash.sql is *much* slower.
While there are loads of performance fixes needed in the patch,
the basic criteria for "falling back" is likely the culprit here.
- There are many bottlenecks (there are several places where a
barrier could be moved to somewhere less hot, an atomic used
instead of a lock, or a method of coordination could be used to
allow workers to do backend-local accounting and aggregate it)
- need to make sure it does not create outer match status files when
it shouldn't (inner joins, for example)

- testing:
- many unexercised cases
- add number of chunks to EXPLAIN (for users and for testing)

- refactoring:
- The match status bitmap should have its own API or, at least,
manipulation of it should be done in a centralized set of
functions
- Rename "chunk" (as in chunks of inner side) to something that is
not already used in the context of memory chunks and, more
importantly, SharedTuplestoreChunk
- Make references to "hashloop fallback" and "adaptive hashjoin"
more consistent
- Rename adaptiveHashjoin.h/.c files and change what is in the files
which are separate from nodeHashjoin.h/.c (depending on outcome of
"new node")
- The state machines are big and unwieldy now, so, there is probably
some larger restructuring that could be done
- Should probably use the ParallelHashJoinBatchAccessor to access
the ParallelHashJoinBatch everywhere (realized this recently)

--
Melanie Plageman

Attachments:

v3-0001-hashloop-fallback.patchapplication/octet-stream; name=v3-0001-hashloop-fallback.patchDownload

From 7a9b2e508f6361c0b3044675ffb4b3aecb280147 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 29 Dec 2019 18:56:42 -0800
Subject: [PATCH v3] Implement Adaptive Hashjoin

Serial Hashloop Fallback:

"Chunk" the inner file into arbitrary partitions of work_mem size
offset along tuple bounds while loading the batch into the hashtable.

Note that this makes it impossible to increase nbatches during the
loading of batches after initial hashtable creation.

In preparation for doing this chunking, separate "advance batch" and
"load batch".

Implement outer tuple batch rewinding per chunk of inner batch. Would
be a simple rewind and replay of outer side for each chunk of inner if
it weren't for LOJ. Because we need to wait to emit NULL-extended
tuples for LOJ until after all chunks of inner have been processed.

To do this without incurring additional memory pressure, use a
temporary BufFile to capture the match status of each outer side
tuple. Use one bit per tuple to represent the match status, and, since
for parallel-oblivious hashjoin the outer side tuples are encountered
in a deterministic order, synchronizing the outer tuples match status
file with the outer tuples in the batch file to decide which ones to
emit NULL-extended is easy and can be done with a simple counter.

For non-hashloop fallback scenario (including batch 0), this file is
not created and unmatched outer tuples should be emitted as they are
encountered.

Parallel Hashloop Fallback:

During initial allocation of the hashtable, each time the number of
batches is increased, a new variable in the ParallelHashJoinState,
batch_increases, is incremented.

In PHJ_GROW_BATCHES_DECIDING, if pstate->batch_increases >= 2,
parallel_hashloop_fallback will be enabled for qualifying batches.
From then on, if a batch is still too large to fit into the
space_allowed, then parallel_hashloop_fallback is set on that batch.
It will not be allowed to divide further and, during execution, the
fallback strategy will be used.

For a batch which has parallel_hashloop_fallback set, tuples inserted
into the the batch's inner and outer batch files will have an
additional piece of metadata (other than the hashvalue). For the inner
side, this additional metadata is the chunk number, For the outer
side, this additional metadata is the tuple identifier--needed when
rescanning the outer side batch file for each chunk of the inner.

During execution of a parallel hashjoin batch which needs to fall
back, the worker will create an "outer match status file" which
contains a bitmap tracking which outer tuples have matched an inner
tuple. All bits in the worker's outer match status file are initially
unset. During probing, the worker will set the corresponding bit (the
bit at the index of the tuple identifier) in the outer match status
bitmap for an outer tuple which matches any inner tuple.

Workers probing a fallback batch will wait until all workers have
finished probing before moving on so that an elected worker can read
and combine the outer match status files into a single bitmap and use
it to emit unmatched outer tuples after all chunks of the inner side
have been processed.
---
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/adaptiveHashjoin.c       |  323 +++++
 src/backend/executor/nodeHash.c               |   95 +-
 src/backend/executor/nodeHashjoin.c           | 1167 +++++++++++-----
 src/backend/postmaster/pgstat.c               |   21 +
 src/backend/storage/file/buffile.c            |   64 +
 src/backend/storage/ipc/barrier.c             |   85 ++
 src/backend/utils/sort/sharedtuplestore.c     |  127 +-
 src/include/executor/adaptiveHashjoin.h       |    9 +
 src/include/executor/hashjoin.h               |   28 +-
 src/include/executor/nodeHash.h               |    5 +-
 src/include/executor/tuptable.h               |    3 +-
 src/include/nodes/execnodes.h                 |   17 +
 src/include/pgstat.h                          |    8 +
 src/include/storage/barrier.h                 |    1 +
 src/include/storage/buffile.h                 |    3 +
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/sharedtuplestore.h          |   19 +
 src/test/regress/expected/adaptive_hj.out     | 1233 +++++++++++++++++
 .../regress/expected/parallel_adaptive_hj.out |  343 +++++
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/post_schedule                |    8 +
 src/test/regress/pre_schedule                 |  120 ++
 src/test/regress/serial_schedule              |    2 +
 src/test/regress/sql/adaptive_hj.sql          |  240 ++++
 src/test/regress/sql/parallel_adaptive_hj.sql |  182 +++
 26 files changed, 3726 insertions(+), 381 deletions(-)
 create mode 100644 src/backend/executor/adaptiveHashjoin.c
 create mode 100644 src/include/executor/adaptiveHashjoin.h
 create mode 100644 src/test/regress/expected/adaptive_hj.out
 create mode 100644 src/test/regress/expected/parallel_adaptive_hj.out
 create mode 100644 src/test/regress/post_schedule
 create mode 100644 src/test/regress/pre_schedule
 create mode 100644 src/test/regress/sql/adaptive_hj.sql
 create mode 100644 src/test/regress/sql/parallel_adaptive_hj.sql

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..54799d7644 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	adaptiveHashjoin.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
new file mode 100644
index 0000000000..6962ba986d
--- /dev/null
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -0,0 +1,323 @@
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "executor/hashjoin.h"
+#include "executor/nodeHash.h"
+#include "executor/nodeHashjoin.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/sharedtuplestore.h"
+
+#include "executor/adaptiveHashjoin.h"
+
+
+
+
+bool
+ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
+{
+	HashJoinTable hashtable;
+	int batchno;
+	ParallelHashJoinBatch *phj_batch;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+	Barrier *chunk_barrier;
+
+	hashtable = hjstate->hj_HashTable;
+	batchno = hashtable->curbatch;
+	phj_batch = hashtable->batches[batchno].shared;
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	/*
+	 * This chunk_barrier is initialized in the ELECTING phase when this worker
+	 * attached to the batch in ExecParallelHashJoinNewBatch()
+	 */
+	chunk_barrier = &hashtable->batches[batchno].shared->chunk_barrier;
+
+	/*
+	 * If this worker just came from probing (from HJ_SCAN_BUCKET) we need to
+	 * advance the chunk number here. Otherwise this worker isn't attached yet
+	 * to the chunk barrier.
+	 */
+	if (advance_from_probing)
+	{
+		/*
+		 * The current chunk number can't be incremented if *any* worker isn't
+		 * done yet (otherwise they might access the wrong data structure!)
+		 */
+		if (BarrierArriveAndWait(chunk_barrier,
+		                         WAIT_EVENT_HASH_CHUNK_PROBING))
+			phj_batch->current_chunk_num++;
+
+		/* Once the barrier is advanced we'll be in the DONE phase */
+	}
+	else
+		BarrierAttach(chunk_barrier);
+
+	/*
+	 * The outer side is exhausted and either
+	 * 1) the current chunk of the inner side is exhausted and it is time to advance the chunk
+	 * 2) the last chunk of the inner side is exhausted and it is time to advance the batch
+	 */
+	switch (BarrierPhase(chunk_barrier))
+	{
+		// TODO: remove this phase and coordinate access to hashtable above goto and after incrementing current_chunk_num
+		case PHJ_CHUNK_ELECTING:
+		phj_chunk_electing:
+			BarrierArriveAndWait(chunk_barrier,
+									 WAIT_EVENT_HASH_CHUNK_ELECTING);
+			/* Fall through. */
+
+		case PHJ_CHUNK_LOADING:
+			/* Start (or join in) loading the next chunk of inner tuples. */
+			sts_begin_parallel_scan(inner_tuples);
+
+			MinimalTuple  tuple;
+			tupleMetadata metadata;
+
+			while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+			{
+				if (metadata.tupleid != phj_batch->current_chunk_num)
+					continue;
+
+				ExecForceStoreMinimalTuple(tuple,
+										   hjstate->hj_HashTupleSlot,
+										   false);
+
+				ExecParallelHashTableInsertCurrentBatch(
+					hashtable,
+					hjstate->hj_HashTupleSlot,
+					metadata.hashvalue);
+			}
+			sts_end_parallel_scan(inner_tuples);
+			BarrierArriveAndWait(chunk_barrier,
+								 WAIT_EVENT_HASH_CHUNK_LOADING);
+			/* Fall through. */
+
+		case PHJ_CHUNK_PROBING:
+			sts_begin_parallel_scan(outer_tuples);
+			return true;
+
+		case PHJ_CHUNK_DONE:
+
+			BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_DONE);
+
+			if (phj_batch->current_chunk_num > phj_batch->total_num_chunks)
+			{
+				BarrierDetach(chunk_barrier);
+				return false;
+			}
+
+			/*
+			 * Otherwise it is time for the next chunk.
+			 * One worker should reset the hashtable
+			 */
+			if (BarrierArriveExplicitAndWait(chunk_barrier, PHJ_CHUNK_ELECTING, WAIT_EVENT_HASH_ADVANCE_CHUNK))
+			{
+				/* rewind/reset outer tuplestore and rewind outer match status files */
+				sts_reinitialize(outer_tuples);
+
+				/* reset inner's hashtable and recycle the existing bucket array. */
+				dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
+						dsa_get_address(hashtable->area, phj_batch->buckets);
+				for (size_t i = 0; i < hashtable->nbuckets; ++i)
+					dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+
+				// TODO: this will unfortunately rescan all inner tuples in the batch for each chunk
+				// should be able to save the block in the file which starts the next chunk instead
+				sts_reinitialize(inner_tuples);
+			}
+			goto phj_chunk_electing;
+
+		case PHJ_CHUNK_FINAL:
+			BarrierDetach(chunk_barrier);
+			return false;
+
+		default:
+			elog(ERROR, "unexpected chunk phase %d. pid %i. batch %i.",
+				 BarrierPhase(chunk_barrier), MyProcPid, batchno);
+	}
+
+	return false;
+}
+
+
+/*
+ * Choose a batch to work on, and attach to it.  Returns true if successful,
+ * false if there are no more batches.
+ */
+bool
+ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			start_batchno;
+	int			batchno;
+
+	/*
+	 * If we started up so late that the batch tracking array has been freed
+	 * already by ExecHashTableDetach(), then we are finished.  See also
+	 * ExecParallelHashEnsureBatchAccessors().
+	 */
+	if (hashtable->batches == NULL)
+		return false;
+
+	/*
+	 * For hashloop fallback only
+	 * Only the elected worker who was chosen to combine the outer match status bitmaps
+	 * should reach here. This worker must do some final cleanup and then detach from the batch
+	 */
+	if (hjstate->combined_bitmap != NULL)
+	{
+		BufFileClose(hjstate->combined_bitmap);
+		hjstate->combined_bitmap = NULL;
+		hashtable->batches[hashtable->curbatch].done = true;
+		ExecHashTableDetachBatch(hashtable);
+	}
+
+	/*
+	 * If we were already attached to a batch, remember not to bother checking
+	 * it again, and detach from it (possibly freeing the hash table if we are
+	 * last to detach).
+	 * curbatch is set when the batch_barrier phase is either PHJ_BATCH_LOADING
+	 * or PHJ_BATCH_CHUNKING (note that the PHJ_BATCH_LOADING case will fall through
+	 * to the PHJ_BATCH_CHUNKING case). The PHJ_BATCH_CHUNKING case returns to the
+	 * caller. So when this function is reentered with a curbatch >= 0 then we must
+	 * be done probing.
+	 */
+	if (hashtable->curbatch >= 0)
+	{
+		ParallelHashJoinBatchAccessor *accessor = hashtable->batches + hashtable->curbatch;
+		ParallelHashJoinBatch *batch = accessor->shared;
+
+		/*
+		 * End the parallel scan on the outer tuples before we arrive at the next barrier
+		 * so that the last worker to arrive at that barrier can reinitialize the SharedTuplestore
+		 * for another parallel scan.
+		 */
+
+		if (!batch->parallel_hashloop_fallback)
+			BarrierArriveAndWait(&batch->batch_barrier,
+			                     WAIT_EVENT_HASH_BATCH_PROBING);
+		else
+		{
+			sts_close_outer_match_status_file(accessor->outer_tuples);
+
+			/*
+			 * If all workers (including this one) have finished probing the batch, one worker is elected to
+			 * Combine all the outer match status files from the workers who were attached to this batch
+			 * Loop through the outer match status files from all workers that were attached to this batch
+			 * Combine them into one bitmap
+			 * Use the bitmap, loop through the outer batch file again, and emit unmatched tuples
+			 */
+
+			if (BarrierArriveAndWait(&batch->batch_barrier,
+									 WAIT_EVENT_HASH_BATCH_PROBING))
+			{
+				hjstate->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				hjstate->last_worker = true;
+				return true;
+			}
+		}
+
+		/* the elected combining worker should not reach here */
+		hashtable->batches[hashtable->curbatch].done = true;
+		ExecHashTableDetachBatch(hashtable);
+	}
+
+	/*
+	 * Search for a batch that isn't done.  We use an atomic counter to start
+	 * our search at a different batch in every participant when there are
+	 * more batches than participants.
+	 */
+	batchno = start_batchno =
+			pg_atomic_fetch_add_u32(&hashtable->parallel_state->distributor, 1) %
+			hashtable->nbatch;
+
+	do
+	{
+		if (!hashtable->batches[batchno].done)
+		{
+			Barrier *batch_barrier =
+					&hashtable->batches[batchno].shared->batch_barrier;
+
+			switch (BarrierAttach(batch_barrier))
+			{
+				case PHJ_BATCH_ELECTING:
+					/* One backend allocates the hash table. */
+					if (BarrierArriveAndWait(batch_barrier,
+											 WAIT_EVENT_HASH_BATCH_ELECTING))
+					{
+						ExecParallelHashTableAlloc(hashtable, batchno);
+						Barrier *chunk_barrier =
+								&hashtable->batches[batchno].shared->chunk_barrier;
+						BarrierInit(chunk_barrier, 0);
+						hashtable->batches[batchno].shared->current_chunk_num = 1;
+					}
+					/* Fall through. */
+
+				case PHJ_BATCH_ALLOCATING:
+					/* Wait for allocation to complete. */
+					BarrierArriveAndWait(batch_barrier,
+										 WAIT_EVENT_HASH_BATCH_ALLOCATING);
+					/* Fall through. */
+
+				case PHJ_BATCH_CHUNKING:
+					/*
+					 * This batch is ready to probe.  Return control to
+					 * caller. We stay attached to batch_barrier so that the
+					 * hash table stays alive until everyone's finished
+					 * probing it, but no participant is allowed to wait at
+					 * this barrier again (or else a deadlock could occur).
+					 * All attached participants must eventually call
+					 * BarrierArriveAndDetach() so that the final phase
+					 * PHJ_BATCH_DONE can be reached.
+					 */
+					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
+
+					if (batchno == 0)
+						sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
+					/*
+					 * Create an outer match status file for this batch for this worker
+					 * This file must be accessible to the other workers
+					 * But *only* written to by this worker.
+					 * Written to by this worker and readable by any worker
+					 */
+					if (hashtable->batches[batchno].shared->parallel_hashloop_fallback)
+						sts_make_outer_match_status_file(hashtable->batches[batchno].outer_tuples);
+
+					return true;
+
+				case PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING:
+					/*
+					 * The batch isn't done but this worker can't contribute
+					 * anything to it so it might as well be done from this
+					 * worker's perspective. (Only one worker can do work in
+					 * this phase).
+					 */
+
+					/* Fall through. */
+
+				case PHJ_BATCH_DONE:
+					/*
+					 * Already done. Detach and go around again (if any remain).
+					 */
+					BarrierDetach(batch_barrier);
+
+					hashtable->batches[batchno].done = true;
+					hashtable->curbatch = -1;
+					break;
+
+				default:
+					elog(ERROR, "unexpected batch phase %d. pid %i. batchno %i.",
+						 BarrierPhase(batch_barrier), MyProcPid, batchno);
+			}
+		}
+		batchno = (batchno + 1) % hashtable->nbatch;
+	} while (batchno != start_batchno);
+
+	return false;
+}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 568938667f..76e174a9b4 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -588,7 +588,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		 * Attach to the build barrier.  The corresponding detach operation is
 		 * in ExecHashTableDetach.  Note that we won't attach to the
 		 * batch_barrier for batch 0 yet.  We'll attach later and start it out
-		 * in PHJ_BATCH_PROBING phase, because batch 0 is allocated up front
+		 * in PHJ_BATCH_CHUNKING phase, because batch 0 is allocated up front
 		 * and then loaded while hashing (the standard hybrid hash join
 		 * algorithm), and we'll coordinate that using build_barrier.
 		 */
@@ -1061,7 +1061,9 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			i;
 
 	Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
-
+	LWLockAcquire(&pstate->lock, LW_EXCLUSIVE);
+	pstate->batch_increases++;
+	LWLockRelease(&pstate->lock);
 	/*
 	 * It's unlikely, but we need to be prepared for new participants to show
 	 * up while we're in the middle of this operation so we need to switch on
@@ -1216,11 +1218,17 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 			{
 				bool		space_exhausted = false;
 				bool		extreme_skew_detected = false;
+				bool		excessive_batch_num_increases = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
 				ExecParallelHashTableSetCurrentBatch(hashtable, 0);
 
+				LWLockAcquire(&pstate->lock, LW_EXCLUSIVE);
+				if (pstate->batch_increases >= 2)
+					excessive_batch_num_increases = true;
+				LWLockRelease(&pstate->lock);
+
 				/* Are any of the new generation of batches exhausted? */
 				for (i = 0; i < hashtable->nbatch; ++i)
 				{
@@ -1233,6 +1241,14 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 
 						space_exhausted = true;
 
+						// only once we've increased the number of batches overall many times should we start setting
+						// some batches to use the fallback strategy. Those that are still too big will have this option set
+						// we better not repartition again (growth should be disabled), so that we don't overwrite this value
+						// this tells us if we have set fallback to true or not and how many chunks -- useful for seeing how many chunks
+						// we can get to before setting it to true (since we still mark chunks (work_mem sized chunks)) in batches even if we don't fall back
+						// same for below but opposite
+						if (excessive_batch_num_increases == true)
+							batch->parallel_hashloop_fallback = true;
 						/*
 						 * Did this batch receive ALL of the tuples from its
 						 * parent batch?  That would indicate that further
@@ -1248,6 +1264,8 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* Don't keep growing if it's not helping or we'd overflow. */
 				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
+				else if (excessive_batch_num_increases && space_exhausted)
+					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
 				else
@@ -1315,9 +1333,26 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
 				/* It belongs in a later batch. */
+				ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+				LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+				// TODO: should I check batch estimated size here at all?
+				if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > hashtable->parallel_state->space_allowed))
+				{
+					phj_batch->total_num_chunks++;
+					phj_batch->estimated_chunk_size = tuple_size;
+				}
+				else
+					phj_batch->estimated_chunk_size += tuple_size;
+
+				tupleMetadata metadata;
+				metadata.hashvalue = hashTuple->hashvalue;
+				metadata.tupleid = phj_batch->total_num_chunks;
+				LWLockRelease(&phj_batch->lock);
+
 				hashtable->batches[batchno].estimated_size += tuple_size;
 				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+							 &metadata, tuple);
 			}
 
 			/* Count this tuple. */
@@ -1369,12 +1404,14 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+		tupleMetadata metadata;
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &metadata)))
 		{
 			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
 
+			hashvalue = metadata.hashvalue;
 			/* Decide which partition it goes to in the new generation. */
 			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
 									  &batchno);
@@ -1383,10 +1420,23 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 			++hashtable->batches[batchno].ntuples;
 			++hashtable->batches[i].old_ntuples;
 
+			ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+			LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+			// TODO: should I check batch estimated size here at all?
+			if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
+			{
+				phj_batch->total_num_chunks++;
+				phj_batch->estimated_chunk_size = tuple_size;
+			}
+			else
+				phj_batch->estimated_chunk_size += tuple_size;
+			metadata.tupleid = phj_batch->total_num_chunks;
+			LWLockRelease(&phj_batch->lock);
 			/* Store the tuple its new batch. */
 			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+						 &metadata, tuple);
 
+			// TODO: should I zero out metadata here to make sure old values aren't reused?
 			CHECK_FOR_INTERRUPTS();
 		}
 		sts_end_parallel_scan(old_inner_tuples[i]);
@@ -1719,7 +1769,7 @@ retry:
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
 		Assert(batchno > 0);
-
+		ParallelHashJoinState *pstate = hashtable->parallel_state;
 		/* Try to preallocate space in the batch if necessary. */
 		if (hashtable->batches[batchno].preallocated < tuple_size)
 		{
@@ -1729,7 +1779,25 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
+		ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+		LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+
+		// TODO: should batch estimated size be considered here?
+		// TODO: should this be done in ExecParallelHashTableInsertCurrentBatch instead?
+		if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
+		{
+			phj_batch->total_num_chunks++;
+			phj_batch->estimated_chunk_size = tuple_size;
+		}
+		else
+			phj_batch->estimated_chunk_size += tuple_size;
+
+		tupleMetadata metadata;
+		metadata.hashvalue = hashvalue;
+		metadata.tupleid = phj_batch->total_num_chunks;
+		LWLockRelease(&phj_batch->lock);
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata,
 					 tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
@@ -2936,6 +3004,13 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
 		char		name[MAXPGPATH];
 
+		shared->parallel_hashloop_fallback = false;
+		LWLockInitialize(&shared->lock,
+		                 LWTRANCHE_PARALLEL_HASH_JOIN_BATCH);
+		shared->current_chunk_num = 0;
+		shared->total_num_chunks = 1;
+		shared->estimated_chunk_size = 0;
+
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
@@ -2945,7 +3020,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 		{
 			/* Batch 0 doesn't need to be loaded. */
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_CHUNKING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
 		}
@@ -2959,7 +3034,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2969,7 +3044,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index ec37558c12..d9e967b41a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -81,11 +81,11 @@
  *  PHJ_BATCH_ELECTING       -- initial state
  *  PHJ_BATCH_ALLOCATING     -- one allocates buckets
  *  PHJ_BATCH_LOADING        -- all load the hash table from disk
- *  PHJ_BATCH_PROBING        -- all probe
+ *  PHJ_BATCH_CHUNKING       -- all probe
  *  PHJ_BATCH_DONE           -- end
  *
  * Batch 0 is a special case, because it starts out in phase
- * PHJ_BATCH_PROBING; populating batch 0's hash table is done during
+ * PHJ_BATCH_CHUNKING; populating batch 0's hash table is done during
  * PHJ_BUILD_HASHING_INNER so we can skip loading.
  *
  * Initially we try to plan for a single-batch hash join using the combined
@@ -98,7 +98,7 @@
  * already arrived.  Practically, that means that we never return a tuple
  * while attached to a barrier, unless the barrier has reached its final
  * state.  In the slightly special case of the per-batch barrier, we return
- * tuples while in PHJ_BATCH_PROBING phase, but that's OK because we use
+ * tuples while in PHJ_BATCH_CHUNKING phase, but that's OK because we use
  * BarrierArriveAndDetach() to advance it to PHJ_BATCH_DONE without waiting.
  *
  *-------------------------------------------------------------------------
@@ -117,6 +117,8 @@
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
+#include "executor/adaptiveHashjoin.h"
+
 
 /*
  * States of the ExecHashJoin state machine
@@ -124,9 +126,11 @@
 #define HJ_BUILD_HASHTABLE		1
 #define HJ_NEED_NEW_OUTER		2
 #define HJ_SCAN_BUCKET			3
-#define HJ_FILL_OUTER_TUPLE		4
-#define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_FILL_INNER_TUPLES    4
+#define HJ_NEED_NEW_BATCH		5
+#define HJ_NEED_NEW_INNER_CHUNK 6
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT 7
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER 8
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +147,15 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
-static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
-static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+
+static bool ExecHashJoinAdvanceBatch(HashJoinState *hjstate);
+static bool ExecHashJoinLoadInnerBatch(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
 
+static TupleTableSlot *emitUnmatchedOuterTuple(ExprState *otherqual,
+											   ExprContext *econtext,
+											   HashJoinState *hjstate);
+
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -161,8 +170,15 @@ static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
  *			  the other one is "outer".
  * ----------------------------------------------------------------
  */
-static pg_attribute_always_inline TupleTableSlot *
-ExecHashJoinImpl(PlanState *pstate, bool parallel)
+
+/* ----------------------------------------------------------------
+ *		ExecHashJoin
+ *
+ *		Parallel-oblivious version.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *			/* return: a tuple or NULL */
+ExecHashJoin(PlanState *pstate)
 {
 	HashJoinState *node = castNode(HashJoinState, pstate);
 	PlanState  *outerNode;
@@ -174,7 +190,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	TupleTableSlot *outerTupleSlot;
 	uint32		hashvalue;
 	int			batchno;
-	ParallelHashJoinState *parallel_state;
+
+	BufFile    *outerFileForAdaptiveRead;
 
 	/*
 	 * get information from HashJoin node
@@ -185,7 +202,6 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	outerNode = outerPlanState(node);
 	hashtable = node->hj_HashTable;
 	econtext = node->js.ps.ps_ExprContext;
-	parallel_state = hashNode->parallel_state;
 
 	/*
 	 * Reset per-tuple memory context to free any expression evaluation
@@ -238,49 +254,514 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 * from the outer plan node.  If we succeed, we have to stash
 				 * it away for later consumption by ExecHashJoinOuterGetTuple.
 				 */
-				if (HJ_FILL_INNER(node))
-				{
-					/* no chance to not build the hash table */
-					node->hj_FirstOuterTupleSlot = NULL;
-				}
-				else if (parallel)
-				{
-					/*
-					 * The empty-outer optimization is not implemented for
-					 * shared hash tables, because no one participant can
-					 * determine that there are no outer tuples, and it's not
-					 * yet clear that it's worth the synchronization overhead
-					 * of reaching consensus to figure that out.  So we have
-					 * to build the hash table.
-					 */
-					node->hj_FirstOuterTupleSlot = NULL;
-				}
-				else if (HJ_FILL_OUTER(node) ||
-						 (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
-						  !node->hj_OuterNotEmpty))
-				{
-					node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode);
-					if (TupIsNull(node->hj_FirstOuterTupleSlot))
-					{
-						node->hj_OuterNotEmpty = false;
-						return NULL;
-					}
-					else
-						node->hj_OuterNotEmpty = true;
-				}
-				else
-					node->hj_FirstOuterTupleSlot = NULL;
+				if (HJ_FILL_INNER(node))
+				{
+					/* no chance to not build the hash table */
+					node->hj_FirstOuterTupleSlot = NULL;
+				}
+				else if (HJ_FILL_OUTER(node) ||
+					(outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
+						!node->hj_OuterNotEmpty))
+				{
+					node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode);
+					if (TupIsNull(node->hj_FirstOuterTupleSlot))
+					{
+						node->hj_OuterNotEmpty = false;
+						return NULL;
+					}
+					else
+						node->hj_OuterNotEmpty = true;
+				}
+				else
+					node->hj_FirstOuterTupleSlot = NULL;
+
+				/* Create the hash table. */
+				hashtable = ExecHashTableCreate(hashNode,
+				                                node->hj_HashOperators,
+				                                node->hj_Collations,
+				                                HJ_FILL_INNER(node));
+				node->hj_HashTable = hashtable;
+
+				/* Execute the Hash node, to build the hash table. */
+				hashNode->hashtable = hashtable;
+				(void) MultiExecProcNode((PlanState *) hashNode);
+
+				/*
+				 * If the inner relation is completely empty, and we're not
+				 * doing a left outer join, we can quit without scanning the
+				 * outer relation.
+				 */
+				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
+					return NULL;
+
+				/*
+				 * need to remember whether nbatch has increased since we
+				 * began scanning the outer relation
+				 */
+				hashtable->nbatch_outstart = hashtable->nbatch;
+
+				/*
+				 * Reset OuterNotEmpty for scan.  (It's OK if we fetched a
+				 * tuple above, because ExecHashJoinOuterGetTuple will
+				 * immediately set it again.)
+				 */
+				node->hj_OuterNotEmpty = false;
+
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				/* FALL THRU */
+
+			case HJ_NEED_NEW_OUTER:
+
+				/*
+				 * We don't have an outer tuple, try to get the next one
+				 */
+				outerTupleSlot =
+					ExecHashJoinOuterGetTuple(outerNode, node, &hashvalue);
+
+				if (TupIsNull(outerTupleSlot))
+				{
+					/*
+					 * end of batch, or maybe whole join.
+					 * for hashloop fallback, all we know is outer batch is
+					 * exhausted. inner could have more chunks
+					 */
+					if (HJ_FILL_INNER(node))
+					{
+						/* set up to scan for unmatched inner tuples */
+						ExecPrepHashTableForUnmatched(node);
+						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+						break;
+					}
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					break;
+				}
+
+				econtext->ecxt_outertuple = outerTupleSlot;
+
+				/*
+				 * Find the corresponding bucket for this tuple in the main
+				 * hash table or skew hash table.
+				 */
+				node->hj_CurHashValue = hashvalue;
+				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+				                          &node->hj_CurBucketNo, &batchno);
+				node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable,
+				                                                 hashvalue);
+				node->hj_CurTuple = NULL;
+
+				/*
+				 * for the hashloop fallback case,
+				 * only initialize hj_MatchedOuter to false during the first chunk.
+				 * otherwise, we will be resetting hj_MatchedOuter to false for
+				 * an outer tuple that has already matched an inner tuple.
+				 * also, hj_MatchedOuter should be set to false for batch 0.
+				 * there are no chunks for batch 0, and node->hj_InnerFirstChunk isn't
+				 * set to true until HJ_NEED_NEW_BATCH,
+				 * so need to handle batch 0 explicitly
+				 */
+
+				if (!node->hashloop_fallback || hashtable->curbatch == 0 || node->hj_InnerFirstChunk)
+					node->hj_MatchedOuter = false;
+
+				/*
+				 * The tuple might not belong to the current batch (where
+				 * "current batch" includes the skew buckets if any).
+				 */
+				if (batchno != hashtable->curbatch &&
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+				{
+					bool		shouldFree;
+					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
+					                                                  &shouldFree);
+
+					/*
+					 * Need to postpone this outer tuple to a later batch.
+					 * Save it in the corresponding outer-batch file.
+					 */
+					Assert(batchno > hashtable->curbatch);
+					ExecHashJoinSaveTuple(mintuple, hashvalue,
+					                      &hashtable->outerBatchFile[batchno]);
+
+					if (shouldFree)
+						heap_free_minimal_tuple(mintuple);
+
+					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
+					continue;
+				}
+
+				if (node->hashloop_fallback)
+				{
+					/* first tuple of new batch */
+					if (node->hj_OuterMatchStatusesFile == NULL)
+					{
+						node->hj_OuterTupleCount = 0;
+						node->hj_OuterMatchStatusesFile = BufFileCreateTemp(false);
+					}
+
+					/* for fallback case, always increment tuple count */
+					node->hj_OuterTupleCount++;
+
+					/* Use the next byte on every 8th tuple */
+					if ((node->hj_OuterTupleCount - 1) % 8 == 0)
+					{
+						/*
+						 * first chunk of new batch, so write and initialize
+						 * enough bytes in the outer tuple match status file to
+						 * capture all tuples' match statuses
+						 */
+						if (node->hj_InnerFirstChunk)
+						{
+							node->hj_OuterCurrentByte = 0;
+							BufFileWrite(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+						}
+							/* otherwise, just read the next byte */
+						else
+							BufFileRead(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+					}
+				}
+
+				/* OK, let's scan the bucket for matches */
+				node->hj_JoinState = HJ_SCAN_BUCKET;
+
+				/* FALL THRU */
+
+			case HJ_SCAN_BUCKET:
+
+				/*
+				 * Scan the selected hash bucket for matches to current outer
+				 */
+				if (!ExecScanHashBucket(node, econtext))
+				{
+					/*
+					 * The current outer tuple has run out of matches, so check
+					 * whether to emit a dummy outer-join tuple.  Whether we emit
+					 * one or not, the next state is NEED_NEW_OUTER.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					if (!node->hashloop_fallback || node->hj_HashTable->curbatch == 0)
+					{
+						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+						if (slot != NULL)
+							return slot;
+					}
+					continue;
+				}
+
+				if (joinqual != NULL && !ExecQual(joinqual, econtext))
+				{
+					InstrCountFiltered1(node, 1);
+					break;
+				}
+
+				/*
+				 * We've got a match, but still need to test non-hashed quals.
+				 * ExecScanHashBucket already set up all the state needed to
+				 * call ExecQual.
+				 *
+				 * If we pass the qual, then save state for next call and have
+				 * ExecProject form the projection, store it in the tuple
+				 * table, and return the slot.
+				 *
+				 * Only the joinquals determine tuple match status, but all
+				 * quals must pass to actually return the tuple.
+				 */
+
+				node->hj_MatchedOuter = true;
+				HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+
+				/* In an antijoin, we never return a matched tuple */
+				if (node->js.jointype == JOIN_ANTI)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					continue;
+				}
+
+				/*
+				 * If we only need to join to the first matching inner tuple,
+				 * then consider returning this one, but after that, continue
+				 * with next outer tuple.
+				 */
+				// TODO: is semi-join correct for AHJ
+				if (node->js.single_match)
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				/*
+				 * Set the match bit for this outer tuple in the match
+				 * status file
+				 */
+				if (node->hj_OuterMatchStatusesFile != NULL)
+				{
+					Assert(node->hashloop_fallback == true);
+					int byte_to_set = (node->hj_OuterTupleCount - 1) / 8;
+					int bit_to_set_in_byte = (node->hj_OuterTupleCount - 1) % 8;
+
+					BufFileSeek(node->hj_OuterMatchStatusesFile, 0, byte_to_set, SEEK_SET);
+
+					node->hj_OuterCurrentByte = node->hj_OuterCurrentByte | (1 << bit_to_set_in_byte);
+
+					BufFileWrite(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+				}
+
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				InstrCountFiltered2(node, 1);
+				break;
+
+			case HJ_FILL_INNER_TUPLES:
+
+				/*
+				 * We have finished a batch, but we are doing right/full join,
+				 * so any unmatched inner tuples in the hashtable have to be
+				 * emitted before we continue to the next batch.
+				 */
+				if (!ExecScanHashTableForUnmatched(node, econtext))
+				{
+					/* no more unmatched tuples */
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					continue;
+				}
+
+				/*
+				 * Generate a fake join tuple with nulls for the outer tuple,
+				 * and return it if it passes the non-join quals.
+				 */
+				econtext->ecxt_outertuple = node->hj_NullOuterTupleSlot;
+
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				InstrCountFiltered2(node, 1);
+				break;
+
+			case HJ_NEED_NEW_BATCH:
+
+				/*
+				 * Try to advance to next batch.  Done if there are no more.
+				 * for batches after batch 0 for which hashloop_fallback is
+				 * true, if inner is exhausted, need to consider emitting
+				 * unmatched tuples we should never get here when
+				 * hashloop_fallback is false but hj_InnerExhausted is true,
+				 * however, it felt more clear to check for
+				 * hashloop_fallback explicitly
+				 */
+				if (node->hashloop_fallback && HJ_FILL_OUTER(node) && node->hj_InnerExhausted)
+				{
+					/*
+					 * For hashloop fallback, outer tuples are not emitted
+					 * until directly before advancing the batch (after all
+					 * inner chunks have been processed).
+					 * node->hashloop_fallback should be true because it is
+					 * not reset to false until advancing the batches
+					 */
+					node->hj_InnerExhausted = false;
+					node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT;
+					break;
+				}
+
+				if (!ExecHashJoinAdvanceBatch(node))
+					return NULL;
+
+				// TODO: need to find a better way to distinguish if I should load inner batch again than checking for outer batch file
+				// I need to also do this even if it is NULL when it is a ROJ
+				// need to load inner again if it is an inner or left outer join and there are outer tuples in the batch OR
+				// if it is a ROJ and there are inner tuples in the batch -- should never have no tuples in either batch...
+				if (BufFileRewindIfExists(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]) != NULL ||
+					(node->hj_HashTable->innerBatchFile[node->hj_HashTable->curbatch] != NULL && HJ_FILL_INNER(node)))
+					ExecHashJoinLoadInnerBatch(node); /* TODO: should I ever load inner when outer file is not present? */
+
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				break;
+
+			case HJ_NEED_NEW_INNER_CHUNK:
+
+				if (!node->hashloop_fallback)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/*
+				 * it is the hashloop fallback case and there are no more chunks
+				 * inner is exhausted, so we must advance the batches
+				 */
+				if (node->hj_InnerPageOffset == 0L)
+				{
+					node->hj_InnerExhausted = true;
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/*
+				 * This is the hashloop fallback case and we have more chunks in
+				 * inner. curbatch > 0. Rewind outer batch file (if present) so
+				 * that we can start reading it. Rewind outer match statuses
+				 * file if present so that we can set match bits as needed. Reset
+				 * the tuple count and load the next chunk of inner. Then
+				 * proceed to get a new outer tuple from our rewound outer batch
+				 * file
+				 */
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				// TODO: need to find a better way to distinguish if I should load inner batch again than checking for outer batch file
+				// I need to also do this even if it is NULL when it is a ROJ
+				// need to load inner again if it is an inner or left outer join and there are outer tuples in the batch OR
+				// if it is a ROJ and there are inner tuples in the batch -- should never have no tuples in either batch...
+				// if outer is not null or if it is a ROJ and inner is not null, must rewind outer match status and load inner
+				if (BufFileRewindIfExists(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]) != NULL ||
+					(node->hj_HashTable->innerBatchFile[node->hj_HashTable->curbatch] != NULL && HJ_FILL_INNER(node)))
+				{
+					BufFileRewindIfExists(node->hj_OuterMatchStatusesFile);
+					node->hj_OuterTupleCount = 0;
+					ExecHashJoinLoadInnerBatch(node);
+				}
+				break;
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
+
+				node->hj_OuterTupleCount = 0;
+				BufFileRewindIfExists(node->hj_OuterMatchStatusesFile);
+
+				/* TODO: is it okay to use the hashtable to get the outer batch file here? */
+				outerFileForAdaptiveRead = hashtable->outerBatchFile[hashtable->curbatch];
+				if (outerFileForAdaptiveRead == NULL) /* TODO: could this happen */
+				{
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+				BufFileRewindIfExists(outerFileForAdaptiveRead);
+
+				node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER;
+				/* fall through */
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+
+				outerFileForAdaptiveRead = hashtable->outerBatchFile[hashtable->curbatch];
+
+				while (true)
+				{
+					uint32 unmatchedOuterHashvalue;
+					TupleTableSlot *slot = ExecHashJoinGetSavedTuple(node,
+						outerFileForAdaptiveRead,
+						&unmatchedOuterHashvalue,
+						node->hj_OuterTupleSlot);
+					node->hj_OuterTupleCount++;
+
+					if (slot == NULL)
+					{
+						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						break;
+					}
+
+					unsigned char bit = (node->hj_OuterTupleCount - 1) % 8;
+
+					/* need to read the next byte */
+					if (bit == 0)
+						BufFileRead(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+
+					/* if the match bit is set for this tuple, continue */
+					if ((node->hj_OuterCurrentByte >> bit) & 1)
+						continue;
+
+					/* if it is not a match then emit it NULL-extended */
+					econtext->ecxt_outertuple = slot;
+					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				}
+				/* came here from HJ_NEED_NEW_BATCH, so go back there */
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				break;
+
+			default:
+				elog(ERROR, "unrecognized hashjoin state: %d",
+				     (int) node->hj_JoinState);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelHashJoin
+ *
+ *		Parallel-aware version.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *			/* return: a tuple or NULL */
+ExecParallelHashJoin(PlanState *pstate)
+{
+	HashJoinState *node = castNode(HashJoinState, pstate);
+	PlanState  *outerNode;
+	HashState  *hashNode;
+	ExprState  *joinqual;
+	ExprState  *otherqual;
+	ExprContext *econtext;
+	HashJoinTable hashtable;
+	TupleTableSlot *outerTupleSlot;
+	uint32		hashvalue;
+	int			batchno;
+	ParallelHashJoinState *parallel_state;
+
+	/*
+	 * get information from HashJoin node
+	 */
+	joinqual = node->js.joinqual;
+	otherqual = node->js.ps.qual;
+	hashNode = (HashState *) innerPlanState(node);
+	outerNode = outerPlanState(node);
+	hashtable = node->hj_HashTable;
+	econtext = node->js.ps.ps_ExprContext;
+	parallel_state = hashNode->parallel_state;
+
+	bool advance_from_probing = false;
+
+	/*
+	 * Reset per-tuple memory context to free any expression evaluation
+	 * storage allocated in the previous tuple cycle.
+	 */
+	ResetExprContext(econtext);
+
+	/*
+	 * run the hash join state machine
+	 */
+	for (;;)
+	{
+		SharedTuplestoreAccessor *outer_acc;
+
+		/*
+		 * It's possible to iterate this loop many times before returning a
+		 * tuple, in some pathological cases such as needing to move much of
+		 * the current batch to a later batch.  So let's check for interrupts
+		 * each time through.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		switch (node->hj_JoinState)
+		{
+			case HJ_BUILD_HASHTABLE:
+
+				/*
+				 * First time through: build hash table for inner relation.
+				 */
+				Assert(hashtable == NULL);
+				//volatile int mybp = 0; while (mybp == 0);
+
+				/*
+				 * The empty-outer optimization is not implemented for
+				 * shared hash tables, because no one participant can
+				 * determine that there are no outer tuples, and it's not
+				 * yet clear that it's worth the synchronization overhead
+				 * of reaching consensus to figure that out.  So we have
+				 * to build the hash table.
+				 */
+				node->hj_FirstOuterTupleSlot = NULL;
 
 				/*
 				 * Create the hash table.  If using Parallel Hash, then
 				 * whoever gets here first will create the hash table and any
 				 * later arrivals will merely attach to it.
 				 */
-				hashtable = ExecHashTableCreate(hashNode,
-												node->hj_HashOperators,
-												node->hj_Collations,
-												HJ_FILL_INNER(node));
-				node->hj_HashTable = hashtable;
+				node->hj_HashTable = hashtable = ExecHashTableCreate(hashNode,
+				                                node->hj_HashOperators,
+				                                node->hj_Collations,
+				                                HJ_FILL_INNER(node));
 
 				/*
 				 * Execute the Hash node, to build the hash table.  If using
@@ -311,66 +792,59 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_OuterNotEmpty = false;
 
-				if (parallel)
-				{
-					Barrier    *build_barrier;
-
-					build_barrier = &parallel_state->build_barrier;
-					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
-						   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
-					if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
-					{
-						/*
-						 * If multi-batch, we need to hash the outer relation
-						 * up front.
-						 */
-						if (hashtable->nbatch > 1)
-							ExecParallelHashJoinPartitionOuter(node);
-						BarrierArriveAndWait(build_barrier,
-											 WAIT_EVENT_HASH_BUILD_HASHING_OUTER);
-					}
-					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
-
-					/* Each backend should now select a batch to work on. */
-					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				Barrier    *build_barrier;
 
-					continue;
+				build_barrier = &parallel_state->build_barrier;
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
+					       BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
+				if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
+				{
+					/*
+					 * If multi-batch, we need to hash the outer relation
+					 * up front.
+					 */
+					if (hashtable->nbatch > 1)
+						ExecParallelHashJoinPartitionOuter(node);
+					BarrierArriveAndWait(build_barrier,
+					                     WAIT_EVENT_HASH_BUILD_HASHING_OUTER);
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
 
-				/* FALL THRU */
+				/* Each backend should now select a batch to work on. */
+				hashtable->curbatch = -1;
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
+
+				continue;
 
 			case HJ_NEED_NEW_OUTER:
 
 				/*
 				 * We don't have an outer tuple, try to get the next one
 				 */
-				if (parallel)
-					outerTupleSlot =
-						ExecParallelHashJoinOuterGetTuple(outerNode, node,
-														  &hashvalue);
-				else
-					outerTupleSlot =
-						ExecHashJoinOuterGetTuple(outerNode, node, &hashvalue);
+				outerTupleSlot =
+					ExecParallelHashJoinOuterGetTuple(outerNode, node,
+					                                  &hashvalue);
 
 				if (TupIsNull(outerTupleSlot))
 				{
-					/* end of batch, or maybe whole join */
+					/*
+					 * end of batch, or maybe whole join.
+					 * for hashloop fallback, all we know is outer batch is
+					 * exhausted. inner could have more chunks
+					 */
 					if (HJ_FILL_INNER(node))
 					{
 						/* set up to scan for unmatched inner tuples */
 						ExecPrepHashTableForUnmatched(node);
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+						break;
 					}
-					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					continue;
+					advance_from_probing = true;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					break;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -378,39 +852,24 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_CurHashValue = hashvalue;
 				ExecHashGetBucketAndBatch(hashtable, hashvalue,
-										  &node->hj_CurBucketNo, &batchno);
+				                          &node->hj_CurBucketNo, &batchno);
 				node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable,
-																 hashvalue);
+				                                                 hashvalue);
 				node->hj_CurTuple = NULL;
 
 				/*
-				 * The tuple might not belong to the current batch (where
-				 * "current batch" includes the skew buckets if any).
+				 * for the hashloop fallback case,
+				 * only initialize hj_MatchedOuter to false during the first chunk.
+				 * otherwise, we will be resetting hj_MatchedOuter to false for
+				 * an outer tuple that has already matched an inner tuple.
+				 * also, hj_MatchedOuter should be set to false for batch 0.
+				 * there are no chunks for batch 0
 				 */
-				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
-				{
-					bool		shouldFree;
-					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
-																	  &shouldFree);
-
-					/*
-					 * Need to postpone this outer tuple to a later batch.
-					 * Save it in the corresponding outer-batch file.
-					 */
-					Assert(parallel_state == NULL);
-					Assert(batchno > hashtable->curbatch);
-					ExecHashJoinSaveTuple(mintuple, hashvalue,
-										  &hashtable->outerBatchFile[batchno]);
-
-					if (shouldFree)
-						heap_free_minimal_tuple(mintuple);
 
-					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
-					continue;
-				}
+				ParallelHashJoinBatch *phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
 
-				/* OK, let's scan the bucket for matches */
+				if (!phj_batch->parallel_hashloop_fallback || phj_batch->current_chunk_num == 1)
+					node->hj_MatchedOuter = false;
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
 				/* FALL THRU */
@@ -420,25 +879,24 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * Scan the selected hash bucket for matches to current outer
 				 */
-				if (parallel)
-				{
-					if (!ExecParallelScanHashBucket(node, econtext))
-					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
-					}
-				}
-				else
+				phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
+
+				if (!ExecParallelScanHashBucket(node, econtext))
 				{
-					if (!ExecScanHashBucket(node, econtext))
+					/*
+					 * The current outer tuple has run out of matches, so check
+					 * whether to emit a dummy outer-join tuple.  Whether we emit
+					 * one or not, the next state is NEED_NEW_OUTER.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					if (!phj_batch->parallel_hashloop_fallback)
 					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
+						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+						if (slot != NULL)
+							return slot;
 					}
+					continue;
 				}
-
 				/*
 				 * We've got a match, but still need to test non-hashed quals.
 				 * ExecScanHashBucket already set up all the state needed to
@@ -451,58 +909,45 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 * Only the joinquals determine tuple match status, but all
 				 * quals must pass to actually return the tuple.
 				 */
-				if (joinqual == NULL || ExecQual(joinqual, econtext))
+				if (joinqual != NULL && !ExecQual(joinqual, econtext))
 				{
-					node->hj_MatchedOuter = true;
-					HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
-
-					/* In an antijoin, we never return a matched tuple */
-					if (node->js.jointype == JOIN_ANTI)
-					{
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
-						continue;
-					}
+					InstrCountFiltered1(node, 1);
+					break;
+				}
 
-					/*
-					 * If we only need to join to the first matching inner
-					 * tuple, then consider returning this one, but after that
-					 * continue with next outer tuple.
-					 */
-					if (node->js.single_match)
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_MatchedOuter = true;
+				HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
 
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
+				// TODO: how does this interact with PAHJ -- do I need to set matchbit?
+				/* In an antijoin, we never return a matched tuple */
+				if (node->js.jointype == JOIN_ANTI)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					continue;
 				}
-				else
-					InstrCountFiltered1(node, 1);
-				break;
-
-			case HJ_FILL_OUTER_TUPLE:
 
 				/*
-				 * The current outer tuple has run out of matches, so check
-				 * whether to emit a dummy outer-join tuple.  Whether we emit
-				 * one or not, the next state is NEED_NEW_OUTER.
+				 * If we only need to join to the first matching inner
+				 * tuple, then consider returning this one, but after that
+				 * continue with next outer tuple.
 				 */
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				if (node->js.single_match)
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
-				if (!node->hj_MatchedOuter &&
-					HJ_FILL_OUTER(node))
+				/*
+				 * Set the match bit for this outer tuple in the match
+				 * status file
+				 */
+				if (phj_batch->parallel_hashloop_fallback)
 				{
-					/*
-					 * Generate a fake join tuple with nulls for the inner
-					 * tuple, and return it if it passes the non-join quals.
-					 */
-					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					sts_set_outer_match_status(hashtable->batches[hashtable->curbatch].outer_tuples,
+					                           econtext->ecxt_outertuple->tuplenum);
 
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
 				}
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				else
+					InstrCountFiltered2(node, 1);
 				break;
 
 			case HJ_FILL_INNER_TUPLES:
@@ -515,7 +960,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					advance_from_probing = true;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
 					continue;
 				}
 
@@ -533,61 +979,110 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 			case HJ_NEED_NEW_BATCH:
 
+				phj_batch = hashtable->batches[hashtable->curbatch].shared;
 				/*
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
-				if (parallel)
+				if (!ExecParallelHashJoinNewBatch(node))
+					return NULL;	/* end of parallel-aware join */
+
+				if (node->last_worker
+					&& HJ_FILL_OUTER(node) && phj_batch->parallel_hashloop_fallback)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-aware join */
+					node->last_worker = false;
+					node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT;
+					break;
 				}
-				else
+				if (node->hj_HashTable->curbatch == 0)
 				{
-					if (!ExecHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-oblivious join */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					break;
 				}
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				advance_from_probing = false;
+				node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+				/* FALL THRU */
+
+			case HJ_NEED_NEW_INNER_CHUNK:
+
+				if (hashtable->curbatch == -1 || hashtable->curbatch == 0)
+					/*
+					 * If we're not attached to a batch at all then we need to
+					 * go to HJ_NEED_NEW_BATCH. Also batch 0 doesn't have more
+					 * than 1 chunk.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				else if (!ExecParallelHashJoinNewChunk(node, advance_from_probing))
+					/* If there's no next chunk then go to the next batch */
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				else
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
+
+				outer_acc = hashtable->batches[hashtable->curbatch].outer_tuples;
+				sts_reinitialize(outer_acc);
+				sts_begin_parallel_scan(outer_acc);
+
+				node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER;
+				/* FALL THRU */
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+
+				Assert(node->combined_bitmap != NULL);
+
+				outer_acc = node->hj_HashTable->batches[node->hj_HashTable->curbatch].outer_tuples;
+
+				MinimalTuple tuple;
+				do
+				{
+					tupleMetadata metadata;
+					if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
+						break;
+
+					int bytenum = metadata.tupleid / 8;
+					unsigned char bit = metadata.tupleid % 8;
+					unsigned char byte_to_check = 0;
+
+					/* seek to byte to check */
+					if (BufFileSeek(node->combined_bitmap, 0, bytenum, SEEK_SET))
+						ereport(ERROR,
+						        (errcode_for_file_access(),
+							        errmsg("could not rewind shared outer temporary file: %m")));
+					// read byte containing ntuple bit
+					if (BufFileRead(node->combined_bitmap, &byte_to_check, 1) == 0)
+						ereport(ERROR,
+						        (errcode_for_file_access(),
+							        errmsg("could not read byte in outer match status bitmap: %m.")));
+					// if bit is set
+					bool match = ((byte_to_check) >> bit) & 1;
+					if (!match)
+						break;
+				} while (1);
+
+				if (tuple == NULL)
+				{
+					sts_end_parallel_scan(outer_acc);
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/* Emit the unmatched tuple */
+				ExecForceStoreMinimalTuple(tuple,
+				                           econtext->ecxt_outertuple,
+				                           false);
+				econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+
+				return ExecProject(node->js.ps.ps_ProjInfo);
+
+
 			default:
 				elog(ERROR, "unrecognized hashjoin state: %d",
-					 (int) node->hj_JoinState);
+				     (int) node->hj_JoinState);
 		}
 	}
 }
 
-/* ----------------------------------------------------------------
- *		ExecHashJoin
- *
- *		Parallel-oblivious version.
- * ----------------------------------------------------------------
- */
-static TupleTableSlot *			/* return: a tuple or NULL */
-ExecHashJoin(PlanState *pstate)
-{
-	/*
-	 * On sufficiently smart compilers this should be inlined with the
-	 * parallel-aware branches removed.
-	 */
-	return ExecHashJoinImpl(pstate, false);
-}
-
-/* ----------------------------------------------------------------
- *		ExecParallelHashJoin
- *
- *		Parallel-aware version.
- * ----------------------------------------------------------------
- */
-static TupleTableSlot *			/* return: a tuple or NULL */
-ExecParallelHashJoin(PlanState *pstate)
-{
-	/*
-	 * On sufficiently smart compilers this should be inlined with the
-	 * parallel-oblivious branches removed.
-	 */
-	return ExecHashJoinImpl(pstate, true);
-}
-
 /* ----------------------------------------------------------------
  *		ExecInitHashJoin
  *
@@ -622,6 +1117,17 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->js.ps.ExecProcNode = ExecHashJoin;
 	hjstate->js.jointype = node->join.jointype;
 
+	hjstate->hashloop_fallback = false;
+	hjstate->hj_InnerPageOffset = 0L;
+	hjstate->hj_InnerFirstChunk = false;
+	hjstate->hj_OuterCurrentByte = 0;
+
+	hjstate->hj_OuterMatchStatusesFile = NULL;
+	hjstate->hj_OuterTupleCount  = 0;
+	hjstate->hj_InnerExhausted = false;
+
+	hjstate->last_worker = false;
+	hjstate->combined_bitmap = NULL;
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -773,6 +1279,29 @@ ExecEndHashJoin(HashJoinState *node)
 	ExecEndNode(innerPlanState(node));
 }
 
+
+static TupleTableSlot *
+emitUnmatchedOuterTuple(ExprState *otherqual, ExprContext *econtext, HashJoinState *hjstate)
+{
+	if (hjstate->hj_MatchedOuter)
+		return NULL;
+
+	if (!HJ_FILL_OUTER(hjstate))
+		return NULL;
+
+	econtext->ecxt_innertuple = hjstate->hj_NullInnerTupleSlot;
+	/*
+	 * Generate a fake join tuple with nulls for the inner
+	 * tuple, and return it if it passes the non-join quals.
+	 */
+
+	if (otherqual == NULL || ExecQual(otherqual, econtext))
+		return ExecProject(hjstate->js.ps.ps_ProjInfo);
+
+	InstrCountFiltered2(hjstate, 1);
+	return NULL;
+}
+
 /*
  * ExecHashJoinOuterGetTuple
  *
@@ -900,13 +1429,19 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	{
 		MinimalTuple tuple;
 
+		tupleMetadata metadata;
+		int tupleid;
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+									   &metadata);
 		if (tuple != NULL)
 		{
+			// where is this hashvalue being used?
+			*hashvalue = metadata.hashvalue;
+			tupleid = metadata.tupleid;
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+			hjstate->hj_OuterTupleSlot->tuplenum = tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
@@ -919,20 +1454,17 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 }
 
 /*
- * ExecHashJoinNewBatch
+ * ExecHashJoinAdvanceBatch
  *		switch to a new hashjoin batch
  *
  * Returns true if successful, false if there are no more batches.
  */
 static bool
-ExecHashJoinNewBatch(HashJoinState *hjstate)
+ExecHashJoinAdvanceBatch(HashJoinState *hjstate)
 {
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
@@ -1007,10 +1539,35 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		curbatch++;
 	}
 
+	hjstate->hj_InnerPageOffset = 0L;
+	hjstate->hj_InnerFirstChunk = true;
+	hjstate->hashloop_fallback = false; /* new batch, so start it off false */
+	if (hjstate->hj_OuterMatchStatusesFile != NULL)
+		BufFileClose(hjstate->hj_OuterMatchStatusesFile);
+	hjstate->hj_OuterMatchStatusesFile = NULL;
 	if (curbatch >= nbatch)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	return true;
+}
+
+/*
+ * Returns true if there are more chunks left, false otherwise
+ */
+static bool ExecHashJoinLoadInnerBatch(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int curbatch = hashtable->curbatch;
+	BufFile    *innerFile;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+
+	off_t tup_start_offset;
+	off_t chunk_start_offset;
+	off_t tup_end_offset;
+	int64 current_saved_size;
+	int current_fileno;
 
 	/*
 	 * Reload the hash table with the new inner batch (which could be empty)
@@ -1019,171 +1576,59 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 
 	innerFile = hashtable->innerBatchFile[curbatch];
 
+	/* Reset this even if the innerfile is not null */
+	hjstate->hj_InnerFirstChunk = hjstate->hj_InnerPageOffset == 0L;
+
 	if (innerFile != NULL)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		/* TODO: should fileno always be 0? */
+		if (BufFileSeek(innerFile, 0, hjstate->hj_InnerPageOffset, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not rewind hash-join temporary file: %m")));
 
+		chunk_start_offset = hjstate->hj_InnerPageOffset;
+		tup_end_offset = hjstate->hj_InnerPageOffset;
 		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
 												 innerFile,
 												 &hashvalue,
 												 hjstate->hj_HashTupleSlot)))
 		{
+			/* next tuple's start is last tuple's end */
+			tup_start_offset = tup_end_offset;
+			/* after we got the tuple, figure out what the offset is */
+			BufFileTell(innerFile, &current_fileno, &tup_end_offset);
+			current_saved_size = tup_end_offset - chunk_start_offset;
+			if (current_saved_size > work_mem)
+			{
+				hjstate->hj_InnerPageOffset = tup_start_offset;
+				hjstate->hashloop_fallback = true;
+				return true;
+			}
+			hjstate->hj_InnerPageOffset = tup_end_offset;
 			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
+			 * NOTE: some tuples may be sent to future batches.
+			 * With current hashloop patch, however, it is not possible
+			 * for hashtable->nbatch to be increased here
 			 */
 			ExecHashTableInsert(hashtable, slot, hashvalue);
 		}
 
+		/* this is the end of the file */
+		hjstate->hj_InnerPageOffset = 0L;
+
 		/*
-		 * after we build the hash table, the inner batch file is no longer
+		 * after we processed all chunks, the inner batch file is no longer
 		 * needed
 		 */
 		BufFileClose(innerFile);
 		hashtable->innerBatchFile[curbatch] = NULL;
 	}
 
-	/*
-	 * Rewind outer batch file (if present), so that we can start reading it.
-	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
-	{
-		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
-	}
-
-	return true;
-}
-
-/*
- * Choose a batch to work on, and attach to it.  Returns true if successful,
- * false if there are no more batches.
- */
-static bool
-ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
-{
-	HashJoinTable hashtable = hjstate->hj_HashTable;
-	int			start_batchno;
-	int			batchno;
-
-	/*
-	 * If we started up so late that the batch tracking array has been freed
-	 * already by ExecHashTableDetach(), then we are finished.  See also
-	 * ExecParallelHashEnsureBatchAccessors().
-	 */
-	if (hashtable->batches == NULL)
-		return false;
-
-	/*
-	 * If we were already attached to a batch, remember not to bother checking
-	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
-	 */
-	if (hashtable->curbatch >= 0)
-	{
-		hashtable->batches[hashtable->curbatch].done = true;
-		ExecHashTableDetachBatch(hashtable);
-	}
-
-	/*
-	 * Search for a batch that isn't done.  We use an atomic counter to start
-	 * our search at a different batch in every participant when there are
-	 * more batches than participants.
-	 */
-	batchno = start_batchno =
-		pg_atomic_fetch_add_u32(&hashtable->parallel_state->distributor, 1) %
-		hashtable->nbatch;
-	do
-	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
-		if (!hashtable->batches[batchno].done)
-		{
-			SharedTuplestoreAccessor *inner_tuples;
-			Barrier    *batch_barrier =
-			&hashtable->batches[batchno].shared->batch_barrier;
-
-			switch (BarrierAttach(batch_barrier))
-			{
-				case PHJ_BATCH_ELECTING:
-
-					/* One backend allocates the hash table. */
-					if (BarrierArriveAndWait(batch_barrier,
-											 WAIT_EVENT_HASH_BATCH_ELECTING))
-						ExecParallelHashTableAlloc(hashtable, batchno);
-					/* Fall through. */
-
-				case PHJ_BATCH_ALLOCATING:
-					/* Wait for allocation to complete. */
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_ALLOCATING);
-					/* Fall through. */
-
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOADING);
-					/* Fall through. */
-
-				case PHJ_BATCH_PROBING:
-
-					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
-					 */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
-					return true;
-
-				case PHJ_BATCH_DONE:
-
-					/*
-					 * Already done.  Detach and go around again (if any
-					 * remain).
-					 */
-					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
-					hashtable->curbatch = -1;
-					break;
-
-				default:
-					elog(ERROR, "unexpected batch phase %d",
-						 BarrierPhase(batch_barrier));
-			}
-		}
-		batchno = (batchno + 1) % hashtable->nbatch;
-	} while (batchno != start_batchno);
-
 	return false;
 }
 
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1377,6 +1822,7 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1394,8 +1840,10 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 
 			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
 									  &batchno);
-			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+			metadata.hashvalue = hashvalue;
+			SharedTuplestoreAccessor *accessor = hashtable->batches[batchno].outer_tuples;
+			metadata.tupleid = sts_increment_tuplenum(accessor);
+			sts_puttuple(accessor, &metadata, mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
@@ -1444,6 +1892,7 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	 * and space_allowed.
 	 */
 	pstate->nbatch = 0;
+	pstate->batch_increases = 0;
 	pstate->space_allowed = 0;
 	pstate->batches = InvalidDsaPointer;
 	pstate->old_batches = InvalidDsaPointer;
@@ -1483,7 +1932,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *cxt)
 	/*
 	 * It would be possible to reuse the shared hash table in single-batch
 	 * cases by resetting and then fast-forwarding build_barrier to
-	 * PHJ_BUILD_DONE and batch 0's batch_barrier to PHJ_BATCH_PROBING, but
+	 * PHJ_BUILD_DONE and batch 0's batch_barrier to PHJ_BATCH_CHUNKING, but
 	 * currently shared hash tables are already freed by now (by the last
 	 * participant to detach from the batch).  We could consider keeping it
 	 * around for single-batch joins.  We'd also need to adjust
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7410b2ff5e..ddaa3f6a5a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3767,6 +3767,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_LOADING:
 			event_name = "Hash/Batch/Loading";
 			break;
+		case WAIT_EVENT_HASH_BATCH_PROBING:
+			event_name = "Hash/Batch/Probing";
+			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATING:
 			event_name = "Hash/Build/Allocating";
 			break;
@@ -3779,6 +3782,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
 			event_name = "Hash/Build/HashingOuter";
 			break;
+		case WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES:
+			event_name = "Hash/Build/CreateOuterMatchStatusBitmapFiles";
+			break;
 		case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
 			event_name = "Hash/GrowBatches/Allocating";
 			break;
@@ -3803,6 +3809,21 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
 			event_name = "Hash/GrowBuckets/Reinserting";
 			break;
+		case WAIT_EVENT_HASH_CHUNK_ELECTING:
+			event_name = "Hash/Chunk/Electing";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_LOADING:
+			event_name = "Hash/Chunk/Loading";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_PROBING:
+			event_name = "Hash/Chunk/Probing";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_DONE:
+			event_name = "Hash/Chunk/Done";
+			break;
+		case WAIT_EVENT_HASH_ADVANCE_CHUNK:
+			event_name = "Hash/Chunk/Final";
+			break;
 		case WAIT_EVENT_LOGICAL_SYNC_DATA:
 			event_name = "LogicalSyncData";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 440ff77e1f..dd4267bb7f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -269,6 +269,57 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
 	return file;
 }
 
+/*
+ * Open a shared file created by any backend if it exists, otherwise return NULL
+ */
+BufFile *
+BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name)
+{
+	BufFile    *file;
+	char		segment_name[MAXPGPATH];
+	Size		capacity = 16;
+	File	   *files;
+	int			nfiles = 0;
+
+	files = palloc(sizeof(File) * capacity);
+
+	/*
+	 * We don't know how many segments there are, so we'll probe the
+	 * filesystem to find out.
+	 */
+	for (;;)
+	{
+		/* See if we need to expand our file segment array. */
+		if (nfiles + 1 > capacity)
+		{
+			capacity *= 2;
+			files = repalloc(files, sizeof(File) * capacity);
+		}
+		/* Try to load a segment. */
+		SharedSegmentName(segment_name, name, nfiles);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		if (files[nfiles] <= 0)
+			break;
+		++nfiles;
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	/*
+	 * If we didn't find any files at all, then no BufFile exists with this
+	 * name.
+	 */
+	if (nfiles == 0)
+		return NULL;
+	file = makeBufFileCommon(nfiles);
+	file->files = files;
+	file->readOnly = true;		/* Can't write to files opened this way */
+	file->fileset = fileset;
+	file->name = pstrdup(name);
+
+	return file;
+}
+
 /*
  * Open a file that was previously created in another backend (or this one)
  * with BufFileCreateShared in the same SharedFileSet using the same name.
@@ -843,3 +894,16 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+BufFile *BufFileRewindIfExists(BufFile *bufFile)
+{
+	if (bufFile != NULL)
+	{
+		if (BufFileSeek(bufFile, 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+							errmsg("could not rewind hash-join temporary file: %m")));
+		return bufFile;
+	}
+	return NULL;
+}
diff --git a/src/backend/storage/ipc/barrier.c b/src/backend/storage/ipc/barrier.c
index 83cbe33107..392fb930f9 100644
--- a/src/backend/storage/ipc/barrier.c
+++ b/src/backend/storage/ipc/barrier.c
@@ -195,6 +195,91 @@ BarrierArriveAndWait(Barrier *barrier, uint32 wait_event_info)
 	return elected;
 }
 
+/*
+ * Arrive at this barrier, wait for all other attached participants to arrive
+ * too and then return.  Sets the current phase to next_phase.  The caller must
+ * be attached.
+ *
+ * While waiting, pg_stat_activity shows a wait_event_type and wait_event
+ * controlled by the wait_event_info passed in, which should be a value from
+ * one of the WaitEventXXX enums defined in pgstat.h.
+ *
+ * Return true in one arbitrarily chosen participant.  Return false in all
+ * others.  The return code can be used to elect one participant to execute a
+ * phase of work that must be done serially while other participants wait.
+ */
+bool
+BarrierArriveExplicitAndWait(Barrier *barrier, int next_phase, uint32 wait_event_info)
+{
+	bool		release = false;
+	bool		elected;
+	int			start_phase;
+
+	SpinLockAcquire(&barrier->mutex);
+	start_phase = barrier->phase;
+	++barrier->arrived;
+	if (barrier->arrived == barrier->participants)
+	{
+		release = true;
+		barrier->arrived = 0;
+		barrier->phase = next_phase;
+		barrier->elected = next_phase;
+	}
+	SpinLockRelease(&barrier->mutex);
+
+	/*
+	 * If we were the last expected participant to arrive, we can release our
+	 * peers and return true to indicate that this backend has been elected to
+	 * perform any serial work.
+	 */
+	if (release)
+	{
+		ConditionVariableBroadcast(&barrier->condition_variable);
+
+		return true;
+	}
+
+	/*
+	 * Otherwise we have to wait for the last participant to arrive and
+	 * advance the phase.
+	 */
+	elected = false;
+	ConditionVariablePrepareToSleep(&barrier->condition_variable);
+	for (;;)
+	{
+		/*
+		 * We know that phase must either be start_phase, indicating that we
+		 * need to keep waiting, or next_phase, indicating that the last
+		 * participant that we were waiting for has either arrived or detached
+		 * so that the next phase has begun.  The phase cannot advance any
+		 * further than that without this backend's participation, because
+		 * this backend is attached.
+		 */
+		SpinLockAcquire(&barrier->mutex);
+		Assert(barrier->phase == start_phase || barrier->phase == next_phase);
+		release = barrier->phase == next_phase;
+		if (release && barrier->elected != next_phase)
+		{
+			/*
+			 * Usually the backend that arrives last and releases the other
+			 * backends is elected to return true (see above), so that it can
+			 * begin processing serial work while it has a CPU timeslice.
+			 * However, if the barrier advanced because someone detached, then
+			 * one of the backends that is awoken will need to be elected.
+			 */
+			barrier->elected = barrier->phase;
+			elected = true;
+		}
+		SpinLockRelease(&barrier->mutex);
+		if (release)
+			break;
+		ConditionVariableSleep(&barrier->condition_variable, wait_event_info);
+	}
+	ConditionVariableCancelSleep();
+
+	return elected;
+}
+
 /*
  * Arrive at this barrier, but detach rather than waiting.  Returns true if
  * the caller was the last to detach.
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 7765a445c0..8365254fc4 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -59,10 +59,11 @@ typedef struct SharedTuplestoreParticipant
 /* The control object that lives in shared memory. */
 struct SharedTuplestore
 {
-	int			nparticipants;	/* Number of participants that can write. */
-	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
-	size_t		meta_data_size; /* Size of per-tuple header. */
-	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
+	int              nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples; // TODO: does this belong elsewhere
+	int              flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
+	size_t           meta_data_size; /* Size of per-tuple header. */
+	char             name[NAMEDATALEN];	/* A name for this tuplestore. */
 
 	/* Followed by per-participant shared state. */
 	SharedTuplestoreParticipant participants[FLEXIBLE_ARRAY_MEMBER];
@@ -92,10 +93,15 @@ struct SharedTuplestoreAccessor
 	BlockNumber write_page;		/* The next page to write to. */
 	char	   *write_pointer;	/* Current write pointer within chunk. */
 	char	   *write_end;		/* One past the end of the current chunk. */
+
+	/* Bitmap of matched outer tuples (currently only used for hashjoin). */
+	BufFile    *outer_match_status_file;
 };
 
 static void sts_filename(char *name, SharedTuplestoreAccessor *accessor,
 						 int participant);
+static void
+sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant);
 
 /*
  * Return the amount of shared memory required to hold SharedTuplestore for a
@@ -137,6 +143,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -166,6 +173,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	accessor->sts = sts;
 	accessor->fileset = fileset;
 	accessor->context = CurrentMemoryContext;
+	accessor->outer_match_status_file = NULL;
 
 	return accessor;
 }
@@ -343,6 +351,7 @@ sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
 			sts_flush_chunk(accessor);
 		}
 
+		// TODO: exercise this code with a test (over-sized tuple)
 		/* It may still not be enough in the case of a gigantic tuple. */
 		if (accessor->write_pointer + size >= accessor->write_end)
 		{
@@ -621,6 +630,116 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+// TODO: fix signedness
+int sts_increment_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+void
+sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor)
+{
+	uint32 tuplenum = pg_atomic_read_u32(&accessor->sts->ntuples);
+	/* don't make the outer match status file if there are no tuples */
+	if (tuplenum == 0)
+		return;
+
+	char name[MAXPGPATH];
+	sts_bitmap_filename(name, accessor, accessor->participant);
+
+	accessor->outer_match_status_file = BufFileCreateShared(accessor->fileset, name);
+
+	// TODO: check this math. tuplenumber will be too high.
+	uint32 num_to_write = tuplenum / 8 + 1;
+
+	unsigned char byteToWrite = 0;
+	BufFileWrite(accessor->outer_match_status_file, &byteToWrite, num_to_write);
+
+	if (BufFileSeek(accessor->outer_match_status_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+		        (errcode_for_file_access(),
+			        errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+void
+sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum)
+{
+	BufFile *parallel_outer_matchstatuses = accessor->outer_match_status_file;
+	unsigned char current_outer_byte;
+
+	BufFileSeek(parallel_outer_matchstatuses, 0, tuplenum / 8, SEEK_SET);
+	BufFileRead(parallel_outer_matchstatuses, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (tuplenum % 8);
+
+	if (BufFileSeek(parallel_outer_matchstatuses, 0, -1, SEEK_CUR) != 0)
+		elog(ERROR, "there is a problem with outer match status file. pid %i.", MyProcPid);
+	BufFileWrite(parallel_outer_matchstatuses, &current_outer_byte, 1);
+}
+
+void
+sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor)
+{
+	BufFileClose(accessor->outer_match_status_file);
+}
+
+BufFile *sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor)
+{
+	// TODO: this tries to close an outer match status file for
+	// each participant in the tuplestore. technically, only participants
+	// in the barrier could have outer match status files, however,
+	// all but one participant continue on and detach from the barrier
+	// so we won't have a reliable way to close only files for those attached
+	// to the barrier
+	BufFile **statuses = palloc(sizeof(BufFile *) * accessor->sts->nparticipants);
+
+	// Open the bitmap shared BufFile from each participant. TODO: explain why file can be NULLs
+	int statuses_length = 0;
+	for (int i = 0; i < accessor->sts->nparticipants; i++)
+	{
+		char bitmap_filename[MAXPGPATH];
+		sts_bitmap_filename(bitmap_filename, accessor, i);
+		BufFile *file = BufFileOpenSharedIfExists(accessor->fileset, bitmap_filename);
+
+		if (file != NULL)
+			statuses[statuses_length++] = file;
+	}
+
+	BufFile *combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++) // make it while not EOF
+	{
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+		        (errcode_for_file_access(),
+			        errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	return combined_bitmap_file;
+}
+
+
+static void
+sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant)
+{
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->sts->name, participant);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/executor/adaptiveHashjoin.h b/src/include/executor/adaptiveHashjoin.h
new file mode 100644
index 0000000000..ee189f5e2b
--- /dev/null
+++ b/src/include/executor/adaptiveHashjoin.h
@@ -0,0 +1,9 @@
+#ifndef ADAPTIVE_HASHJOIN_H
+#define ADAPTIVE_HASHJOIN_H
+
+
+extern bool ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing);
+extern bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+
+
+#endif /* ADAPTIVE_HASHJOIN_H */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 2c94b926d3..4500300356 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -148,17 +148,32 @@ typedef struct HashMemoryChunkData *HashMemoryChunk;
  * followed by variable-sized objects, they are arranged in contiguous memory
  * but not accessed directly as an array.
  */
+// TODO: maybe remove lock from ParallelHashJoinBatch and use pstate->lock
+// and the PHJBatchAccessor to coordinate access to the PHJ batch similar to
+// other users of that lock
 typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
 
+	/* Parallel Adaptive Hash Join members */
+	/*
+	 * after finishing build phase, parallel_hashloop_fallback cannot change,
+	 * and does not require a lock to read
+	 */
+	bool    parallel_hashloop_fallback;
+	int     total_num_chunks;
+	int     current_chunk_num;
+	size_t  estimated_chunk_size;
+	Barrier chunk_barrier;
+	LWLock  lock;
+
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
 	size_t		estimated_size; /* size of buckets + chunks while writing */
 	size_t		ntuples;		/* number of tuples loaded */
 	size_t		old_ntuples;	/* number of tuples before repartitioning */
-	bool		space_exhausted;
+	bool    space_exhausted;
 
 	/*
 	 * Variable-sized SharedTuplestore objects follow this struct in memory.
@@ -243,6 +258,7 @@ typedef struct ParallelHashJoinState
 	int			nparticipants;
 	size_t		space_allowed;
 	size_t		total_tuples;	/* total number of inner tuples */
+	int batch_increases; // TODO: make this an atomic so I don't need the lock to increment it?
 	LWLock		lock;			/* lock protecting the above */
 
 	Barrier		build_barrier;	/* synchronization for the build phases */
@@ -263,10 +279,16 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
+#define PHJ_BATCH_CHUNKING				2
+#define PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING 3
 #define PHJ_BATCH_DONE					4
 
+#define PHJ_CHUNK_ELECTING				0
+#define PHJ_CHUNK_LOADING				1
+#define PHJ_CHUNK_PROBING				2
+#define PHJ_CHUNK_DONE					3
+#define PHJ_CHUNK_FINAL					4
+
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
 #define PHJ_GROW_BATCHES_ALLOCATING		1
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index fc80f03aa8..39500f755c 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -40,9 +40,8 @@ extern void ExecHashTableInsert(HashJoinTable hashtable,
 extern void ExecParallelHashTableInsert(HashJoinTable hashtable,
 										TupleTableSlot *slot,
 										uint32 hashvalue);
-extern void ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
-													TupleTableSlot *slot,
-													uint32 hashvalue);
+extern void
+ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable, TupleTableSlot *slot, uint32 hashvalue);
 extern bool ExecHashGetHashValue(HashJoinTable hashtable,
 								 ExprContext *econtext,
 								 List *hashkeys,
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 04feaba55d..e2e7d0a58c 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32 		tuplenum;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,7 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
-
+	slot->tuplenum = 0;
 	return slot;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0c2a77aaf8..e569cea3c2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -14,6 +14,7 @@
 #ifndef EXECNODES_H
 #define EXECNODES_H
 
+#include <storage/buffile.h>
 #include "access/tupconvert.h"
 #include "executor/instrument.h"
 #include "fmgr.h"
@@ -1943,6 +1944,22 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+
+	/* hashloop fallback */
+	bool hashloop_fallback;
+	/* hashloop fallback inner side */
+	bool hj_InnerFirstChunk;
+	bool hj_InnerExhausted;
+	off_t hj_InnerPageOffset;
+
+	/* hashloop fallback outer side */
+	unsigned char hj_OuterCurrentByte;
+	BufFile *hj_OuterMatchStatusesFile; /* serial AHJ */
+	int64 hj_OuterTupleCount;
+
+	/* parallel hashloop fallback outer side */
+	bool last_worker;
+	BufFile *combined_bitmap;
 } HashJoinState;
 
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f2e873d048..c117a16d43 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -815,6 +815,7 @@ typedef enum
  * it is waiting for a notification from another process.
  * ----------
  */
+// TODO: add WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES?
 typedef enum
 {
 	WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
@@ -827,10 +828,12 @@ typedef enum
 	WAIT_EVENT_HASH_BATCH_ALLOCATING,
 	WAIT_EVENT_HASH_BATCH_ELECTING,
 	WAIT_EVENT_HASH_BATCH_LOADING,
+	WAIT_EVENT_HASH_BATCH_PROBING,
 	WAIT_EVENT_HASH_BUILD_ALLOCATING,
 	WAIT_EVENT_HASH_BUILD_ELECTING,
 	WAIT_EVENT_HASH_BUILD_HASHING_INNER,
 	WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+	WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES,
 	WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
 	WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
@@ -839,6 +842,11 @@ typedef enum
 	WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+	WAIT_EVENT_HASH_CHUNK_ELECTING,
+	WAIT_EVENT_HASH_CHUNK_LOADING,
+	WAIT_EVENT_HASH_CHUNK_PROBING,
+	WAIT_EVENT_HASH_CHUNK_DONE,
+	WAIT_EVENT_HASH_ADVANCE_CHUNK,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
 	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
 	WAIT_EVENT_MQ_INTERNAL,
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index 1903897eef..7c578d3935 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -36,6 +36,7 @@ typedef struct Barrier
 
 extern void BarrierInit(Barrier *barrier, int num_workers);
 extern bool BarrierArriveAndWait(Barrier *barrier, uint32 wait_event_info);
+extern bool BarrierArriveExplicitAndWait(Barrier *barrier, int next_phase, uint32 wait_event_info);
 extern bool BarrierArriveAndDetach(Barrier *barrier);
 extern int	BarrierAttach(Barrier *barrier);
 extern bool BarrierDetach(Barrier *barrier);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 1fba404fe2..6539dc34bb 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,10 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
+extern BufFile *BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name);
 extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
 
+extern BufFile *BufFileRewindIfExists(BufFile *bufFile);
+
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index f9450dac90..f2575d024f 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -212,6 +212,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_LOCK_MANAGER,
 	LWTRANCHE_PREDICATE_LOCK_MANAGER,
 	LWTRANCHE_PARALLEL_HASH_JOIN,
+	LWTRANCHE_PARALLEL_HASH_JOIN_BATCH,
 	LWTRANCHE_PARALLEL_QUERY_DSA,
 	LWTRANCHE_SESSION_DSA,
 	LWTRANCHE_SESSION_RECORD_TABLE,
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9dea626e84..23768683fc 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,16 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+// TODO: conflicting types for tupleid with accessor->sts->ntuples (uint32)
+// TODO: use a union for tupleid (uint32) (make this a uint64) and chunk number (int)
+struct tupleMetadata
+{
+	uint32 hashvalue;
+	int tupleid; /* tuple id on outer side and chunk number for inner side */
+} __attribute__((packed));
+// TODO: make sure I can get rid of packed now that using sizeof(struct)
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -58,4 +68,13 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+
+extern int sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
+
+extern void sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor);
+extern void sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum);
+extern void sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor);
+extern BufFile *sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor);
+
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/adaptive_hj.out b/src/test/regress/expected/adaptive_hj.out
new file mode 100644
index 0000000000..fe24acd255
--- /dev/null
+++ b/src/test/regress/expected/adaptive_hj.out
@@ -0,0 +1,1233 @@
+-- TODO: remove some of these tests and make the test file faster
+create schema adaptive_hj;
+set search_path=adaptive_hj;
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t1(a int);
+create table t2(b int);
+-- serial setup
+set work_mem=64;
+set enable_mergejoin to off;
+-- TODO: make this function general
+create or replace function explain_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Serial_Test_1 reset
+-- TODO: refactor into procedure or change to drop table
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_1 setup
+truncate table t1;
+insert into t1 values(1),(2);
+insert into t1 select i from generate_series(1,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+truncate table t2;
+insert into t2 values(2),(3),(11);
+insert into t2 select i from generate_series(2,10)i;
+insert into t2 select 2 from generate_series(2,7)i;
+-- Serial_Test_1.1
+-- TODO: automate the checking for expected number of chunks (explain option?)
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 falls back with 2 chunks with 2 unmatched tuples emitted at EOB 
+-- batch 3 falls back with 5 chunks with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=67 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=18 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=18 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+  1 |   
+  1 |   
+(67 rows)
+
+select * from t1, t2 where a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+(65 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+(66 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+  1 |   
+  1 |   
+(68 rows)
+
+-- Serial_Test_1.2 setup
+analyze t1; analyze t2;
+-- Serial_Test_1.2
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Right Join (actual rows=67 loops=1)
+         Hash Cond: (t2.b = t1.a)
+         ->  Seq Scan on t2 (actual rows=18 loops=1)
+         ->  Hash (actual rows=17 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t1 (actual rows=17 loops=1)
+(7 rows)
+
+-- Serial_Test_2 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_2 setup:
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+-- Serial_Test_2.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 does not fall back with 1 unmatched tuple
+-- batch 3 does not fall back with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=7 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=4 loops=1)
+         ->  Hash (actual rows=5 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=5 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 1 |  
+(7 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+(7 rows)
+
+-- TODO: check coverage for emitting ummatched inner tuples
+-- Serial_Test_2.1.a
+-- results checking for inner join
+select * from t1 left outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 1 |  
+(7 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+(6 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+(7 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+ 1 |  
+(8 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+(6 rows)
+
+-- Serial_Test_2.2
+analyze t1; analyze t2;
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Right Join (actual rows=7 loops=1)
+         Hash Cond: (t2.b = t1.a)
+         ->  Seq Scan on t2 (actual rows=5 loops=1)
+         ->  Hash (actual rows=4 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t1 (actual rows=4 loops=1)
+(7 rows)
+
+-- Serial_Test_3 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_3 setup:
+truncate table t1;
+insert into t1 values(1),(1);
+insert into t1 select 2 from generate_series(1,7)i;
+insert into t1 select i from generate_series(3,10)i;
+truncate table t2;
+insert into t2 select 2 from generate_series(1,7)i;
+insert into t2 values(3),(3);
+insert into t2 select i from generate_series(5,9)i;
+-- Serial_Test_3.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with 1 unmatched tuple
+-- batch 2 does not fall back with 2 unmatched tuples
+-- batch 3 falls back with 4 chunks with 1 unmatched tuple
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=60 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=14 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=14 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a  | b 
+----+---
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  3 | 3
+  3 | 3
+  5 | 5
+  6 | 6
+  7 | 7
+  8 | 8
+  9 | 9
+  1 |  
+  1 |  
+  4 |  
+ 10 |  
+(60 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a  | b 
+----+---
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  3 | 3
+  3 | 3
+  5 | 5
+  6 | 6
+  7 | 7
+  8 | 8
+  9 | 9
+  1 |  
+  1 |  
+  4 |  
+ 10 |  
+(60 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+-- Serial_Test_3.2 
+-- swap join order
+select * from t2 left outer join t1 on a = b order by a, b;
+ b | a 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t2, t1 where a = b order by a;
+ b | a 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t2 right outer join t1 on a = b order by b, a;
+ b | a  
+---+----
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 3 |  3
+ 3 |  3
+ 5 |  5
+ 6 |  6
+ 7 |  7
+ 8 |  8
+ 9 |  9
+   |  1
+   |  1
+   |  4
+   | 10
+(60 rows)
+
+select * from t2 full outer join t1 on a = b order by a, b;
+ b | a  
+---+----
+   |  1
+   |  1
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 3 |  3
+ 3 |  3
+   |  4
+ 5 |  5
+ 6 |  6
+ 7 |  7
+ 8 |  8
+ 9 |  9
+   | 10
+(60 rows)
+
+-- Serial_Test_3.3 setup
+analyze t1; analyze t2;
+-- Serial_Test_3.3
+-- doesn't spill
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=60 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=14 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=14 loops=1)
+(7 rows)
+
+-- Serial_Test_4 setup
+drop table t1;
+create table t1(b int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+drop table t2;
+create table t2(a int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+-- Serial_Test_4.1
+-- spills in 32 batches
+--batch 0 does not fall back with 1 unmatched outer tuple (15)
+--batch 1 falls back with 396 chunks.
+--batch 2 falls back with 402 chunks with 1 unmatched outer tuple (1)
+--batch 3 falls back with 389 chunks with 1 unmatched outer tuple (8)
+--batch 4 falls back with 409 chunks with no unmatched outer tuples
+--batch 5 falls back with 366 chunks with 1 unmatched outer tuple (4)
+--batch 6 falls back with 407 chunks with 1 unmatched outer tuple (11)
+--batch 7 falls back with 382 chunks with unmatched outer tuple (10)
+--batch 8 falls back with 413 chunks with no unmatched outer tuples
+--batch 9 falls back with 371 chunks with 1 unmatched outer tuple (3)
+--batch 10 falls back with 389 chunks with no unmatched outer tuples
+--batch 11 falls back with 408 chunks with no unmatched outer tuples
+--batch 12 falls back with 387 chunks with no unmatched outer tuples
+--batch 13 falls back with 402 chunks with 1 unmatched outer tuple (18) 
+--batch 14 falls back with 369 chunks with 1 unmatched outer tuple (9)
+--batch 15 falls back with 387 chunks with no unmatched outer tuples
+--batch 16 falls back with 365 chunks with no unmatched outer tuples
+--batch 17 falls back with 403 chunks with 2 unmatched outer tuples (14,19)
+--batch 18 falls back with 375 chunks with no unmatched outer tuples
+--batch 19 falls back with 384 chunks with no unmatched outer tuples
+--batch 20 falls back with 377 chunks with 1 unmatched outer tuple (12)
+--batch 22 falls back with 401 chunks with no unmatched outer tuples
+--batch 23 falls back with 396 chunks with no unmatched outer tuples
+--batch 24 falls back with 387 chunks with 1 unmatched outer tuple (5)
+--batch 25 falls back with 399 chunks with 1 unmatched outer tuple (7)
+--batch 26 falls back with 387 chunks.
+--batch 27 falls back with 442 chunks.
+--batch 28 falls back with 385 chunks with 1 unmatched outer tuple (17)
+--batch 29 falls back with 375 chunks.
+--batch 30 falls back with 404 chunks with 1 unmatched outer tuple (6)
+--batch 31 falls back with 396 chunks with 2 unmatched outer tuples (13,16)
+select * from explain_multi_batch();
+                                     explain_multi_batch                                      
+----------------------------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=18210 loops=1)
+         Hash Cond: (t1.b = t2.a)
+         ->  Seq Scan on t1 (actual rows=291 loops=1)
+         ->  Hash (actual rows=25081 loops=1)
+               Buckets: 2048 (originally 1024)  Batches: 32 (originally 1)  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=25081 loops=1)
+(7 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+-- used to give wrong results because there is a whole batch of outer which is
+-- empty and so the inner doesn't emit unmatched tuples with ROJ
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 43081
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b; 
+ count 
+-------
+ 43099
+(1 row)
+
+-- Test_6 non-negligible amount of data test case
+-- TODO: doesn't finish with my code when it is set to be serial
+-- it does finish when it is parallel -- the serial version is either simply too
+-- slow or has a bug -- I tried it with less data and it did finish, so it must
+-- just be really slow
+-- inner join shouldn't even need to make the unmatched files
+-- it finishes eventually if I decrease data amount
+--drop table simple;
+--create table simple as
+ -- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+--alter table simple set (parallel_workers = 2);
+--analyze simple;
+--
+--drop table extremely_skewed;
+--create table extremely_skewed (id int, t text);
+--alter table extremely_skewed set (autovacuum_enabled = 'false');
+--alter table extremely_skewed set (parallel_workers = 2);
+--analyze extremely_skewed;
+--insert into extremely_skewed
+--  select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+--  from generate_series(1, 20000);
+--update pg_class
+--  set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+--  where relname = 'extremely_skewed';
+--set work_mem=64;
+--set enable_mergejoin to off;
+--explain (analyze, costs off, timing off)
+  --select * from simple r join extremely_skewed s using (id);
+--select * from explain_multi_batch();
+drop table t1;
+drop table t2;
+drop function explain_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema adaptive_hj;
diff --git a/src/test/regress/expected/parallel_adaptive_hj.out b/src/test/regress/expected/parallel_adaptive_hj.out
new file mode 100644
index 0000000000..e5e7f9aa4f
--- /dev/null
+++ b/src/test/regress/expected/parallel_adaptive_hj.out
@@ -0,0 +1,343 @@
+create schema parallel_adaptive_hj;
+set search_path=parallel_adaptive_hj;
+-- TODO: anti-semi-join and semi-join tests
+-- TODO: check if test2 and 3 are different at all
+-- TODO: add test for parallel-oblivious parallel hash join
+-- TODO: make this function general
+create or replace function explain_parallel_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+-- parallel setup
+set enable_nestloop to off;
+set enable_mergejoin to off;
+set  min_parallel_table_scan_size = 0;
+set  parallel_setup_cost = 0;
+set  enable_parallel_hash = on;
+set  enable_hashjoin = on;
+set  max_parallel_workers_per_gather = 1;
+set  work_mem = 64;
+-- Parallel_Test_1 setup
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+analyze t1;
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t2(b int);
+insert into t2 select i from generate_series(4,2500)i;
+insert into t2 select 2 from generate_series(1,10)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+-- Parallel_Test_1.1
+-- spills in 4 batches
+-- 1 resize of nbatches
+-- no batch falls back
+select * from explain_parallel_multi_batch();
+                                      explain_parallel_multi_batch                                       
+---------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 4 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+-- need an aggregate to exercise the code but still want to know if we are
+-- emitting the right unmatched outer tuples
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+-- Parallel_Test_1.1.a
+-- results checking for inner join
+-- doesn't fall back
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+   198
+(1 row)
+
+-- Parallel_Test_1.1.b
+-- results checking for right outer join
+-- doesn't exercise the fallback code but just checking results
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+  2687
+(1 row)
+
+-- Parallel_Test_1.1.c
+-- results checking for full outer join
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+  2689
+(1 row)
+
+-- Parallel_Test_1.2
+-- spill and doesn't have to resize nbatches
+analyze t2;
+select * from explain_parallel_multi_batch();
+                           explain_parallel_multi_batch                           
+----------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 2048  Batches: 4  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+-- Parallel_Test_1.3
+-- doesn't spill
+-- does resize nbuckets
+set work_mem = '4MB';
+select * from explain_parallel_multi_batch();
+                           explain_parallel_multi_batch                           
+----------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 4096  Batches: 1  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+set work_mem = 64;
+-- Parallel_Test_3
+-- big example
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=9105 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=146 loops=2)
+                     ->  Parallel Hash (actual rows=12540 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 16 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=12540 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+-- TODO: check what each of these is exercising -- chunk num, etc and write that
+-- down
+-- also, note that this example did reveal with ROJ that it wasn't working, so
+-- maybe keep that but it is not parallel
+-- make sure the plans make sense for the code we are writing
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 43081
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+ 43099
+(1 row)
+
+-- Parallel_Test_4
+-- spill and resize nbatches 2x
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,1000)i;
+insert into t2 select 2 from generate_series(1,4000)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+insert into t1 values(500);
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=38006 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=15 loops=2)
+                     ->  Parallel Hash (actual rows=2498 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 16 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2498 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 76011
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 76009
+(1 row)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 76997
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+ 76999
+(1 row)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 76011
+(1 row)
+
+-- Parallel_Test_5
+-- revealed race condition because two workers are working on a chunked batch
+-- only 2 unmatched tuples
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i%1111 from generate_series(200,10000)i;
+delete from t2 where b = 115;
+delete from t2 where b = 200;
+insert into t2 select 2 from generate_series(1,4000);
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 values(115);
+insert into t1 values(200);
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=363166 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=146 loops=2)
+                     ->  Parallel Hash (actual rows=6892 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 32 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=6892 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count  
+--------
+ 726331
+(1 row)
+
+-- without count(*), can't reproduce desired plan so can't rely on results
+select count(*) from t1 left outer join t2 on a = b;
+ count  
+--------
+ 726331
+(1 row)
+
+drop table if exists t1;
+drop table if exists t2;
+drop function explain_parallel_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema parallel_adaptive_hj;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d33a4e143d..0afd6db491 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 adaptive_hj parallel_adaptive_hj
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/post_schedule b/src/test/regress/post_schedule
new file mode 100644
index 0000000000..7824ecf7bf
--- /dev/null
+++ b/src/test/regress/post_schedule
@@ -0,0 +1,8 @@
+test: object_address
+test: tablesample
+test: groupingsets
+test: drop_operator
+test: password
+test: identity
+test: generated
+test: join_hash
diff --git a/src/test/regress/pre_schedule b/src/test/regress/pre_schedule
new file mode 100644
index 0000000000..4105b0fa03
--- /dev/null
+++ b/src/test/regress/pre_schedule
@@ -0,0 +1,120 @@
+# src/test/regress/serial_schedule
+# This should probably be in an order similar to parallel_schedule.
+test: tablespace
+test: boolean
+test: char
+test: name
+test: varchar
+test: text
+test: int2
+test: int4
+test: int8
+test: oid
+test: float4
+test: float8
+test: bit
+test: numeric
+test: txid
+test: uuid
+test: enum
+test: money
+test: rangetypes
+test: pg_lsn
+test: regproc
+test: strings
+test: numerology
+test: point
+test: lseg
+test: line
+test: box
+test: path
+test: polygon
+test: circle
+test: date
+test: time
+test: timetz
+test: timestamp
+test: timestamptz
+test: interval
+test: inet
+test: macaddr
+test: macaddr8
+test: tstypes
+test: geometry
+test: horology
+test: regex
+test: oidjoins
+test: type_sanity
+test: opr_sanity
+test: misc_sanity
+test: comments
+test: expressions
+test: create_function_1
+test: create_type
+test: create_table
+test: create_function_2
+test: copy
+test: copyselect
+test: copydml
+test: insert
+test: insert_conflict
+test: create_misc
+test: create_operator
+test: create_procedure
+test: create_index
+test: create_index_spgist
+test: create_view
+test: index_including
+test: index_including_gist
+test: create_aggregate
+test: create_function_3
+test: create_cast
+test: constraints
+test: triggers
+test: select
+test: inherit
+test: typed_table
+test: vacuum
+test: drop_if_exists
+test: updatable_views
+test: roleattributes
+test: create_am
+test: hash_func
+test: errors
+test: sanity_check
+test: select_into
+test: select_distinct
+test: select_distinct_on
+test: select_implicit
+test: select_having
+test: subselect
+test: union
+test: case
+test: join
+test: adaptive_hj
+test: parallel_adaptive_hj
+test: aggregates
+test: transactions
+ignore: random
+test: random
+test: portals
+test: arrays
+test: btree_index
+test: hash_index
+test: update
+test: delete
+test: namespace
+test: prepared_xacts
+test: brin
+test: gin
+test: gist
+test: spgist
+test: privileges
+test: init_privs
+test: security_label
+test: collate
+test: matview
+test: lock
+test: replica_identity
+test: rowsecurity
+
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index f86f5c5682..0dc0967a93 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -91,6 +91,8 @@ test: subselect
 test: union
 test: case
 test: join
+test: adaptive_hj
+test: parallel_adaptive_hj
 test: aggregates
 test: transactions
 ignore: random
diff --git a/src/test/regress/sql/adaptive_hj.sql b/src/test/regress/sql/adaptive_hj.sql
new file mode 100644
index 0000000000..a5af798ea8
--- /dev/null
+++ b/src/test/regress/sql/adaptive_hj.sql
@@ -0,0 +1,240 @@
+-- TODO: remove some of these tests and make the test file faster
+create schema adaptive_hj;
+set search_path=adaptive_hj;
+drop table if exists t1;
+drop table if exists t2;
+create table t1(a int);
+create table t2(b int);
+
+-- serial setup
+set work_mem=64;
+set enable_mergejoin to off;
+-- TODO: make this function general
+create or replace function explain_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Serial_Test_1 reset
+-- TODO: refactor into procedure or change to drop table
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+-- Serial_Test_1 setup
+truncate table t1;
+insert into t1 values(1),(2);
+insert into t1 select i from generate_series(1,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+truncate table t2;
+insert into t2 values(2),(3),(11);
+insert into t2 select i from generate_series(2,10)i;
+insert into t2 select 2 from generate_series(2,7)i;
+
+-- Serial_Test_1.1
+-- TODO: automate the checking for expected number of chunks (explain option?)
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 falls back with 2 chunks with 2 unmatched tuples emitted at EOB 
+-- batch 3 falls back with 5 chunks with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+
+-- Serial_Test_1.2 setup
+analyze t1; analyze t2;
+
+-- Serial_Test_1.2
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+
+-- Serial_Test_2 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+-- Serial_Test_2 setup:
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+
+-- Serial_Test_2.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 does not fall back with 1 unmatched tuple
+-- batch 3 does not fall back with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1 right outer join t2 on a = b order by a, b;
+
+-- TODO: check coverage for emitting ummatched inner tuples
+-- Serial_Test_2.1.a
+-- results checking for inner join
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+
+-- Serial_Test_2.2
+analyze t1; analyze t2;
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+
+-- Serial_Test_3 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+
+-- Serial_Test_3 setup:
+truncate table t1;
+insert into t1 values(1),(1);
+insert into t1 select 2 from generate_series(1,7)i;
+insert into t1 select i from generate_series(3,10)i;
+truncate table t2;
+insert into t2 select 2 from generate_series(1,7)i;
+insert into t2 values(3),(3);
+insert into t2 select i from generate_series(5,9)i;
+
+-- Serial_Test_3.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with 1 unmatched tuple
+-- batch 2 does not fall back with 2 unmatched tuples
+-- batch 3 falls back with 4 chunks with 1 unmatched tuple
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+
+-- Serial_Test_3.2 
+-- swap join order
+select * from t2 left outer join t1 on a = b order by a, b;
+select * from t2, t1 where a = b order by a;
+select * from t2 right outer join t1 on a = b order by b, a;
+select * from t2 full outer join t1 on a = b order by a, b;
+
+-- Serial_Test_3.3 setup
+analyze t1; analyze t2;
+
+-- Serial_Test_3.3
+-- doesn't spill
+select * from explain_multi_batch();
+
+-- Serial_Test_4 setup
+drop table t1;
+create table t1(b int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+drop table t2;
+create table t2(a int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+-- Serial_Test_4.1
+-- spills in 32 batches
+--batch 0 does not fall back with 1 unmatched outer tuple (15)
+--batch 1 falls back with 396 chunks.
+--batch 2 falls back with 402 chunks with 1 unmatched outer tuple (1)
+--batch 3 falls back with 389 chunks with 1 unmatched outer tuple (8)
+--batch 4 falls back with 409 chunks with no unmatched outer tuples
+--batch 5 falls back with 366 chunks with 1 unmatched outer tuple (4)
+--batch 6 falls back with 407 chunks with 1 unmatched outer tuple (11)
+--batch 7 falls back with 382 chunks with unmatched outer tuple (10)
+--batch 8 falls back with 413 chunks with no unmatched outer tuples
+--batch 9 falls back with 371 chunks with 1 unmatched outer tuple (3)
+--batch 10 falls back with 389 chunks with no unmatched outer tuples
+--batch 11 falls back with 408 chunks with no unmatched outer tuples
+--batch 12 falls back with 387 chunks with no unmatched outer tuples
+--batch 13 falls back with 402 chunks with 1 unmatched outer tuple (18) 
+--batch 14 falls back with 369 chunks with 1 unmatched outer tuple (9)
+--batch 15 falls back with 387 chunks with no unmatched outer tuples
+--batch 16 falls back with 365 chunks with no unmatched outer tuples
+--batch 17 falls back with 403 chunks with 2 unmatched outer tuples (14,19)
+--batch 18 falls back with 375 chunks with no unmatched outer tuples
+--batch 19 falls back with 384 chunks with no unmatched outer tuples
+--batch 20 falls back with 377 chunks with 1 unmatched outer tuple (12)
+--batch 22 falls back with 401 chunks with no unmatched outer tuples
+--batch 23 falls back with 396 chunks with no unmatched outer tuples
+--batch 24 falls back with 387 chunks with 1 unmatched outer tuple (5)
+--batch 25 falls back with 399 chunks with 1 unmatched outer tuple (7)
+--batch 26 falls back with 387 chunks.
+--batch 27 falls back with 442 chunks.
+--batch 28 falls back with 385 chunks with 1 unmatched outer tuple (17)
+--batch 29 falls back with 375 chunks.
+--batch 30 falls back with 404 chunks with 1 unmatched outer tuple (6)
+--batch 31 falls back with 396 chunks with 2 unmatched outer tuples (13,16)
+select * from explain_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+select count(a) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+-- used to give wrong results because there is a whole batch of outer which is
+-- empty and so the inner doesn't emit unmatched tuples with ROJ
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b; 
+
+-- Test_6 non-negligible amount of data test case
+-- TODO: doesn't finish with my code when it is set to be serial
+-- it does finish when it is parallel -- the serial version is either simply too
+-- slow or has a bug -- I tried it with less data and it did finish, so it must
+-- just be really slow
+-- inner join shouldn't even need to make the unmatched files
+-- it finishes eventually if I decrease data amount
+
+--drop table simple;
+--create table simple as
+ -- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+--alter table simple set (parallel_workers = 2);
+--analyze simple;
+--
+--drop table extremely_skewed;
+--create table extremely_skewed (id int, t text);
+--alter table extremely_skewed set (autovacuum_enabled = 'false');
+--alter table extremely_skewed set (parallel_workers = 2);
+--analyze extremely_skewed;
+--insert into extremely_skewed
+--  select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+--  from generate_series(1, 20000);
+--update pg_class
+--  set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+--  where relname = 'extremely_skewed';
+
+--set work_mem=64;
+--set enable_mergejoin to off;
+--explain (analyze, costs off, timing off)
+  --select * from simple r join extremely_skewed s using (id);
+--select * from explain_multi_batch();
+
+drop table t1;
+drop table t2;
+drop function explain_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema adaptive_hj;
diff --git a/src/test/regress/sql/parallel_adaptive_hj.sql b/src/test/regress/sql/parallel_adaptive_hj.sql
new file mode 100644
index 0000000000..3071c5f82e
--- /dev/null
+++ b/src/test/regress/sql/parallel_adaptive_hj.sql
@@ -0,0 +1,182 @@
+create schema parallel_adaptive_hj;
+set search_path=parallel_adaptive_hj;
+
+-- TODO: anti-semi-join and semi-join tests
+
+-- TODO: check if test2 and 3 are different at all
+
+-- TODO: add test for parallel-oblivious parallel hash join
+
+-- TODO: make this function general
+create or replace function explain_parallel_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- parallel setup
+set enable_nestloop to off;
+set enable_mergejoin to off;
+set  min_parallel_table_scan_size = 0;
+set  parallel_setup_cost = 0;
+set  enable_parallel_hash = on;
+set  enable_hashjoin = on;
+set  max_parallel_workers_per_gather = 1;
+set  work_mem = 64;
+
+-- Parallel_Test_1 setup
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+analyze t1;
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,2500)i;
+insert into t2 select 2 from generate_series(1,10)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+-- Parallel_Test_1.1
+-- spills in 4 batches
+-- 1 resize of nbatches
+-- no batch falls back
+select * from explain_parallel_multi_batch();
+-- need an aggregate to exercise the code but still want to know if we are
+-- emitting the right unmatched outer tuples
+select count(a) from t1 left outer join t2 on a = b;
+select count(*) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_1.1.a
+-- results checking for inner join
+-- doesn't fall back
+select count(*) from t1, t2 where a = b;
+-- Parallel_Test_1.1.b
+-- results checking for right outer join
+-- doesn't exercise the fallback code but just checking results
+select count(*) from t1 right outer join t2 on a = b;
+-- Parallel_Test_1.1.c
+-- results checking for full outer join
+select count(*) from t1 full outer join t2 on a = b;
+
+-- Parallel_Test_1.2
+-- spill and doesn't have to resize nbatches
+analyze t2;
+select * from explain_parallel_multi_batch();
+select count(a) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_1.3
+-- doesn't spill
+-- does resize nbuckets
+set work_mem = '4MB';
+select * from explain_parallel_multi_batch();
+select count(a) from t1 left outer join t2 on a = b;
+set work_mem = 64;
+
+
+-- Parallel_Test_3
+-- big example
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+
+-- TODO: check what each of these is exercising -- chunk num, etc and write that
+-- down
+-- also, note that this example did reveal with ROJ that it wasn't working, so
+-- maybe keep that but it is not parallel
+-- make sure the plans make sense for the code we are writing
+select count(*) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b;
+
+-- Parallel_Test_4
+-- spill and resize nbatches 2x
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,1000)i;
+insert into t2 select 2 from generate_series(1,4000)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+insert into t1 values(500);
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b;
+select count(a) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_5
+-- revealed race condition because two workers are working on a chunked batch
+-- only 2 unmatched tuples
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i%1111 from generate_series(200,10000)i;
+delete from t2 where b = 115;
+delete from t2 where b = 200;
+insert into t2 select 2 from generate_series(1,4000);
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 values(115);
+insert into t1 values(200);
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+
+-- without count(*), can't reproduce desired plan so can't rely on results
+select count(*) from t1 left outer join t2 on a = b;
+
+drop table if exists t1;
+drop table if exists t2;
+drop function explain_parallel_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema parallel_adaptive_hj;
-- 
2.20.1 (Apple Git-117)

#43

Thomas Munro

thomas.munro@gmail.com

about 6 years ago

In reply to: Melanie Plageman (#42)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Mon, Dec 30, 2019 at 4:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

So, I finally have a prototype to share of parallel hashloop fallback.

Hi Melanie,

Thanks for all your continued work on this! I started looking at it
today; it's a difficult project and I think it'll take me a while to
grok. I do have some early comments though:

* I am uneasy about BarrierArriveExplicitAndWait() (a variant of
BarrierArriveAndWait() that lets you skip directly to a given phase?);
perhaps you only needed that for a circular phase system, which you
could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE? I
tried to make the barrier interfaces look like the libraries in other
parallel programming environments, and I'd be worried that the
explicit phase thing could easily lead to bugs.
* It seems a bit strange to have "outer_match_status_file" in
SharedTupleStore; something's gone awry layering-wise there.
* I'm not sure it's OK to wait at the end of each loop, as described
in the commit message:

Workers probing a fallback batch will wait until all workers have
finished probing before moving on so that an elected worker can read
and combine the outer match status files into a single bitmap and use
it to emit unmatched outer tuples after all chunks of the inner side
have been processed.

Maybe I misunderstood completely, but that seems to break the
programming rule described in nodeHashjoin.c's comment beginning "To
avoid deadlocks, ...". To recap: (1) When you emit a tuple, the
program counter escapes to some other node, and maybe that other node
waits for thee, (2) Maybe the leader is waiting for you but you're
waiting for it to drain its queue so you can emit a tuple (I learned a
proper name for this: "flow control deadlock"). That's why the
current code only ever detaches (a non-waiting operation) after it's
begun emitting tuples (that is, the probing phase). It just moves
onto another batch. That's not a solution here: you can't simply move
to another loop, loops are not independent of each other like batches.
It's possible that barriers are not the right tool for this part of
the problem, or that there is a way to use a barrier that you don't
remain attached to while emitting, or that we should remove the
deadlock risks another way entirely[1]/messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com but I'm not sure. Furthermore,
the new code in ExecParallelHashJoinNewBatch() appears to break the
rule even in the non-looping case (it calls BarrierArriveAndWait() in
ExecParallelHashJoinNewBatch(), where the existing code just
detaches).

This patch does contain refactoring of nodeHashjoin.

I have split the Parallel HashJoin and Serial HashJoin state machines
up, as they were diverging in my patch to a point that made for a
really cluttered ExecHashJoinImpl() (ExecHashJoinImpl() is now gone).

Hmm. I'm rather keen on extending that technique further: I'd like
there to be more configuration points in the form of parameters to
that function, so that we write the algorithm just once but we
generate a bunch of specialised variants that are the best possible
machine code for each combination of parameters via constant-folding
using the "always inline" trick (steampunk C++ function templates).
My motivations for wanting to do that are: supporting different hash
sizes (CF commit e69d6445), removing branches for unused optimisations
(eg skew), and inlining common hash functions. That isn't to say we
couldn't have two different templatoid functions from which many
others are specialised, but I feel like that's going to lead to a lot
of duplication.

The reason I didn't do this refactoring in one patch and then put the
adaptive hashjoin code on top of it is that I might like to make
Parallel HashJoin and Serial HashJoin different nodes.

I think that has been discussed elsewhere and was looking to
understand the rationale for keeping them in the same node.

Well, there is a discussion about getting rid of the Hash node, since
it's so tightly coupled with Hash Join that it might as well not exist
as a separate entity. (Incidentally, I noticed in someone's blog that
MySQL now shows Hash separately in its PostgreSQL-style EXPLAIN
output; now we'll remove it, CF the Dr Seuss story about the
Sneetches). But as for Parallel Hash Join vs [Serial] Hash Join, I
think it makes sense to use the same node because they are
substantially the same thing, with optional extra magic, and I think
it's our job to figure out how to write code in a style that makes the
differences maintainable. That fits into a general pattern that
"Parallel" is a mode, not a different node. On the other hand, PHJ is
by far the most different from the original code, compared to things
like Parallel Sequential Scan etc. FWIW I think we're probably in
relatively new territory here: as far as I know, other traditional
RDBMSs didn't really seem to have a concept like parallel-aware
executor nodes, because they tended to be based on partitioning, so
that the operators are all oblivious to parallelism and don't have to
share/coordinate anything at this level. It seems that everyone is
now coming around to the view that shared hash table hash joins are a
good idea now that we have so many cores connected up to shared
memory. Curiously, judging from another blog article I saw, on the
surface it looks like Oracle's brand new HASH JOIN SHARED is a
different operator than HASH JOIN (just an observation, I could be way
off and I don't know or want to know how that's done under the covers
in that system).

- number of batches is not deterministic from run-to-run

Yeah, I had a lot of fun with that sort of thing on the build farm
when PHJ was first committed, and the effects were different on
systems I don't have access to that have different sizeof() for
certain types.

- Rename "chunk" (as in chunks of inner side) to something that is
not already used in the context of memory chunks and, more
importantly, SharedTuplestoreChunk

+1. Fragments? Loops? Blocks (from
https://en.wikipedia.org/wiki/Block_nested_loop, though, no, strike
that, blocks are also super overloaded).

[1]: /messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com

#44

Melanie Plageman

melanieplageman@gmail.com

about 6 years ago

In reply to: Thomas Munro (#43)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jan 7, 2020 at 4:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:

* I am uneasy about BarrierArriveExplicitAndWait() (a variant of
BarrierArriveAndWait() that lets you skip directly to a given phase?);
perhaps you only needed that for a circular phase system, which you
could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE? I
tried to make the barrier interfaces look like the libraries in other
parallel programming environments, and I'd be worried that the
explicit phase thing could easily lead to bugs.

So, I actually use it to circle back up to the first phase while
skipping the last phase.
So I couldn't do it with modular phase numbers and a loop.
The last phase detaches from the chunk barrier. I don't want to detach
from the chunk barrier if there are more chunks.
I basically need a way to only attach to the chunk barrier at the
begininng of the first chunk and only detach at the end of the last
chunk--not in between chunks. I will return from the function and
re-enter between chunks -- say between chunk 2 and chunk 3 of 5.

However, could this be solved by having more than one chunk
barrier?
A worker would attach to one chunk barrier and then when it moves to
the next chunk it would attach to the other chunk barrier and then
switch back when it switches to the next chunk. Then it could detach
and attach each time it enters/leaves the function.

* I'm not sure it's OK to wait at the end of each loop, as described
in the commit message:

Workers probing a fallback batch will wait until all workers have
finished probing before moving on so that an elected worker can read
and combine the outer match status files into a single bitmap and use
it to emit unmatched outer tuples after all chunks of the inner side
have been processed.

Maybe I misunderstood completely, but that seems to break the
programming rule described in nodeHashjoin.c's comment beginning "To
avoid deadlocks, ...". To recap: (1) When you emit a tuple, the
program counter escapes to some other node, and maybe that other node
waits for thee, (2) Maybe the leader is waiting for you but you're
waiting for it to drain its queue so you can emit a tuple (I learned a
proper name for this: "flow control deadlock"). That's why the
current code only ever detaches (a non-waiting operation) after it's
begun emitting tuples (that is, the probing phase). It just moves
onto another batch. That's not a solution here: you can't simply move
to another loop, loops are not independent of each other like batches.
It's possible that barriers are not the right tool for this part of
the problem, or that there is a way to use a barrier that you don't
remain attached to while emitting, or that we should remove the
deadlock risks another way entirely[1] but I'm not sure. Furthermore,
the new code in ExecParallelHashJoinNewBatch() appears to break the
rule even in the non-looping case (it calls BarrierArriveAndWait() in
ExecParallelHashJoinNewBatch(), where the existing code just
detaches).

Yea, I think I'm totally breaking that rule.
Just to make sure I understand the way in which I am breaking that
rule:

In my patch, while attached to a chunk_barrier, worker1 emits a
matched tuple (control leaves the current node). Meanwhile, worker2
has finished probing the chunk and is waiting on the chunk_barrier for
worker1.
How though could worker1 be waiting for worker2?

Is this only a problem when one of the barrier participants is the
leader and is reading from the tuple queue? (reading your tuple queue
deadlock hazard example in the thread [1]/messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com you referred to).
Basically is my deadlock hazard a tuple queue deadlock hazard?

I thought maybe this could be a problem with nested HJ nodes, but I'm
not sure.

As I understand it, this isn't a problem with current master with
batch barriers because while attached to a batch_barrier, a worker can
emit tuples. No other workers will wait on the batch barrier once they
have started probing.

I need to think more about the suggestions you provided in [1]/messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com about
nixing the tuple queue deadlock hazard.

However, hypothetically, if we decide we don't want to break the no
emitting tuples while attached to a barrier rule, how can we still
allow workers to coordinate while probing chunks of the batch
sequentially (1 chunk at a time)?

I could think of two options (both sound slow and bad):

Option 1:
Stash away the matched tuples in a tuplestore and emit them at the end
of the batch (incurring more writes).

Option 2:
Degenerate to 1 worker for fallback batches

Any other ideas?

- Rename "chunk" (as in chunks of inner side) to something that is
not already used in the context of memory chunks and, more
importantly, SharedTuplestoreChunk

+1. Fragments? Loops? Blocks (from
https://en.wikipedia.org/wiki/Block_nested_loop, though, no, strike
that, blocks are also super overloaded).

Hmmm. I think loop is kinda confusing. "fragment" has potential.
I also thought of "piece". That is actually where I am leaning now.
What do you think?

[1]: /messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com
/messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com

--
Melanie Plageman

#45

Melanie Plageman

melanieplageman@gmail.com

almost 6 years ago

In reply to: Thomas Munro (#43)

4 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jan 7, 2020 at 4:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:

* I am uneasy about BarrierArriveExplicitAndWait() (a variant of
BarrierArriveAndWait() that lets you skip directly to a given phase?);
perhaps you only needed that for a circular phase system, which you
could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE? I
tried to make the barrier interfaces look like the libraries in other
parallel programming environments, and I'd be worried that the
explicit phase thing could easily lead to bugs.

BarrierArriveExplicitAndWait() is gone now due to the refactor to
address the barrier waiting deadlock hazard (mentioned below).

* It seems a bit strange to have "outer_match_status_file" in
SharedTupleStore; something's gone awry layering-wise there.

outer_match_status_file is now out of the SharedTuplestore. Jesse
Zhang and I worked on a new API, SharedBits, for workers to
collaboratively make a bitmap and then used it for the outer match
status file and the combined bitmap file
(v4-0004-Add-SharedBits-API.patch).

The SharedBits API is modeled closely after the SharedTuplestore API.
It uses a control object in shared memory to synchronize access to
some files in a SharedFileset and maintains some participant-specific
shared state. The big difference (other than that the files are for
bitmaps and not tuples) is that each backend writes to its file in one
phase and a single backend reads from all of the files and combines
them in another phase.
In other words, it supports parallel write but not parallel scan (and
not concurrent read/write). This could definitely be modified in the
future.

Also, the SharedBits uses a SharedFileset which uses BufFiles. This is
not the ideal API for the bitmap. The access pattern is small sequential
writes and random reads. It would also be nice to maintain the fixed
size buffer but have an API that let us write an arbitrary number of
bytes to it in bufsize chunks without incurring additional function call
overhead.

* I'm not sure it's OK to wait at the end of each loop, as described
in the commit message:

Workers probing a fallback batch will wait until all workers have
finished probing before moving on so that an elected worker can read
and combine the outer match status files into a single bitmap and use
it to emit unmatched outer tuples after all chunks of the inner side
have been processed.

Maybe I misunderstood completely, but that seems to break the
programming rule described in nodeHashjoin.c's comment beginning "To
avoid deadlocks, ...". To recap: (1) When you emit a tuple, the
program counter escapes to some other node, and maybe that other node
waits for thee, (2) Maybe the leader is waiting for you but you're
waiting for it to drain its queue so you can emit a tuple (I learned a
proper name for this: "flow control deadlock"). That's why the
current code only ever detaches (a non-waiting operation) after it's
begun emitting tuples (that is, the probing phase). It just moves
onto another batch. That's not a solution here: you can't simply move
to another loop, loops are not independent of each other like batches.
It's possible that barriers are not the right tool for this part of
the problem, or that there is a way to use a barrier that you don't
remain attached to while emitting, or that we should remove the
deadlock risks another way entirely[1] but I'm not sure. Furthermore,
the new code in ExecParallelHashJoinNewBatch() appears to break the
rule even in the non-looping case (it calls BarrierArriveAndWait() in
ExecParallelHashJoinNewBatch(), where the existing code just
detaches).

So, after a more careful reading of the parallel full hashjoin email
[1]: /messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com
nodeHashJoin.c.
I do have some questions about the potential solutions mentioned in
that thread, however, I'll pose those over there.

For adaptive hashjoin, for now, the options for addressing the barrier
wait hazard that Jesse and I came up with based on the PFHJ thread are:
- leader doesn't participate in fallback batches (has the downside of
reduced parallelism and needing special casing when it ends up being
the only worker because other workers get used for something else
[like autovaccuum])
- use some kind of spool to avoid deadlock
- the original solution I proposed in which all workers detach from
the batch barrier (instead of waiting)

I revisited the original solution I proposed and realized that I had
not implemented it as advertised. By reverting to the original
design, I can skirt the issue for now.

In the original solution I suggested, I mentioned all workers would
detach from the batch barrier and the last to detach would combine the
bitmaps. That was not what I actually implemented (my patch had all
the workers wait on the barrier).

I've changed to actually doing this--which addresses some of the
potential deadlock hazard.

The two deadlock waits causing the deadlock hazard were waiting on the
chunk barrier and waiting on the batch barrier. In order to fully
address the deadlock hazard, Jesse and I came up with the following
solution (in v4-0003-Address-barrier-wait-deadlock-hazard.patch in the
attached patchset) to each:

chunk barrier wait:
- instead of waiting on the chunk barrier when it is not in its final
state and then reusing it and jumping back to the initial state,
initialize an array of chunk barriers, one per chunk, and, workers
only wait on a chunk barrier when it is in its final state. The last
worker to arrive will increment the chunk number. All workers detach
from the chunk barrier they are attached to and select the next
chunk barrier

Jesse brought up that there isn't a safe time to reinitialize the
chunk barrier, so reusing it doesn't seem like a good idea.

batch barrier wait:
- In order to mitigate the other cause of deadlock hazard (workers
wait on the batch barrier after emitting tuples), now, in
ExecParallelHashJoinNewBatch(), if we are attached to a batch
barrier and it is a fallback batch, all workers will detach from the
batch barrier and then end their scan of that batch. The last
worker to detach will combine the outer match status files, then it
will detach from the batch, clean up the hashtable, and end its scan
of the inner side. Then it will return and proceed to emit
unmatched outer tuples.

This patch does contain refactoring of nodeHashjoin.

I have split the Parallel HashJoin and Serial HashJoin state machines
up, as they were diverging in my patch to a point that made for a
really cluttered ExecHashJoinImpl() (ExecHashJoinImpl() is now gone).

Hmm. I'm rather keen on extending that technique further: I'd like
there to be more configuration points in the form of parameters to
that function, so that we write the algorithm just once but we
generate a bunch of specialised variants that are the best possible
machine code for each combination of parameters via constant-folding
using the "always inline" trick (steampunk C++ function templates).
My motivations for wanting to do that are: supporting different hash
sizes (CF commit e69d6445), removing branches for unused optimisations
(eg skew), and inlining common hash functions. That isn't to say we
couldn't have two different templatoid functions from which many
others are specialised, but I feel like that's going to lead to a lot
of duplication.

I'm okay with using templating. For now, while I am addressing large
TODO items with the patchset, I will keep them as separate functions.
Once it is in a better state, I will look at the overlap and explore
templating. The caveat here is if a lot of new commits start going
into nodeHashjoin.c and keeping this long-running branch rebased gets
painful.

The patchset has also been run through pg_indent, so,
v4-0001-Implement-Adaptive-Hashjoin.patch will look a bit different
than v3-0001-hashloop-fallback.patch, but, it is the same content.
v4-0002-Fixup-tupleMetadata-struct-issues.patch is just some other
fixups and small cosmetic changes.

The new big TODOs is to make a file type that suits the SharedBits API
better--but I don't want to do that unless the idea is validated.

[1]: /messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com
/messages/by-id/CA+hUKG+A6ftXPz4oe92+x8Er+xpGZqto70-Q_ERwRaSyA=afNg@mail.gmail.com

Attachments:

v4-0002-Fixup-tupleMetadata-struct-issues.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Fixup-tupleMetadata-struct-issues.patchDownload

From 737317370de0b41883551bf1d470f5d647d6117b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 7 Jan 2020 16:28:32 -0800
Subject: [PATCH v4 2/4] Fixup tupleMetadata struct issues

Remove __attribute__((packed)) from tupleMetadata. It is not needed
since I am using sizeof(struct tupleMetadata).

Change tupleMetadata members to include a union with an anonymous union
containing tupleid/chunk number.
tupleMetadata's tupleid member will be the tupleid in the outer side and
the chunk number in the inner side. Use a union for this since they will
be different types. Also, fix the signedness and type issues in code
using it. For now, this uses a 32bit int for tuples as I use an atomic
and 64bit atomic operations are not supported on all architecture/OS
combinations. It remains a TODO to make this variable backend local and
combine it to reduce the amount of synchronization needed.
Additionally, the tupleid/chunk number member should not be included for
non-fallback batches, as it bloats the tuplestore.

Also, this patch contains assorted updates to variable names/TODOs.
---
 src/backend/executor/adaptiveHashjoin.c   | 10 +++----
 src/backend/executor/nodeHash.c           | 30 +++++++++++++++------
 src/backend/executor/nodeHashjoin.c       | 25 ++++++++++-------
 src/backend/utils/sort/sharedtuplestore.c | 33 ++++++++++++-----------
 src/include/executor/hashjoin.h           |  4 +--
 src/include/utils/sharedtuplestore.h      | 16 +++++------
 6 files changed, 70 insertions(+), 48 deletions(-)

diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
index dff5b38d38f8..64af2a24f346 100644
--- a/src/backend/executor/adaptiveHashjoin.c
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -51,7 +51,7 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 		 */
 		if (BarrierArriveAndWait(chunk_barrier,
 								 WAIT_EVENT_HASH_CHUNK_PROBING))
-			phj_batch->current_chunk_num++;
+			phj_batch->current_chunk++;
 
 		/* Once the barrier is advanced we'll be in the DONE phase */
 	}
@@ -68,7 +68,7 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 	{
 			/*
 			 * TODO: remove this phase and coordinate access to hashtable
-			 * above goto and after incrementing current_chunk_num
+			 * above goto and after incrementing current_chunk
 			 */
 		case PHJ_CHUNK_ELECTING:
 	phj_chunk_electing:
@@ -85,7 +85,7 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 
 			while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
 			{
-				if (metadata.tupleid != phj_batch->current_chunk_num)
+				if (metadata.chunk != phj_batch->current_chunk)
 					continue;
 
 				ExecForceStoreMinimalTuple(tuple,
@@ -110,7 +110,7 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 
 			BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_DONE);
 
-			if (phj_batch->current_chunk_num > phj_batch->total_num_chunks)
+			if (phj_batch->current_chunk > phj_batch->total_chunks)
 			{
 				BarrierDetach(chunk_barrier);
 				return false;
@@ -276,7 +276,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 						&hashtable->batches[batchno].shared->chunk_barrier;
 
 						BarrierInit(chunk_barrier, 0);
-						hashtable->batches[batchno].shared->current_chunk_num = 1;
+						hashtable->batches[batchno].shared->current_chunk = 1;
 					}
 					/* Fall through. */
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index c5420b169e6c..cb2f95ac0a76 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1362,7 +1362,7 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 				/* TODO: should I check batch estimated size here at all? */
 				if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > hashtable->parallel_state->space_allowed))
 				{
-					phj_batch->total_num_chunks++;
+					phj_batch->total_chunks++;
 					phj_batch->estimated_chunk_size = tuple_size;
 				}
 				else
@@ -1371,10 +1371,15 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 				tupleMetadata metadata;
 
 				metadata.hashvalue = hashTuple->hashvalue;
-				metadata.tupleid = phj_batch->total_num_chunks;
+				metadata.chunk = phj_batch->total_chunks;
 				LWLockRelease(&phj_batch->lock);
 
 				hashtable->batches[batchno].estimated_size += tuple_size;
+
+				/*
+				 * TODO: only put the chunk num if it is a fallback batch
+				 * (avoid bloating the metadata written to the file)
+				 */
 				sts_puttuple(hashtable->batches[batchno].inner_tuples,
 							 &metadata, tuple);
 			}
@@ -1451,14 +1456,19 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 			/* TODO: should I check batch estimated size here at all? */
 			if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
 			{
-				phj_batch->total_num_chunks++;
+				phj_batch->total_chunks++;
 				phj_batch->estimated_chunk_size = tuple_size;
 			}
 			else
 				phj_batch->estimated_chunk_size += tuple_size;
-			metadata.tupleid = phj_batch->total_num_chunks;
+			metadata.chunk = phj_batch->total_chunks;
 			LWLockRelease(&phj_batch->lock);
 			/* Store the tuple its new batch. */
+
+			/*
+			 * TODO: only put the chunk num if it is a fallback batch (avoid
+			 * bloating the metadata written to the file)
+			 */
 			sts_puttuple(hashtable->batches[batchno].inner_tuples,
 						 &metadata, tuple);
 
@@ -1821,7 +1831,7 @@ retry:
 		 */
 		if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
 		{
-			phj_batch->total_num_chunks++;
+			phj_batch->total_chunks++;
 			phj_batch->estimated_chunk_size = tuple_size;
 		}
 		else
@@ -1830,9 +1840,13 @@ retry:
 		tupleMetadata metadata;
 
 		metadata.hashvalue = hashvalue;
-		metadata.tupleid = phj_batch->total_num_chunks;
+		metadata.chunk = phj_batch->total_chunks;
 		LWLockRelease(&phj_batch->lock);
 
+		/*
+		 * TODO: only put the chunk num if it is a fallback batch (avoid
+		 * bloating the metadata written to the file)
+		 */
 		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata,
 					 tuple);
 	}
@@ -3043,8 +3057,8 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 		shared->parallel_hashloop_fallback = false;
 		LWLockInitialize(&shared->lock,
 						 LWTRANCHE_PARALLEL_HASH_JOIN_BATCH);
-		shared->current_chunk_num = 0;
-		shared->total_num_chunks = 1;
+		shared->current_chunk = 0;
+		shared->total_chunks = 1;
 		shared->estimated_chunk_size = 0;
 
 		/*
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 39a03000f8da..6a8efc0765a4 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -435,9 +435,8 @@ ExecHashJoin(PlanState *pstate)
 				{
 					/*
 					 * The current outer tuple has run out of matches, so
-					 * check whether to emit a dummy outer-join tuple.
-					 * Whether we emit one or not, the next state is
-					 * NEED_NEW_OUTER.
+					 * check whether to emit a dummy outer-join tuple. Whether
+					 * we emit one or not, the next state is NEED_NEW_OUTER.
 					 */
 					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 					if (!node->hashloop_fallback || node->hj_HashTable->curbatch == 0)
@@ -902,7 +901,7 @@ ExecParallelHashJoin(PlanState *pstate)
 
 				ParallelHashJoinBatch *phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
 
-				if (!phj_batch->parallel_hashloop_fallback || phj_batch->current_chunk_num == 1)
+				if (!phj_batch->parallel_hashloop_fallback || phj_batch->current_chunk == 1)
 					node->hj_MatchedOuter = false;
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -919,9 +918,8 @@ ExecParallelHashJoin(PlanState *pstate)
 				{
 					/*
 					 * The current outer tuple has run out of matches, so
-					 * check whether to emit a dummy outer-join tuple.
-					 * Whether we emit one or not, the next state is
-					 * NEED_NEW_OUTER.
+					 * check whether to emit a dummy outer-join tuple. Whether
+					 * we emit one or not, the next state is NEED_NEW_OUTER.
 					 */
 					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 					if (!phj_batch->parallel_hashloop_fallback)
@@ -1084,7 +1082,7 @@ ExecParallelHashJoin(PlanState *pstate)
 					if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
 						break;
 
-					int			bytenum = metadata.tupleid / 8;
+					uint32		bytenum = metadata.tupleid / 8;
 					unsigned char bit = metadata.tupleid % 8;
 					unsigned char byte_to_check = 0;
 
@@ -1477,7 +1475,7 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 		MinimalTuple tuple;
 
 		tupleMetadata metadata;
-		int			tupleid;
+		uint32		tupleid;
 
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
 									   &metadata);
@@ -1894,7 +1892,16 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 			metadata.hashvalue = hashvalue;
 			SharedTuplestoreAccessor *accessor = hashtable->batches[batchno].outer_tuples;
 
+			/*
+			 * TODO: add a comment that this means the order is not
+			 * deterministic so don't count on it
+			 */
 			metadata.tupleid = sts_increment_tuplenum(accessor);
+
+			/*
+			 * TODO: only add the tupleid when it is a fallback batch to avoid
+			 * bloating of the sharedtuplestore
+			 */
 			sts_puttuple(accessor, &metadata, mintup);
 
 			if (shouldFree)
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 3cd2ec2e2eb6..0e5e9db82034 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -57,11 +57,15 @@ typedef struct SharedTuplestoreParticipant
 } SharedTuplestoreParticipant;
 
 /* The control object that lives in shared memory. */
+/*  TODO: ntuples atomic 32 bit int is iffy. Didn't use 64bit because wasn't sure */
+/*  about 64bit atomic ints portability */
+/*  Seems like it would be possible to reduce the amount of synchronization instead */
+/*  potentially using worker number to unique-ify the tuple number */
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
 	pg_atomic_uint32 ntuples;
-			  //TODO:does this belong elsewhere
+	/* TODO:does this belong elsewhere */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -631,8 +635,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
-/*  TODO: fix signedness */
-int
+uint32
 sts_increment_tuplenum(SharedTuplestoreAccessor *accessor)
 {
 	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
@@ -719,22 +722,22 @@ sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor)
 	BufFile    *combined_bitmap_file = BufFileCreateTemp(false);
 
 	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)
-		//make it while not
-			EOF
-		{
-			unsigned char combined_byte = 0;
-
-			for (int i = 0; i < statuses_length; i++)
-			{
-				unsigned char read_byte;
+		/* make it while not */
+		EOF
+	{
+		unsigned char combined_byte = 0;
 
-				BufFileRead(statuses[i], &read_byte, 1);
-				combined_byte |= read_byte;
-			}
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
 
-			BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
 		}
 
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
 	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
 		ereport(ERROR,
 				(errcode_for_file_access(),
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3e4f4bd5747a..e5a00f84e321 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -163,8 +163,8 @@ typedef struct ParallelHashJoinBatch
 	 * and does not require a lock to read
 	 */
 	bool		parallel_hashloop_fallback;
-	int			total_num_chunks;
-	int			current_chunk_num;
+	int			total_chunks;
+	int			current_chunk;
 	size_t		estimated_chunk_size;
 	Barrier		chunk_barrier;
 	LWLock		lock;
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 6152ac163da2..8b2433e5c4b0 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -24,17 +24,15 @@ struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
 struct tupleMetadata;
 typedef struct tupleMetadata tupleMetadata;
-
-/*  TODO: conflicting types for tupleid with accessor->sts->ntuples (uint32) */
-/*  TODO: use a union for tupleid (uint32) (make this a uint64) and chunk number (int) */
 struct tupleMetadata
 {
 	uint32		hashvalue;
-	int			tupleid;		/* tuple id on outer side and chunk number for
-								 * inner side */
-}			__attribute__((packed));
-
-/*  TODO: make sure I can get rid of packed now that using sizeof(struct) */
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			chunk;		/* chunk number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -72,7 +70,7 @@ extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
 
-extern int	sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
 
 extern void sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor);
 extern void sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum);
-- 
2.25.0

v4-0004-Add-SharedBits-API.patchtext/x-patch; charset=US-ASCII; name=v4-0004-Add-SharedBits-API.patchDownload

From 8bc3a52c8bd2c94489a7f865bf366ad11642fd9b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Jan 2020 11:17:49 -0800
Subject: [PATCH v4 4/4] Add SharedBits API

Add SharedBits API--a way for workers to collaboratively make a bitmap.
The SharedBits store is currently meant for each backend to write to its
own bitmap file in one phase and for a single worker to combine all of
the bitmaps into a combined bitmap in another phase. In other words, it
supports parallel write but not parallel scan (and not concurrent
read/write). This could be modified in the future.

Also, the SharedBits uses a SharedFileset which uses BufFiles. This is
not the ideal API for the bitmap. The access pattern is small sequential
writes and random reads. It would also be nice to maintain the fixed
size buffer but have an API that let us write an arbitrary number of
bytes to it in bufsize chunks without incurring additional function call
overhead.

This commit also moves the outer match status file and combined_bitmap
into a new SharedBits store.

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
---
 src/backend/executor/adaptiveHashjoin.c   |  18 +-
 src/backend/executor/nodeHash.c           |   8 +-
 src/backend/executor/nodeHashjoin.c       | 169 ++++++-------
 src/backend/storage/file/buffile.c        |  51 ----
 src/backend/utils/sort/Makefile           |   1 +
 src/backend/utils/sort/sharedbits.c       | 276 ++++++++++++++++++++++
 src/backend/utils/sort/sharedtuplestore.c | 122 +---------
 src/include/executor/hashjoin.h           |  13 +-
 src/include/storage/buffile.h             |   1 -
 src/include/utils/sharedbits.h            |  40 ++++
 src/include/utils/sharedtuplestore.h      |   7 +-
 11 files changed, 423 insertions(+), 283 deletions(-)
 create mode 100644 src/backend/utils/sort/sharedbits.c
 create mode 100644 src/include/utils/sharedbits.h

diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
index 45846a076916..6c6e27e55e49 100644
--- a/src/backend/executor/adaptiveHashjoin.c
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -13,9 +13,6 @@
 
 #include "executor/adaptiveHashjoin.h"
 
-
-
-
 bool
 ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 {
@@ -291,10 +288,9 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 			ExecHashTableDetachBatch(hashtable);
 		}
 
-		else if (accessor->combined_bitmap != NULL)
+		else if (sb_combined_exists(accessor->sba))
 		{
-			BufFileClose(accessor->combined_bitmap);
-			accessor->combined_bitmap = NULL;
+			sb_end_read(accessor->sba);
 			accessor->done = true;
 
 			/*
@@ -308,7 +304,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 
 		else
 		{
-			sts_close_outer_match_status_file(accessor->outer_tuples);
+			sb_end_write(accessor->sba);
 
 			/*
 			 * If all workers (including this one) have finished probing the
@@ -329,7 +325,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 				 * reach here. This worker must do some final cleanup and then
 				 * detach from the batch
 				 */
-				accessor->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				sb_combine(accessor->sba);
 				ExecHashTableLoopDetachBatchForChosen(hashtable);
 				hjstate->last_worker = true;
 				return true;
@@ -410,7 +406,11 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * to by this worker and readable by any worker
 					 */
 					if (hashtable->batches[batchno].shared->parallel_hashloop_fallback)
-						sts_make_outer_match_status_file(hashtable->batches[batchno].outer_tuples);
+					{
+						ParallelHashJoinBatchAccessor *accessor = hashtable->batches + hashtable->curbatch;
+
+						sb_initialize_accessor(accessor->sba, sts_get_tuplenum(accessor->outer_tuples));
+					}
 
 					return true;
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index afdc31a3b30c..51050ce47edb 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -3052,7 +3052,9 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
 
 		shared->parallel_hashloop_fallback = false;
 		LWLockInitialize(&shared->lock,
@@ -3098,6 +3100,9 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->sbfileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3169,11 +3174,11 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
 		accessor->done = false;
-		accessor->combined_bitmap = NULL;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
@@ -3183,6 +3188,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->sbfileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index a454cba54543..e282fb368ce7 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -908,86 +908,91 @@ ExecParallelHashJoin(PlanState *pstate)
 				/* FALL THRU */
 
 			case HJ_SCAN_BUCKET:
-
-				/*
-				 * Scan the selected hash bucket for matches to current outer
-				 */
-				phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
-
-				if (!ExecParallelScanHashBucket(node, econtext))
 				{
 					/*
-					 * The current outer tuple has run out of matches, so
-					 * check whether to emit a dummy outer-join tuple. Whether
-					 * we emit one or not, the next state is NEED_NEW_OUTER.
+					 * Scan the selected hash bucket for matches to current
+					 * outer
 					 */
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
-					if (!phj_batch->parallel_hashloop_fallback)
+					ParallelHashJoinBatchAccessor *accessor =
+					&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+					phj_batch = accessor->shared;
+
+					if (!ExecParallelScanHashBucket(node, econtext))
 					{
-						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+						/*
+						 * The current outer tuple has run out of matches, so
+						 * check whether to emit a dummy outer-join tuple.
+						 * Whether we emit one or not, the next state is
+						 * NEED_NEW_OUTER.
+						 */
+						node->hj_JoinState = HJ_NEED_NEW_OUTER;
+						if (!phj_batch->parallel_hashloop_fallback)
+						{
+							TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
 
-						if (slot != NULL)
-							return slot;
+							if (slot != NULL)
+								return slot;
+						}
+						continue;
 					}
-					continue;
-				}
 
-				/*
-				 * We've got a match, but still need to test non-hashed quals.
-				 * ExecScanHashBucket already set up all the state needed to
-				 * call ExecQual.
-				 *
-				 * If we pass the qual, then save state for next call and have
-				 * ExecProject form the projection, store it in the tuple
-				 * table, and return the slot.
-				 *
-				 * Only the joinquals determine tuple match status, but all
-				 * quals must pass to actually return the tuple.
-				 */
-				if (joinqual != NULL && !ExecQual(joinqual, econtext))
-				{
-					InstrCountFiltered1(node, 1);
-					break;
-				}
+					/*
+					 * We've got a match, but still need to test non-hashed
+					 * quals. ExecScanHashBucket already set up all the state
+					 * needed to call ExecQual.
+					 *
+					 * If we pass the qual, then save state for next call and
+					 * have ExecProject form the projection, store it in the
+					 * tuple table, and return the slot.
+					 *
+					 * Only the joinquals determine tuple match status, but
+					 * all quals must pass to actually return the tuple.
+					 */
+					if (joinqual != NULL && !ExecQual(joinqual, econtext))
+					{
+						InstrCountFiltered1(node, 1);
+						break;
+					}
 
-				node->hj_MatchedOuter = true;
-				HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+					node->hj_MatchedOuter = true;
+					HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
 
-				/*
-				 * TODO: how does this interact with PAHJ -- do I need to set
-				 * matchbit?
-				 */
-				/* In an antijoin, we never return a matched tuple */
-				if (node->js.jointype == JOIN_ANTI)
-				{
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
-					continue;
-				}
+					/*
+					 * TODO: how does this interact with PAHJ -- do I need to
+					 * set matchbit?
+					 */
+					/* In an antijoin, we never return a matched tuple */
+					if (node->js.jointype == JOIN_ANTI)
+					{
+						node->hj_JoinState = HJ_NEED_NEW_OUTER;
+						continue;
+					}
 
-				/*
-				 * If we only need to join to the first matching inner tuple,
-				 * then consider returning this one, but after that continue
-				 * with next outer tuple.
-				 */
-				if (node->js.single_match)
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					/*
+					 * If we only need to join to the first matching inner
+					 * tuple, then consider returning this one, but after that
+					 * continue with next outer tuple.
+					 */
+					if (node->js.single_match)
+						node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
-				/*
-				 * Set the match bit for this outer tuple in the match status
-				 * file
-				 */
-				if (phj_batch->parallel_hashloop_fallback)
-				{
-					sts_set_outer_match_status(hashtable->batches[hashtable->curbatch].outer_tuples,
-											   econtext->ecxt_outertuple->tuplenum);
+					/*
+					 * Set the match bit for this outer tuple in the match
+					 * status file
+					 */
+					if (phj_batch->parallel_hashloop_fallback)
+					{
+						sb_setbit(accessor->sba,
+								  econtext->ecxt_outertuple->tuplenum);
 
+					}
+					if (otherqual == NULL || ExecQual(otherqual, econtext))
+						return ExecProject(node->js.ps.ps_ProjInfo);
+					else
+						InstrCountFiltered2(node, 1);
+					break;
 				}
-				if (otherqual == NULL || ExecQual(otherqual, econtext))
-					return ExecProject(node->js.ps.ps_ProjInfo);
-				else
-					InstrCountFiltered2(node, 1);
-				break;
-
 			case HJ_FILL_INNER_TUPLES:
 
 				/*
@@ -1072,8 +1077,6 @@ ExecParallelHashJoin(PlanState *pstate)
 					ParallelHashJoinBatchAccessor *batch_accessor =
 					&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
 
-					Assert(batch_accessor->combined_bitmap != NULL);
-
 					/*
 					 * TODO: there should be a way to know the current batch
 					 * for the purposes of getting
@@ -1092,33 +1095,10 @@ ExecParallelHashJoin(PlanState *pstate)
 					{
 						tupleMetadata metadata;
 
-						if ((tuple =
-							 sts_parallel_scan_next(outer_acc, &metadata)) ==
-							NULL)
+						if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
 							break;
 
-						uint32		bytenum = metadata.tupleid / 8;
-						unsigned char bit = metadata.tupleid % 8;
-						unsigned char byte_to_check = 0;
-
-						/* seek to byte to check */
-						if (BufFileSeek(batch_accessor->combined_bitmap,
-										0,
-										bytenum,
-										SEEK_SET))
-							ereport(ERROR,
-									(errcode_for_file_access(),
-									 errmsg(
-											"could not rewind shared outer temporary file: %m")));
-						/* read byte containing ntuple bit */
-						if (BufFileRead(batch_accessor->combined_bitmap, &byte_to_check, 1) ==
-							0)
-							ereport(ERROR,
-									(errcode_for_file_access(),
-									 errmsg(
-											"could not read byte in outer match status bitmap: %m.")));
-						/* if bit is set */
-						bool		match = ((byte_to_check) >> bit) & 1;
+						bool		match = sb_checkbit(batch_accessor->sba, metadata.tupleid);
 
 						if (!match)
 							break;
@@ -1990,6 +1970,7 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 
 	/* Set up the space we'll use for shared temporary files. */
 	SharedFileSetInit(&pstate->fileset, pcxt->seg);
+	SharedFileSetInit(&pstate->sbfileset, pcxt->seg);
 
 	/* Initialize the shared state in the hash node. */
 	hashNode = (HashState *) innerPlanState(state);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index cb49329d3fb1..f0e920b41618 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -269,57 +269,6 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
 	return file;
 }
 
-/*
- * Open a shared file created by any backend if it exists, otherwise return NULL
- */
-BufFile *
-BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name)
-{
-	BufFile    *file;
-	char		segment_name[MAXPGPATH];
-	Size		capacity = 16;
-	File	   *files;
-	int			nfiles = 0;
-
-	files = palloc(sizeof(File) * capacity);
-
-	/*
-	 * We don't know how many segments there are, so we'll probe the
-	 * filesystem to find out.
-	 */
-	for (;;)
-	{
-		/* See if we need to expand our file segment array. */
-		if (nfiles + 1 > capacity)
-		{
-			capacity *= 2;
-			files = repalloc(files, sizeof(File) * capacity);
-		}
-		/* Try to load a segment. */
-		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
-		if (files[nfiles] <= 0)
-			break;
-		++nfiles;
-
-		CHECK_FOR_INTERRUPTS();
-	}
-
-	/*
-	 * If we didn't find any files at all, then no BufFile exists with this
-	 * name.
-	 */
-	if (nfiles == 0)
-		return NULL;
-	file = makeBufFileCommon(nfiles);
-	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
-	file->fileset = fileset;
-	file->name = pstrdup(name);
-
-	return file;
-}
-
 /*
  * Open a file that was previously created in another backend (or this one)
  * with BufFileCreateShared in the same SharedFileSet using the same name.
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261e3..f11fe85aeb31 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	logtape.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedbits.c b/src/backend/utils/sort/sharedbits.c
new file mode 100644
index 000000000000..9d04d6b23661
--- /dev/null
+++ b/src/backend/utils/sort/sharedbits.c
@@ -0,0 +1,276 @@
+#include "postgres.h"
+#include "storage/buffile.h"
+#include "utils/sharedbits.h"
+
+/*  TODO: put a comment about not currently supporting parallel scan of the SharedBits */
+
+/* Per-participant shared state */
+struct SharedBitsParticipant
+{
+	bool		present;
+	bool		writing;
+};
+
+/* Shared control object */
+struct SharedBits
+{
+	int			nparticipants;	/* Number of participants that can write. */
+	int64		nbits;
+	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
+
+	SharedBitsParticipant participants[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* backend-local state */
+struct SharedBitsAccessor
+{
+	int			participant;
+	SharedBits *bits;
+	SharedFileSet *fileset;
+	BufFile    *write_file;
+	BufFile    *combined;
+};
+
+SharedBitsAccessor *
+sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset)
+{
+	SharedBitsAccessor *accessor = palloc0(sizeof(SharedBitsAccessor));
+
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+SharedBitsAccessor *
+sb_initialize(SharedBits *sbits,
+			  int participants,
+			  int my_participant_number,
+			  SharedFileSet *fileset,
+			  char *name)
+{
+	SharedBitsAccessor *accessor;
+
+	sbits->nparticipants = participants;
+	strcpy(sbits->name, name);
+	sbits->nbits = 0;			/* TODO: maybe delete this */
+
+	accessor = palloc0(sizeof(SharedBitsAccessor));
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+/*  TODO: is "initialize_accessor" a clear enough API for this? (making the file)? */
+void
+sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits)
+{
+	char		name[MAXPGPATH];
+
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, accessor->participant);
+
+	accessor->write_file =
+		BufFileCreateShared(accessor->fileset, name);
+
+	accessor->bits->participants[accessor->participant].present = true;
+	/* TODO: check this math. tuplenumber will be too high. */
+	uint32		num_to_write = nbits / 8 + 1;
+
+	/*
+	 * TODO: add tests that could exercise a problem with junk being written
+	 * to bitmap
+	 */
+
+	/*
+	 * TODO: is there a better way to write the bytes to the file without
+	 * calling
+	 */
+
+	/*
+	 * BufFileWrite() like this? palloc()ing an undetermined number of bytes
+	 * feels
+	 */
+
+	/*
+	 * like it is against the spirit of this patch to begin with, but the many
+	 * function
+	 */
+	/* calls seem expensive */
+	for (int i = 0; i < num_to_write; i++)
+	{
+		unsigned char byteToWrite = 0;
+
+		BufFileWrite(accessor->write_file, &byteToWrite, 1);
+	}
+
+	if (BufFileSeek(accessor->write_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+size_t
+sb_estimate(int participants)
+{
+	return offsetof(SharedBits, participants) + participants * sizeof(SharedBitsParticipant);
+}
+
+
+void
+sb_setbit(SharedBitsAccessor *accessor, uint64 bit)
+{
+	Assert(accessor->write_file);
+	SharedBitsParticipant *const participant =
+	&accessor->bits->participants[accessor->participant];
+
+	if (!participant->writing)
+		participant->writing = true;
+	unsigned char current_outer_byte;
+
+	BufFileSeek(accessor->write_file, 0, bit / 8, SEEK_SET);
+	BufFileRead(accessor->write_file, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (bit % 8);
+
+	/* TODO: don't seek back one but instead seek explicitly to that byte */
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_CUR);
+	BufFileWrite(accessor->write_file, &current_outer_byte, 1);
+}
+
+bool
+sb_checkbit(SharedBitsAccessor *accessor, uint32 n)
+{
+	Assert(accessor->combined);
+	uint32		bytenum = n / 8;
+	unsigned char bit = n % 8;
+	unsigned char byte_to_check = 0;
+
+	/* seek to byte to check */
+	if (BufFileSeek(accessor->combined,
+					0,
+					bytenum,
+					SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not rewind shared outer temporary file: %m")));
+	/* read byte containing ntuple bit */
+	if (BufFileRead(accessor->combined, &byte_to_check, 1) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	/* if bit is set */
+	bool		match = ((byte_to_check) >> bit) & 1;
+
+	return match;
+}
+
+BufFile *
+sb_combine(SharedBitsAccessor *accessor)
+{
+	/* TODO: this tries to close an outer match status file for */
+	/* each participant in the tuplestore. technically, only participants */
+	/* in the barrier could have outer match status files, however, */
+	/* all but one participant continue on and detach from the barrier */
+	/* so we won't have a reliable way to close only files for those attached */
+	/* to the barrier */
+	int			nbparticipants = 0;
+
+	for (int l = 0; l < accessor->bits->nparticipants; l++)
+	{
+		SharedBitsParticipant participant = accessor->bits->participants[l];
+
+		if (participant.present)
+		{
+			Assert(!participant.writing);
+			nbparticipants++;
+		}
+	}
+	BufFile   **statuses = palloc(sizeof(BufFile *) * nbparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	int			statuses_length = 0;
+
+	for (int i = 0; i < accessor->bits->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+
+		/* TODO: make a function that will do this */
+		snprintf(bitmap_filename, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, i);
+
+		if (!accessor->bits->participants[i].present)
+			continue;
+		BufFile    *file = BufFileOpenShared(accessor->fileset, bitmap_filename);
+
+		Assert(file);
+
+		statuses[statuses_length++] = file;
+	}
+
+	BufFile    *combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)	/* make it while not EOF */
+	{
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	accessor->combined = combined_bitmap_file;
+	return combined_bitmap_file;
+}
+
+/*  TODO: this is an API leak. We should be able to use something in the hashjoin state */
+/*  to indicate that the worker is the elected worker */
+/*  We tried using last_worker, but the problem is that last_worker can be false when */
+/*  there is a combined file (meaning this is the last worker), so, clearly, something needs */
+/*  to change about the flag. it is not expressing what it was meant to express. */
+bool
+sb_combined_exists(SharedBitsAccessor *accessor)
+{
+	return accessor->combined != NULL;
+}
+
+void
+sb_end_write(SharedBitsAccessor *sba)
+{
+	SharedBitsParticipant
+			   *const participant = &sba->bits->participants[sba->participant];
+
+	participant->writing = false;
+	BufFileClose(sba->write_file);
+	sba->write_file = NULL;
+}
+
+void
+sb_end_read(SharedBitsAccessor *accessor)
+{
+	BufFileClose(accessor->combined);
+	accessor->combined = NULL;
+}
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 0e5e9db82034..045b8eca80dc 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -98,15 +98,10 @@ struct SharedTuplestoreAccessor
 	BlockNumber write_page;		/* The next page to write to. */
 	char	   *write_pointer;	/* Current write pointer within chunk. */
 	char	   *write_end;		/* One past the end of the current chunk. */
-
-	/* Bitmap of matched outer tuples (currently only used for hashjoin). */
-	BufFile    *outer_match_status_file;
 };
 
 static void sts_filename(char *name, SharedTuplestoreAccessor *accessor,
 						 int participant);
-static void
-			sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant);
 
 /*
  * Return the amount of shared memory required to hold SharedTuplestore for a
@@ -178,7 +173,6 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	accessor->sts = sts;
 	accessor->fileset = fileset;
 	accessor->context = CurrentMemoryContext;
-	accessor->outer_match_status_file = NULL;
 
 	return accessor;
 }
@@ -641,120 +635,10 @@ sts_increment_tuplenum(SharedTuplestoreAccessor *accessor)
 	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
 }
 
-void
-sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor)
-{
-	uint32		tuplenum = pg_atomic_read_u32(&accessor->sts->ntuples);
-
-	/* don't make the outer match status file if there are no tuples */
-	if (tuplenum == 0)
-		return;
-
-	char		name[MAXPGPATH];
-
-	sts_bitmap_filename(name, accessor, accessor->participant);
-
-	accessor->outer_match_status_file = BufFileCreateShared(accessor->fileset, name);
-
-	/* TODO: check this math. tuplenumber will be too high. */
-	uint32		num_to_write = tuplenum / 8 + 1;
-
-	unsigned char byteToWrite = 0;
-
-	BufFileWrite(accessor->outer_match_status_file, &byteToWrite, num_to_write);
-
-	if (BufFileSeek(accessor->outer_match_status_file, 0, 0L, SEEK_SET))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rewind hash-join temporary file: %m")));
-}
-
-void
-sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum)
-{
-	BufFile    *parallel_outer_matchstatuses = accessor->outer_match_status_file;
-	unsigned char current_outer_byte;
-
-	BufFileSeek(parallel_outer_matchstatuses, 0, tuplenum / 8, SEEK_SET);
-	BufFileRead(parallel_outer_matchstatuses, &current_outer_byte, 1);
-
-	current_outer_byte |= 1U << (tuplenum % 8);
-
-	if (BufFileSeek(parallel_outer_matchstatuses, 0, -1, SEEK_CUR) != 0)
-		elog(ERROR, "there is a problem with outer match status file. pid %i.", MyProcPid);
-	BufFileWrite(parallel_outer_matchstatuses, &current_outer_byte, 1);
-}
-
-void
-sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor)
-{
-	BufFileClose(accessor->outer_match_status_file);
-}
-
-BufFile *
-sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor)
-{
-	/* TODO: this tries to close an outer match status file for */
-	/* each participant in the tuplestore. technically, only participants */
-	/* in the barrier could have outer match status files, however, */
-	/* all but one participant continue on and detach from the barrier */
-	/* so we won't have a reliable way to close only files for those attached */
-	/* to the barrier */
-	BufFile   **statuses = palloc(sizeof(BufFile *) * accessor->sts->nparticipants);
-
-	/*
-	 * Open the bitmap shared BufFile from each participant. TODO: explain why
-	 * file can be NULLs
-	 */
-	int			statuses_length = 0;
-
-	for (int i = 0; i < accessor->sts->nparticipants; i++)
-	{
-		char		bitmap_filename[MAXPGPATH];
-
-		sts_bitmap_filename(bitmap_filename, accessor, i);
-		BufFile    *file = BufFileOpenSharedIfExists(accessor->fileset, bitmap_filename);
-
-		if (file != NULL)
-			statuses[statuses_length++] = file;
-	}
-
-	BufFile    *combined_bitmap_file = BufFileCreateTemp(false);
-
-	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)
-		/* make it while not */
-		EOF
-	{
-		unsigned char combined_byte = 0;
-
-		for (int i = 0; i < statuses_length; i++)
-		{
-			unsigned char read_byte;
-
-			BufFileRead(statuses[i], &read_byte, 1);
-			combined_byte |= read_byte;
-		}
-
-		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
-	}
-
-	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rewind hash-join temporary file: %m")));
-
-	for (int i = 0; i < statuses_length; i++)
-		BufFileClose(statuses[i]);
-	pfree(statuses);
-
-	return combined_bitmap_file;
-}
-
-
-static void
-sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant)
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
 {
-	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->sts->name, participant);
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
 }
 
 /*
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index b2cc12dc19be..164a97ef9625 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -193,10 +194,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -221,9 +229,9 @@ typedef struct ParallelHashJoinBatchAccessor
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
 	bool		done;			/* flag to remember that a batch is done */
-	BufFile    *combined_bitmap;	/* for Adaptive Hashjoin only  */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -270,6 +278,7 @@ typedef struct ParallelHashJoinState
 	pg_atomic_uint32 distributor;	/* counter for load balancing */
 
 	SharedFileSet fileset;		/* space for shared temporary files */
+	SharedFileSet sbfileset;
 } ParallelHashJoinState;
 
 /* The phases for building batches, used by build_barrier. */
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f790f7e12186..82c0f8361154 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,6 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name);
 extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
 
diff --git a/src/include/utils/sharedbits.h b/src/include/utils/sharedbits.h
new file mode 100644
index 000000000000..a554a59a38b8
--- /dev/null
+++ b/src/include/utils/sharedbits.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * sharedbits.h
+ *	  Simple mechanism for sharing bits between backends.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/sharedbits.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHAREDBITS_H
+#define SHAREDBITS_H
+
+#include "storage/sharedfileset.h"
+
+struct SharedBits;
+typedef struct SharedBits SharedBits;
+
+struct SharedBitsParticipant;
+typedef struct SharedBitsParticipant SharedBitsParticipant;
+
+struct SharedBitsAccessor;
+typedef struct SharedBitsAccessor SharedBitsAccessor;
+
+extern SharedBitsAccessor *sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset);
+extern SharedBitsAccessor *sb_initialize(SharedBits *sbits, int participants, int my_participant_number, SharedFileSet *fileset, char *name);
+extern void sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits);
+extern size_t sb_estimate(int participants);
+
+extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bit);
+extern bool sb_checkbit(SharedBitsAccessor *accessor, uint32 n);
+extern BufFile *sb_combine(SharedBitsAccessor *accessor);
+extern bool sb_combined_exists(SharedBitsAccessor *accessor);
+
+extern void sb_end_write(SharedBitsAccessor *sba);
+extern void sb_end_read(SharedBitsAccessor *accessor);
+
+#endif							/* SHAREDBITS_H */
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 8b2433e5c4b0..5e78f4bb15b7 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -71,11 +71,6 @@ extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 
 
 extern uint32 sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
-
-extern void sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor);
-extern void sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum);
-extern void sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor);
-extern BufFile *sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor);
-
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
 
 #endif							/* SHAREDTUPLESTORE_H */
-- 
2.25.0

v4-0001-Implement-Adaptive-Hashjoin.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Implement-Adaptive-Hashjoin.patchDownload

From e0f27230b71ab516434083552823d10df1a0069d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 29 Dec 2019 18:56:42 -0800
Subject: [PATCH v4 1/4] Implement Adaptive Hashjoin

Serial Hashloop Fallback:

"Chunk" the inner file into arbitrary partitions of work_mem size
offset along tuple bounds while loading the batch into the hashtable.

Note that this makes it impossible to increase nbatches during the
loading of batches after initial hashtable creation.

In preparation for doing this chunking, separate "advance batch" and
"load batch".

Implement outer tuple batch rewinding per chunk of inner batch. Would
be a simple rewind and replay of outer side for each chunk of inner if
it weren't for LOJ. Because we need to wait to emit NULL-extended
tuples for LOJ until after all chunks of inner have been processed.

To do this without incurring additional memory pressure, use a
temporary BufFile to capture the match status of each outer side
tuple. Use one bit per tuple to represent the match status, and, since
for parallel-oblivious hashjoin the outer side tuples are encountered
in a deterministic order, synchronizing the outer tuples match status
file with the outer tuples in the batch file to decide which ones to
emit NULL-extended is easy and can be done with a simple counter.

For non-hashloop fallback scenario (including batch 0), this file is
not created and unmatched outer tuples should be emitted as they are
encountered.

Parallel Hashloop Fallback:

During initial allocation of the hashtable, each time the number of
batches is increased, a new variable in the ParallelHashJoinState,
batch_increases, is incremented.

In PHJ_GROW_BATCHES_DECIDING, if pstate->batch_increases >= 2,
parallel_hashloop_fallback will be enabled for qualifying batches.
From then on, if a batch is still too large to fit into the
space_allowed, then parallel_hashloop_fallback is set on that batch.
It will not be allowed to divide further and, during execution, the
fallback strategy will be used.

For a batch which has parallel_hashloop_fallback set, tuples inserted
into the the batch's inner and outer batch files will have an
additional piece of metadata (other than the hashvalue). For the inner
side, this additional metadata is the chunk number, For the outer
side, this additional metadata is the tuple identifier--needed when
rescanning the outer side batch file for each chunk of the inner.

During execution of a parallel hashjoin batch which needs to fall
back, the worker will create an "outer match status file" which
contains a bitmap tracking which outer tuples have matched an inner
tuple. All bits in the worker's outer match status file are initially
unset. During probing, the worker will set the corresponding bit (the
bit at the index of the tuple identifier) in the outer match status
bitmap for an outer tuple which matches any inner tuple.

Workers probing a fallback batch will wait until all workers have
finished probing before moving on so that an elected worker can read
and combine the outer match status files into a single bitmap and use
it to emit unmatched outer tuples after all chunks of the inner side
have been processed.
---
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/adaptiveHashjoin.c       |  349 +++++
 src/backend/executor/nodeHash.c               |  127 +-
 src/backend/executor/nodeHashjoin.c           | 1171 +++++++++++-----
 src/backend/postmaster/pgstat.c               |   21 +
 src/backend/storage/file/buffile.c            |   65 +
 src/backend/storage/ipc/barrier.c             |   85 ++
 src/backend/utils/sort/sharedtuplestore.c     |  133 ++
 src/include/executor/adaptiveHashjoin.h       |    9 +
 src/include/executor/hashjoin.h               |   28 +-
 src/include/executor/nodeHash.h               |    5 +-
 src/include/executor/tuptable.h               |    3 +-
 src/include/nodes/execnodes.h                 |   17 +
 src/include/pgstat.h                          |    8 +
 src/include/storage/barrier.h                 |    1 +
 src/include/storage/buffile.h                 |    3 +
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/sharedtuplestore.h          |   22 +
 src/test/regress/expected/adaptive_hj.out     | 1233 +++++++++++++++++
 .../regress/expected/parallel_adaptive_hj.out |  343 +++++
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/post_schedule                |    8 +
 src/test/regress/pre_schedule                 |  120 ++
 src/test/regress/serial_schedule              |    2 +
 src/test/regress/sql/adaptive_hj.sql          |  240 ++++
 src/test/regress/sql/parallel_adaptive_hj.sql |  182 +++
 26 files changed, 3829 insertions(+), 350 deletions(-)
 create mode 100644 src/backend/executor/adaptiveHashjoin.c
 create mode 100644 src/include/executor/adaptiveHashjoin.h
 create mode 100644 src/test/regress/expected/adaptive_hj.out
 create mode 100644 src/test/regress/expected/parallel_adaptive_hj.out
 create mode 100644 src/test/regress/post_schedule
 create mode 100644 src/test/regress/pre_schedule
 create mode 100644 src/test/regress/sql/adaptive_hj.sql
 create mode 100644 src/test/regress/sql/parallel_adaptive_hj.sql

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b5f..54799d764438 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	adaptiveHashjoin.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
new file mode 100644
index 000000000000..dff5b38d38f8
--- /dev/null
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -0,0 +1,349 @@
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "executor/hashjoin.h"
+#include "executor/nodeHash.h"
+#include "executor/nodeHashjoin.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/sharedtuplestore.h"
+
+#include "executor/adaptiveHashjoin.h"
+
+
+
+
+bool
+ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
+{
+	HashJoinTable hashtable;
+	int			batchno;
+	ParallelHashJoinBatch *phj_batch;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+	Barrier    *chunk_barrier;
+
+	hashtable = hjstate->hj_HashTable;
+	batchno = hashtable->curbatch;
+	phj_batch = hashtable->batches[batchno].shared;
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	/*
+	 * This chunk_barrier is initialized in the ELECTING phase when this
+	 * worker attached to the batch in ExecParallelHashJoinNewBatch()
+	 */
+	chunk_barrier = &hashtable->batches[batchno].shared->chunk_barrier;
+
+	/*
+	 * If this worker just came from probing (from HJ_SCAN_BUCKET) we need to
+	 * advance the chunk number here. Otherwise this worker isn't attached yet
+	 * to the chunk barrier.
+	 */
+	if (advance_from_probing)
+	{
+		/*
+		 * The current chunk number can't be incremented if *any* worker isn't
+		 * done yet (otherwise they might access the wrong data structure!)
+		 */
+		if (BarrierArriveAndWait(chunk_barrier,
+								 WAIT_EVENT_HASH_CHUNK_PROBING))
+			phj_batch->current_chunk_num++;
+
+		/* Once the barrier is advanced we'll be in the DONE phase */
+	}
+	else
+		BarrierAttach(chunk_barrier);
+
+	/*
+	 * The outer side is exhausted and either 1) the current chunk of the
+	 * inner side is exhausted and it is time to advance the chunk 2) the last
+	 * chunk of the inner side is exhausted and it is time to advance the
+	 * batch
+	 */
+	switch (BarrierPhase(chunk_barrier))
+	{
+			/*
+			 * TODO: remove this phase and coordinate access to hashtable
+			 * above goto and after incrementing current_chunk_num
+			 */
+		case PHJ_CHUNK_ELECTING:
+	phj_chunk_electing:
+			BarrierArriveAndWait(chunk_barrier,
+								 WAIT_EVENT_HASH_CHUNK_ELECTING);
+			/* Fall through. */
+
+		case PHJ_CHUNK_LOADING:
+			/* Start (or join in) loading the next chunk of inner tuples. */
+			sts_begin_parallel_scan(inner_tuples);
+
+			MinimalTuple tuple;
+			tupleMetadata metadata;
+
+			while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+			{
+				if (metadata.tupleid != phj_batch->current_chunk_num)
+					continue;
+
+				ExecForceStoreMinimalTuple(tuple,
+										   hjstate->hj_HashTupleSlot,
+										   false);
+
+				ExecParallelHashTableInsertCurrentBatch(
+														hashtable,
+														hjstate->hj_HashTupleSlot,
+														metadata.hashvalue);
+			}
+			sts_end_parallel_scan(inner_tuples);
+			BarrierArriveAndWait(chunk_barrier,
+								 WAIT_EVENT_HASH_CHUNK_LOADING);
+			/* Fall through. */
+
+		case PHJ_CHUNK_PROBING:
+			sts_begin_parallel_scan(outer_tuples);
+			return true;
+
+		case PHJ_CHUNK_DONE:
+
+			BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_DONE);
+
+			if (phj_batch->current_chunk_num > phj_batch->total_num_chunks)
+			{
+				BarrierDetach(chunk_barrier);
+				return false;
+			}
+
+			/*
+			 * Otherwise it is time for the next chunk. One worker should
+			 * reset the hashtable
+			 */
+			if (BarrierArriveExplicitAndWait(chunk_barrier, PHJ_CHUNK_ELECTING, WAIT_EVENT_HASH_ADVANCE_CHUNK))
+			{
+				/*
+				 * rewind/reset outer tuplestore and rewind outer match status
+				 * files
+				 */
+				sts_reinitialize(outer_tuples);
+
+				/*
+				 * reset inner's hashtable and recycle the existing bucket
+				 * array.
+				 */
+				dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
+				dsa_get_address(hashtable->area, phj_batch->buckets);
+
+				for (size_t i = 0; i < hashtable->nbuckets; ++i)
+					dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+
+				/*
+				 * TODO: this will unfortunately rescan all inner tuples in
+				 * the batch for each chunk
+				 */
+
+				/*
+				 * should be able to save the block in the file which starts
+				 * the next chunk instead
+				 */
+				sts_reinitialize(inner_tuples);
+			}
+			goto phj_chunk_electing;
+
+		case PHJ_CHUNK_FINAL:
+			BarrierDetach(chunk_barrier);
+			return false;
+
+		default:
+			elog(ERROR, "unexpected chunk phase %d. pid %i. batch %i.",
+				 BarrierPhase(chunk_barrier), MyProcPid, batchno);
+	}
+
+	return false;
+}
+
+
+/*
+ * Choose a batch to work on, and attach to it.  Returns true if successful,
+ * false if there are no more batches.
+ */
+bool
+ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			start_batchno;
+	int			batchno;
+
+	/*
+	 * If we started up so late that the batch tracking array has been freed
+	 * already by ExecHashTableDetach(), then we are finished.  See also
+	 * ExecParallelHashEnsureBatchAccessors().
+	 */
+	if (hashtable->batches == NULL)
+		return false;
+
+	/*
+	 * For hashloop fallback only Only the elected worker who was chosen to
+	 * combine the outer match status bitmaps should reach here. This worker
+	 * must do some final cleanup and then detach from the batch
+	 */
+	if (hjstate->combined_bitmap != NULL)
+	{
+		BufFileClose(hjstate->combined_bitmap);
+		hjstate->combined_bitmap = NULL;
+		hashtable->batches[hashtable->curbatch].done = true;
+		ExecHashTableDetachBatch(hashtable);
+	}
+
+	/*
+	 * If we were already attached to a batch, remember not to bother checking
+	 * it again, and detach from it (possibly freeing the hash table if we are
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_CHUNKING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_CHUNKING
+	 * case). The PHJ_BATCH_CHUNKING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
+	 */
+	if (hashtable->curbatch >= 0)
+	{
+		ParallelHashJoinBatchAccessor *accessor = hashtable->batches + hashtable->curbatch;
+		ParallelHashJoinBatch *batch = accessor->shared;
+
+		/*
+		 * End the parallel scan on the outer tuples before we arrive at the
+		 * next barrier so that the last worker to arrive at that barrier can
+		 * reinitialize the SharedTuplestore for another parallel scan.
+		 */
+
+		if (!batch->parallel_hashloop_fallback)
+			BarrierArriveAndWait(&batch->batch_barrier,
+								 WAIT_EVENT_HASH_BATCH_PROBING);
+		else
+		{
+			sts_close_outer_match_status_file(accessor->outer_tuples);
+
+			/*
+			 * If all workers (including this one) have finished probing the
+			 * batch, one worker is elected to Combine all the outer match
+			 * status files from the workers who were attached to this batch
+			 * Loop through the outer match status files from all workers that
+			 * were attached to this batch Combine them into one bitmap Use
+			 * the bitmap, loop through the outer batch file again, and emit
+			 * unmatched tuples
+			 */
+
+			if (BarrierArriveAndWait(&batch->batch_barrier,
+									 WAIT_EVENT_HASH_BATCH_PROBING))
+			{
+				hjstate->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				hjstate->last_worker = true;
+				return true;
+			}
+		}
+
+		/* the elected combining worker should not reach here */
+		hashtable->batches[hashtable->curbatch].done = true;
+		ExecHashTableDetachBatch(hashtable);
+	}
+
+	/*
+	 * Search for a batch that isn't done.  We use an atomic counter to start
+	 * our search at a different batch in every participant when there are
+	 * more batches than participants.
+	 */
+	batchno = start_batchno =
+		pg_atomic_fetch_add_u32(&hashtable->parallel_state->distributor, 1) %
+		hashtable->nbatch;
+
+	do
+	{
+		if (!hashtable->batches[batchno].done)
+		{
+			Barrier    *batch_barrier =
+			&hashtable->batches[batchno].shared->batch_barrier;
+
+			switch (BarrierAttach(batch_barrier))
+			{
+				case PHJ_BATCH_ELECTING:
+					/* One backend allocates the hash table. */
+					if (BarrierArriveAndWait(batch_barrier,
+											 WAIT_EVENT_HASH_BATCH_ELECTING))
+					{
+						ExecParallelHashTableAlloc(hashtable, batchno);
+						Barrier    *chunk_barrier =
+						&hashtable->batches[batchno].shared->chunk_barrier;
+
+						BarrierInit(chunk_barrier, 0);
+						hashtable->batches[batchno].shared->current_chunk_num = 1;
+					}
+					/* Fall through. */
+
+				case PHJ_BATCH_ALLOCATING:
+					/* Wait for allocation to complete. */
+					BarrierArriveAndWait(batch_barrier,
+										 WAIT_EVENT_HASH_BATCH_ALLOCATING);
+					/* Fall through. */
+
+				case PHJ_BATCH_CHUNKING:
+
+					/*
+					 * This batch is ready to probe.  Return control to
+					 * caller. We stay attached to batch_barrier so that the
+					 * hash table stays alive until everyone's finished
+					 * probing it, but no participant is allowed to wait at
+					 * this barrier again (or else a deadlock could occur).
+					 * All attached participants must eventually call
+					 * BarrierArriveAndDetach() so that the final phase
+					 * PHJ_BATCH_DONE can be reached.
+					 */
+					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
+
+					if (batchno == 0)
+						sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
+
+					/*
+					 * Create an outer match status file for this batch for
+					 * this worker This file must be accessible to the other
+					 * workers But *only* written to by this worker. Written
+					 * to by this worker and readable by any worker
+					 */
+					if (hashtable->batches[batchno].shared->parallel_hashloop_fallback)
+						sts_make_outer_match_status_file(hashtable->batches[batchno].outer_tuples);
+
+					return true;
+
+				case PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING:
+
+					/*
+					 * The batch isn't done but this worker can't contribute
+					 * anything to it so it might as well be done from this
+					 * worker's perspective. (Only one worker can do work in
+					 * this phase).
+					 */
+
+					/* Fall through. */
+
+				case PHJ_BATCH_DONE:
+
+					/*
+					 * Already done. Detach and go around again (if any
+					 * remain).
+					 */
+					BarrierDetach(batch_barrier);
+
+					hashtable->batches[batchno].done = true;
+					hashtable->curbatch = -1;
+					break;
+
+				default:
+					elog(ERROR, "unexpected batch phase %d. pid %i. batchno %i.",
+						 BarrierPhase(batch_barrier), MyProcPid, batchno);
+			}
+		}
+		batchno = (batchno + 1) % hashtable->nbatch;
+	} while (batchno != start_batchno);
+
+	return false;
+}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index b6d508490864..c5420b169e6c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -588,7 +588,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		 * Attach to the build barrier.  The corresponding detach operation is
 		 * in ExecHashTableDetach.  Note that we won't attach to the
 		 * batch_barrier for batch 0 yet.  We'll attach later and start it out
-		 * in PHJ_BATCH_PROBING phase, because batch 0 is allocated up front
+		 * in PHJ_BATCH_CHUNKING phase, because batch 0 is allocated up front
 		 * and then loaded while hashing (the standard hybrid hash join
 		 * algorithm), and we'll coordinate that using build_barrier.
 		 */
@@ -1061,6 +1061,9 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			i;
 
 	Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
+	LWLockAcquire(&pstate->lock, LW_EXCLUSIVE);
+	pstate->batch_increases++;
+	LWLockRelease(&pstate->lock);
 
 	/*
 	 * It's unlikely, but we need to be prepared for new participants to show
@@ -1216,11 +1219,17 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 			{
 				bool		space_exhausted = false;
 				bool		extreme_skew_detected = false;
+				bool		excessive_batch_num_increases = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
 				ExecParallelHashTableSetCurrentBatch(hashtable, 0);
 
+				LWLockAcquire(&pstate->lock, LW_EXCLUSIVE);
+				if (pstate->batch_increases >= 2)
+					excessive_batch_num_increases = true;
+				LWLockRelease(&pstate->lock);
+
 				/* Are any of the new generation of batches exhausted? */
 				for (i = 0; i < hashtable->nbatch; ++i)
 				{
@@ -1233,6 +1242,36 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 
 						space_exhausted = true;
 
+						/*
+						 * only once we've increased the number of batches
+						 * overall many times should we start setting
+						 */
+
+						/*
+						 * some batches to use the fallback strategy. Those
+						 * that are still too big will have this option set
+						 */
+
+						/*
+						 * we better not repartition again (growth should be
+						 * disabled), so that we don't overwrite this value
+						 */
+
+						/*
+						 * this tells us if we have set fallback to true or
+						 * not and how many chunks -- useful for seeing how
+						 * many chunks
+						 */
+
+						/*
+						 * we can get to before setting it to true (since we
+						 * still mark chunks (work_mem sized chunks)) in
+						 * batches even if we don't fall back
+						 */
+						/* same for below but opposite */
+						if (excessive_batch_num_increases == true)
+							batch->parallel_hashloop_fallback = true;
+
 						/*
 						 * Did this batch receive ALL of the tuples from its
 						 * parent batch?  That would indicate that further
@@ -1248,6 +1287,8 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* Don't keep growing if it's not helping or we'd overflow. */
 				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
+				else if (excessive_batch_num_increases && space_exhausted)
+					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
 				else
@@ -1315,9 +1356,27 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
 				/* It belongs in a later batch. */
+				ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+				LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+				/* TODO: should I check batch estimated size here at all? */
+				if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > hashtable->parallel_state->space_allowed))
+				{
+					phj_batch->total_num_chunks++;
+					phj_batch->estimated_chunk_size = tuple_size;
+				}
+				else
+					phj_batch->estimated_chunk_size += tuple_size;
+
+				tupleMetadata metadata;
+
+				metadata.hashvalue = hashTuple->hashvalue;
+				metadata.tupleid = phj_batch->total_num_chunks;
+				LWLockRelease(&phj_batch->lock);
+
 				hashtable->batches[batchno].estimated_size += tuple_size;
 				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+							 &metadata, tuple);
 			}
 
 			/* Count this tuple. */
@@ -1369,12 +1428,15 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+		tupleMetadata metadata;
+
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &metadata)))
 		{
 			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
 
+			hashvalue = metadata.hashvalue;
 			/* Decide which partition it goes to in the new generation. */
 			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
 									  &batchno);
@@ -1383,10 +1445,27 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 			++hashtable->batches[batchno].ntuples;
 			++hashtable->batches[i].old_ntuples;
 
+			ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+			LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+			/* TODO: should I check batch estimated size here at all? */
+			if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
+			{
+				phj_batch->total_num_chunks++;
+				phj_batch->estimated_chunk_size = tuple_size;
+			}
+			else
+				phj_batch->estimated_chunk_size += tuple_size;
+			metadata.tupleid = phj_batch->total_num_chunks;
+			LWLockRelease(&phj_batch->lock);
 			/* Store the tuple its new batch. */
 			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+						 &metadata, tuple);
 
+			/*
+			 * TODO: should I zero out metadata here to make sure old values
+			 * aren't reused?
+			 */
 			CHECK_FOR_INTERRUPTS();
 		}
 		sts_end_parallel_scan(old_inner_tuples[i]);
@@ -1719,6 +1798,7 @@ retry:
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
 		Assert(batchno > 0);
+		ParallelHashJoinState *pstate = hashtable->parallel_state;
 
 		/* Try to preallocate space in the batch if necessary. */
 		if (hashtable->batches[batchno].preallocated < tuple_size)
@@ -1729,7 +1809,31 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
+		ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+		LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+
+		/* TODO: should batch estimated size be considered here? */
+
+		/*
+		 * TODO: should this be done in
+		 * ExecParallelHashTableInsertCurrentBatch instead?
+		 */
+		if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
+		{
+			phj_batch->total_num_chunks++;
+			phj_batch->estimated_chunk_size = tuple_size;
+		}
+		else
+			phj_batch->estimated_chunk_size += tuple_size;
+
+		tupleMetadata metadata;
+
+		metadata.hashvalue = hashvalue;
+		metadata.tupleid = phj_batch->total_num_chunks;
+		LWLockRelease(&phj_batch->lock);
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata,
 					 tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
@@ -2936,6 +3040,13 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
 		char		name[MAXPGPATH];
 
+		shared->parallel_hashloop_fallback = false;
+		LWLockInitialize(&shared->lock,
+						 LWTRANCHE_PARALLEL_HASH_JOIN_BATCH);
+		shared->current_chunk_num = 0;
+		shared->total_num_chunks = 1;
+		shared->estimated_chunk_size = 0;
+
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
@@ -2945,7 +3056,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 		{
 			/* Batch 0 doesn't need to be loaded. */
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_CHUNKING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
 		}
@@ -2959,7 +3070,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2969,7 +3080,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 67c717910f5c..39a03000f8da 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -81,11 +81,11 @@
  *  PHJ_BATCH_ELECTING       -- initial state
  *  PHJ_BATCH_ALLOCATING     -- one allocates buckets
  *  PHJ_BATCH_LOADING        -- all load the hash table from disk
- *  PHJ_BATCH_PROBING        -- all probe
+ *  PHJ_BATCH_CHUNKING       -- all probe
  *  PHJ_BATCH_DONE           -- end
  *
  * Batch 0 is a special case, because it starts out in phase
- * PHJ_BATCH_PROBING; populating batch 0's hash table is done during
+ * PHJ_BATCH_CHUNKING; populating batch 0's hash table is done during
  * PHJ_BUILD_HASHING_INNER so we can skip loading.
  *
  * Initially we try to plan for a single-batch hash join using the combined
@@ -98,7 +98,7 @@
  * already arrived.  Practically, that means that we never return a tuple
  * while attached to a barrier, unless the barrier has reached its final
  * state.  In the slightly special case of the per-batch barrier, we return
- * tuples while in PHJ_BATCH_PROBING phase, but that's OK because we use
+ * tuples while in PHJ_BATCH_CHUNKING phase, but that's OK because we use
  * BarrierArriveAndDetach() to advance it to PHJ_BATCH_DONE without waiting.
  *
  *-------------------------------------------------------------------------
@@ -117,6 +117,8 @@
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
+#include "executor/adaptiveHashjoin.h"
+
 
 /*
  * States of the ExecHashJoin state machine
@@ -124,9 +126,11 @@
 #define HJ_BUILD_HASHTABLE		1
 #define HJ_NEED_NEW_OUTER		2
 #define HJ_SCAN_BUCKET			3
-#define HJ_FILL_OUTER_TUPLE		4
-#define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_FILL_INNER_TUPLES    4
+#define HJ_NEED_NEW_BATCH		5
+#define HJ_NEED_NEW_INNER_CHUNK 6
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT 7
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER 8
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +147,15 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
-static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
-static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+
+static bool ExecHashJoinAdvanceBatch(HashJoinState *hjstate);
+static bool ExecHashJoinLoadInnerBatch(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
 
+static TupleTableSlot *emitUnmatchedOuterTuple(ExprState *otherqual,
+											   ExprContext *econtext,
+											   HashJoinState *hjstate);
+
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -161,8 +170,15 @@ static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
  *			  the other one is "outer".
  * ----------------------------------------------------------------
  */
-static pg_attribute_always_inline TupleTableSlot *
-ExecHashJoinImpl(PlanState *pstate, bool parallel)
+
+/* ----------------------------------------------------------------
+ *		ExecHashJoin
+ *
+ *		Parallel-oblivious version.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *			/* return: a tuple or NULL */
+ExecHashJoin(PlanState *pstate)
 {
 	HashJoinState *node = castNode(HashJoinState, pstate);
 	PlanState  *outerNode;
@@ -174,7 +190,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	TupleTableSlot *outerTupleSlot;
 	uint32		hashvalue;
 	int			batchno;
-	ParallelHashJoinState *parallel_state;
+
+	BufFile    *outerFileForAdaptiveRead;
 
 	/*
 	 * get information from HashJoin node
@@ -185,7 +202,6 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	outerNode = outerPlanState(node);
 	hashtable = node->hj_HashTable;
 	econtext = node->js.ps.ps_ExprContext;
-	parallel_state = hashNode->parallel_state;
 
 	/*
 	 * Reset per-tuple memory context to free any expression evaluation
@@ -243,18 +259,6 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					/* no chance to not build the hash table */
 					node->hj_FirstOuterTupleSlot = NULL;
 				}
-				else if (parallel)
-				{
-					/*
-					 * The empty-outer optimization is not implemented for
-					 * shared hash tables, because no one participant can
-					 * determine that there are no outer tuples, and it's not
-					 * yet clear that it's worth the synchronization overhead
-					 * of reaching consensus to figure that out.  So we have
-					 * to build the hash table.
-					 */
-					node->hj_FirstOuterTupleSlot = NULL;
-				}
 				else if (HJ_FILL_OUTER(node) ||
 						 (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
 						  !node->hj_OuterNotEmpty))
@@ -271,16 +275,527 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				else
 					node->hj_FirstOuterTupleSlot = NULL;
 
+				/* Create the hash table. */
+				hashtable = ExecHashTableCreate(hashNode,
+												node->hj_HashOperators,
+												node->hj_Collations,
+												HJ_FILL_INNER(node));
+				node->hj_HashTable = hashtable;
+
+				/* Execute the Hash node, to build the hash table. */
+				hashNode->hashtable = hashtable;
+				(void) MultiExecProcNode((PlanState *) hashNode);
+
+				/*
+				 * If the inner relation is completely empty, and we're not
+				 * doing a left outer join, we can quit without scanning the
+				 * outer relation.
+				 */
+				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
+					return NULL;
+
+				/*
+				 * need to remember whether nbatch has increased since we
+				 * began scanning the outer relation
+				 */
+				hashtable->nbatch_outstart = hashtable->nbatch;
+
+				/*
+				 * Reset OuterNotEmpty for scan.  (It's OK if we fetched a
+				 * tuple above, because ExecHashJoinOuterGetTuple will
+				 * immediately set it again.)
+				 */
+				node->hj_OuterNotEmpty = false;
+
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				/* FALL THRU */
+
+			case HJ_NEED_NEW_OUTER:
+
+				/*
+				 * We don't have an outer tuple, try to get the next one
+				 */
+				outerTupleSlot =
+					ExecHashJoinOuterGetTuple(outerNode, node, &hashvalue);
+
+				if (TupIsNull(outerTupleSlot))
+				{
+					/*
+					 * end of batch, or maybe whole join. for hashloop
+					 * fallback, all we know is outer batch is exhausted.
+					 * inner could have more chunks
+					 */
+					if (HJ_FILL_INNER(node))
+					{
+						/* set up to scan for unmatched inner tuples */
+						ExecPrepHashTableForUnmatched(node);
+						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+						break;
+					}
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					break;
+				}
+
+				econtext->ecxt_outertuple = outerTupleSlot;
+
+				/*
+				 * Find the corresponding bucket for this tuple in the main
+				 * hash table or skew hash table.
+				 */
+				node->hj_CurHashValue = hashvalue;
+				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+										  &node->hj_CurBucketNo, &batchno);
+				node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable,
+																 hashvalue);
+				node->hj_CurTuple = NULL;
+
+				/*
+				 * for the hashloop fallback case, only initialize
+				 * hj_MatchedOuter to false during the first chunk. otherwise,
+				 * we will be resetting hj_MatchedOuter to false for an outer
+				 * tuple that has already matched an inner tuple. also,
+				 * hj_MatchedOuter should be set to false for batch 0. there
+				 * are no chunks for batch 0, and node->hj_InnerFirstChunk
+				 * isn't set to true until HJ_NEED_NEW_BATCH, so need to
+				 * handle batch 0 explicitly
+				 */
+
+				if (!node->hashloop_fallback || hashtable->curbatch == 0 || node->hj_InnerFirstChunk)
+					node->hj_MatchedOuter = false;
+
+				/*
+				 * The tuple might not belong to the current batch (where
+				 * "current batch" includes the skew buckets if any).
+				 */
+				if (batchno != hashtable->curbatch &&
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+				{
+					bool		shouldFree;
+					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
+																	  &shouldFree);
+
+					/*
+					 * Need to postpone this outer tuple to a later batch.
+					 * Save it in the corresponding outer-batch file.
+					 */
+					Assert(batchno > hashtable->curbatch);
+					ExecHashJoinSaveTuple(mintuple, hashvalue,
+										  &hashtable->outerBatchFile[batchno]);
+
+					if (shouldFree)
+						heap_free_minimal_tuple(mintuple);
+
+					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
+					continue;
+				}
+
+				if (node->hashloop_fallback)
+				{
+					/* first tuple of new batch */
+					if (node->hj_OuterMatchStatusesFile == NULL)
+					{
+						node->hj_OuterTupleCount = 0;
+						node->hj_OuterMatchStatusesFile = BufFileCreateTemp(false);
+					}
+
+					/* for fallback case, always increment tuple count */
+					node->hj_OuterTupleCount++;
+
+					/* Use the next byte on every 8th tuple */
+					if ((node->hj_OuterTupleCount - 1) % 8 == 0)
+					{
+						/*
+						 * first chunk of new batch, so write and initialize
+						 * enough bytes in the outer tuple match status file
+						 * to capture all tuples' match statuses
+						 */
+						if (node->hj_InnerFirstChunk)
+						{
+							node->hj_OuterCurrentByte = 0;
+							BufFileWrite(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+						}
+						/* otherwise, just read the next byte */
+						else
+							BufFileRead(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+					}
+				}
+
+				/* OK, let's scan the bucket for matches */
+				node->hj_JoinState = HJ_SCAN_BUCKET;
+
+				/* FALL THRU */
+
+			case HJ_SCAN_BUCKET:
+
+				/*
+				 * Scan the selected hash bucket for matches to current outer
+				 */
+				if (!ExecScanHashBucket(node, econtext))
+				{
+					/*
+					 * The current outer tuple has run out of matches, so
+					 * check whether to emit a dummy outer-join tuple.
+					 * Whether we emit one or not, the next state is
+					 * NEED_NEW_OUTER.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					if (!node->hashloop_fallback || node->hj_HashTable->curbatch == 0)
+					{
+						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+
+						if (slot != NULL)
+							return slot;
+					}
+					continue;
+				}
+
+				if (joinqual != NULL && !ExecQual(joinqual, econtext))
+				{
+					InstrCountFiltered1(node, 1);
+					break;
+				}
+
+				/*
+				 * We've got a match, but still need to test non-hashed quals.
+				 * ExecScanHashBucket already set up all the state needed to
+				 * call ExecQual.
+				 *
+				 * If we pass the qual, then save state for next call and have
+				 * ExecProject form the projection, store it in the tuple
+				 * table, and return the slot.
+				 *
+				 * Only the joinquals determine tuple match status, but all
+				 * quals must pass to actually return the tuple.
+				 */
+
+				node->hj_MatchedOuter = true;
+				HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+
+				/* In an antijoin, we never return a matched tuple */
+				if (node->js.jointype == JOIN_ANTI)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					continue;
+				}
+
+				/*
+				 * If we only need to join to the first matching inner tuple,
+				 * then consider returning this one, but after that, continue
+				 * with next outer tuple.
+				 */
+				/* TODO: is semi-join correct for AHJ */
+				if (node->js.single_match)
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				/*
+				 * Set the match bit for this outer tuple in the match status
+				 * file
+				 */
+				if (node->hj_OuterMatchStatusesFile != NULL)
+				{
+					Assert(node->hashloop_fallback == true);
+					int			byte_to_set = (node->hj_OuterTupleCount - 1) / 8;
+					int			bit_to_set_in_byte = (node->hj_OuterTupleCount - 1) % 8;
+
+					BufFileSeek(node->hj_OuterMatchStatusesFile, 0, byte_to_set, SEEK_SET);
+
+					node->hj_OuterCurrentByte = node->hj_OuterCurrentByte | (1 << bit_to_set_in_byte);
+
+					BufFileWrite(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+				}
+
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				InstrCountFiltered2(node, 1);
+				break;
+
+			case HJ_FILL_INNER_TUPLES:
+
+				/*
+				 * We have finished a batch, but we are doing right/full join,
+				 * so any unmatched inner tuples in the hashtable have to be
+				 * emitted before we continue to the next batch.
+				 */
+				if (!ExecScanHashTableForUnmatched(node, econtext))
+				{
+					/* no more unmatched tuples */
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					continue;
+				}
+
+				/*
+				 * Generate a fake join tuple with nulls for the outer tuple,
+				 * and return it if it passes the non-join quals.
+				 */
+				econtext->ecxt_outertuple = node->hj_NullOuterTupleSlot;
+
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				InstrCountFiltered2(node, 1);
+				break;
+
+			case HJ_NEED_NEW_BATCH:
+
+				/*
+				 * Try to advance to next batch.  Done if there are no more.
+				 * for batches after batch 0 for which hashloop_fallback is
+				 * true, if inner is exhausted, need to consider emitting
+				 * unmatched tuples we should never get here when
+				 * hashloop_fallback is false but hj_InnerExhausted is true,
+				 * however, it felt more clear to check for hashloop_fallback
+				 * explicitly
+				 */
+				if (node->hashloop_fallback && HJ_FILL_OUTER(node) && node->hj_InnerExhausted)
+				{
+					/*
+					 * For hashloop fallback, outer tuples are not emitted
+					 * until directly before advancing the batch (after all
+					 * inner chunks have been processed).
+					 * node->hashloop_fallback should be true because it is
+					 * not reset to false until advancing the batches
+					 */
+					node->hj_InnerExhausted = false;
+					node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT;
+					break;
+				}
+
+				if (!ExecHashJoinAdvanceBatch(node))
+					return NULL;
+
+				/*
+				 * TODO: need to find a better way to distinguish if I should
+				 * load inner batch again than checking for outer batch file
+				 */
+				/* I need to also do this even if it is NULL when it is a ROJ */
+
+				/*
+				 * need to load inner again if it is an inner or left outer
+				 * join and there are outer tuples in the batch OR
+				 */
+
+				/*
+				 * if it is a ROJ and there are inner tuples in the batch --
+				 * should never have no tuples in either batch...
+				 */
+				if (BufFileRewindIfExists(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]) != NULL ||
+					(node->hj_HashTable->innerBatchFile[node->hj_HashTable->curbatch] != NULL && HJ_FILL_INNER(node)))
+					ExecHashJoinLoadInnerBatch(node);	/* TODO: should I ever
+														 * load inner when outer
+														 * file is not present? */
+
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				break;
+
+			case HJ_NEED_NEW_INNER_CHUNK:
+
+				if (!node->hashloop_fallback)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/*
+				 * it is the hashloop fallback case and there are no more
+				 * chunks inner is exhausted, so we must advance the batches
+				 */
+				if (node->hj_InnerPageOffset == 0L)
+				{
+					node->hj_InnerExhausted = true;
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/*
+				 * This is the hashloop fallback case and we have more chunks
+				 * in inner. curbatch > 0. Rewind outer batch file (if
+				 * present) so that we can start reading it. Rewind outer
+				 * match statuses file if present so that we can set match
+				 * bits as needed. Reset the tuple count and load the next
+				 * chunk of inner. Then proceed to get a new outer tuple from
+				 * our rewound outer batch file
+				 */
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				/*
+				 * TODO: need to find a better way to distinguish if I should
+				 * load inner batch again than checking for outer batch file
+				 */
+				/* I need to also do this even if it is NULL when it is a ROJ */
+
+				/*
+				 * need to load inner again if it is an inner or left outer
+				 * join and there are outer tuples in the batch OR
+				 */
+
+				/*
+				 * if it is a ROJ and there are inner tuples in the batch --
+				 * should never have no tuples in either batch...
+				 */
+
+				/*
+				 * if outer is not null or if it is a ROJ and inner is not
+				 * null, must rewind outer match status and load inner
+				 */
+				if (BufFileRewindIfExists(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]) != NULL ||
+					(node->hj_HashTable->innerBatchFile[node->hj_HashTable->curbatch] != NULL && HJ_FILL_INNER(node)))
+				{
+					BufFileRewindIfExists(node->hj_OuterMatchStatusesFile);
+					node->hj_OuterTupleCount = 0;
+					ExecHashJoinLoadInnerBatch(node);
+				}
+				break;
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
+
+				node->hj_OuterTupleCount = 0;
+				BufFileRewindIfExists(node->hj_OuterMatchStatusesFile);
+
+				/*
+				 * TODO: is it okay to use the hashtable to get the outer
+				 * batch file here?
+				 */
+				outerFileForAdaptiveRead = hashtable->outerBatchFile[hashtable->curbatch];
+				if (outerFileForAdaptiveRead == NULL)	/* TODO: could this
+														 * happen */
+				{
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+				BufFileRewindIfExists(outerFileForAdaptiveRead);
+
+				node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER;
+				/* fall through */
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+
+				outerFileForAdaptiveRead = hashtable->outerBatchFile[hashtable->curbatch];
+
+				while (true)
+				{
+					uint32		unmatchedOuterHashvalue;
+					TupleTableSlot *slot = ExecHashJoinGetSavedTuple(node,
+																	 outerFileForAdaptiveRead,
+																	 &unmatchedOuterHashvalue,
+																	 node->hj_OuterTupleSlot);
+
+					node->hj_OuterTupleCount++;
+
+					if (slot == NULL)
+					{
+						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						break;
+					}
+
+					unsigned char bit = (node->hj_OuterTupleCount - 1) % 8;
+
+					/* need to read the next byte */
+					if (bit == 0)
+						BufFileRead(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+
+					/* if the match bit is set for this tuple, continue */
+					if ((node->hj_OuterCurrentByte >> bit) & 1)
+						continue;
+
+					/* if it is not a match then emit it NULL-extended */
+					econtext->ecxt_outertuple = slot;
+					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				}
+				/* came here from HJ_NEED_NEW_BATCH, so go back there */
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				break;
+
+			default:
+				elog(ERROR, "unrecognized hashjoin state: %d",
+					 (int) node->hj_JoinState);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelHashJoin
+ *
+ *		Parallel-aware version.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *			/* return: a tuple or NULL */
+ExecParallelHashJoin(PlanState *pstate)
+{
+	HashJoinState *node = castNode(HashJoinState, pstate);
+	PlanState  *outerNode;
+	HashState  *hashNode;
+	ExprState  *joinqual;
+	ExprState  *otherqual;
+	ExprContext *econtext;
+	HashJoinTable hashtable;
+	TupleTableSlot *outerTupleSlot;
+	uint32		hashvalue;
+	int			batchno;
+	ParallelHashJoinState *parallel_state;
+
+	/*
+	 * get information from HashJoin node
+	 */
+	joinqual = node->js.joinqual;
+	otherqual = node->js.ps.qual;
+	hashNode = (HashState *) innerPlanState(node);
+	outerNode = outerPlanState(node);
+	hashtable = node->hj_HashTable;
+	econtext = node->js.ps.ps_ExprContext;
+	parallel_state = hashNode->parallel_state;
+
+	bool		advance_from_probing = false;
+
+	/*
+	 * Reset per-tuple memory context to free any expression evaluation
+	 * storage allocated in the previous tuple cycle.
+	 */
+	ResetExprContext(econtext);
+
+	/*
+	 * run the hash join state machine
+	 */
+	for (;;)
+	{
+		SharedTuplestoreAccessor *outer_acc;
+
+		/*
+		 * It's possible to iterate this loop many times before returning a
+		 * tuple, in some pathological cases such as needing to move much of
+		 * the current batch to a later batch.  So let's check for interrupts
+		 * each time through.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		switch (node->hj_JoinState)
+		{
+			case HJ_BUILD_HASHTABLE:
+
+				/*
+				 * First time through: build hash table for inner relation.
+				 */
+				Assert(hashtable == NULL);
+				/* volatile int mybp = 0; while (mybp == 0); */
+
+				/*
+				 * The empty-outer optimization is not implemented for shared
+				 * hash tables, because no one participant can determine that
+				 * there are no outer tuples, and it's not yet clear that it's
+				 * worth the synchronization overhead of reaching consensus to
+				 * figure that out.  So we have to build the hash table.
+				 */
+				node->hj_FirstOuterTupleSlot = NULL;
+
 				/*
 				 * Create the hash table.  If using Parallel Hash, then
 				 * whoever gets here first will create the hash table and any
 				 * later arrivals will merely attach to it.
 				 */
-				hashtable = ExecHashTableCreate(hashNode,
-												node->hj_HashOperators,
-												node->hj_Collations,
-												HJ_FILL_INNER(node));
-				node->hj_HashTable = hashtable;
+				node->hj_HashTable = hashtable = ExecHashTableCreate(hashNode,
+																	 node->hj_HashOperators,
+																	 node->hj_Collations,
+																	 HJ_FILL_INNER(node));
 
 				/*
 				 * Execute the Hash node, to build the hash table.  If using
@@ -311,66 +826,59 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_OuterNotEmpty = false;
 
-				if (parallel)
-				{
-					Barrier    *build_barrier;
-
-					build_barrier = &parallel_state->build_barrier;
-					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
-						   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
-					if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
-					{
-						/*
-						 * If multi-batch, we need to hash the outer relation
-						 * up front.
-						 */
-						if (hashtable->nbatch > 1)
-							ExecParallelHashJoinPartitionOuter(node);
-						BarrierArriveAndWait(build_barrier,
-											 WAIT_EVENT_HASH_BUILD_HASHING_OUTER);
-					}
-					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
-
-					/* Each backend should now select a batch to work on. */
-					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				Barrier    *build_barrier;
 
-					continue;
+				build_barrier = &parallel_state->build_barrier;
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
+					   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
+				if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
+				{
+					/*
+					 * If multi-batch, we need to hash the outer relation up
+					 * front.
+					 */
+					if (hashtable->nbatch > 1)
+						ExecParallelHashJoinPartitionOuter(node);
+					BarrierArriveAndWait(build_barrier,
+										 WAIT_EVENT_HASH_BUILD_HASHING_OUTER);
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
 
-				/* FALL THRU */
+				/* Each backend should now select a batch to work on. */
+				hashtable->curbatch = -1;
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
+
+				continue;
 
 			case HJ_NEED_NEW_OUTER:
 
 				/*
 				 * We don't have an outer tuple, try to get the next one
 				 */
-				if (parallel)
-					outerTupleSlot =
-						ExecParallelHashJoinOuterGetTuple(outerNode, node,
-														  &hashvalue);
-				else
-					outerTupleSlot =
-						ExecHashJoinOuterGetTuple(outerNode, node, &hashvalue);
+				outerTupleSlot =
+					ExecParallelHashJoinOuterGetTuple(outerNode, node,
+													  &hashvalue);
 
 				if (TupIsNull(outerTupleSlot))
 				{
-					/* end of batch, or maybe whole join */
+					/*
+					 * end of batch, or maybe whole join. for hashloop
+					 * fallback, all we know is outer batch is exhausted.
+					 * inner could have more chunks
+					 */
 					if (HJ_FILL_INNER(node))
 					{
 						/* set up to scan for unmatched inner tuples */
 						ExecPrepHashTableForUnmatched(node);
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+						break;
 					}
-					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					continue;
+					advance_from_probing = true;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					break;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -384,33 +892,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				node->hj_CurTuple = NULL;
 
 				/*
-				 * The tuple might not belong to the current batch (where
-				 * "current batch" includes the skew buckets if any).
+				 * for the hashloop fallback case, only initialize
+				 * hj_MatchedOuter to false during the first chunk. otherwise,
+				 * we will be resetting hj_MatchedOuter to false for an outer
+				 * tuple that has already matched an inner tuple. also,
+				 * hj_MatchedOuter should be set to false for batch 0. there
+				 * are no chunks for batch 0
 				 */
-				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
-				{
-					bool		shouldFree;
-					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
-																	  &shouldFree);
-
-					/*
-					 * Need to postpone this outer tuple to a later batch.
-					 * Save it in the corresponding outer-batch file.
-					 */
-					Assert(parallel_state == NULL);
-					Assert(batchno > hashtable->curbatch);
-					ExecHashJoinSaveTuple(mintuple, hashvalue,
-										  &hashtable->outerBatchFile[batchno]);
 
-					if (shouldFree)
-						heap_free_minimal_tuple(mintuple);
-
-					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
-					continue;
-				}
+				ParallelHashJoinBatch *phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
 
-				/* OK, let's scan the bucket for matches */
+				if (!phj_batch->parallel_hashloop_fallback || phj_batch->current_chunk_num == 1)
+					node->hj_MatchedOuter = false;
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
 				/* FALL THRU */
@@ -420,23 +913,25 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * Scan the selected hash bucket for matches to current outer
 				 */
-				if (parallel)
-				{
-					if (!ExecParallelScanHashBucket(node, econtext))
-					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
-					}
-				}
-				else
+				phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
+
+				if (!ExecParallelScanHashBucket(node, econtext))
 				{
-					if (!ExecScanHashBucket(node, econtext))
+					/*
+					 * The current outer tuple has run out of matches, so
+					 * check whether to emit a dummy outer-join tuple.
+					 * Whether we emit one or not, the next state is
+					 * NEED_NEW_OUTER.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					if (!phj_batch->parallel_hashloop_fallback)
 					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
+						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+
+						if (slot != NULL)
+							return slot;
 					}
+					continue;
 				}
 
 				/*
@@ -451,58 +946,48 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 * Only the joinquals determine tuple match status, but all
 				 * quals must pass to actually return the tuple.
 				 */
-				if (joinqual == NULL || ExecQual(joinqual, econtext))
+				if (joinqual != NULL && !ExecQual(joinqual, econtext))
 				{
-					node->hj_MatchedOuter = true;
-					HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
-
-					/* In an antijoin, we never return a matched tuple */
-					if (node->js.jointype == JOIN_ANTI)
-					{
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
-						continue;
-					}
+					InstrCountFiltered1(node, 1);
+					break;
+				}
 
-					/*
-					 * If we only need to join to the first matching inner
-					 * tuple, then consider returning this one, but after that
-					 * continue with next outer tuple.
-					 */
-					if (node->js.single_match)
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_MatchedOuter = true;
+				HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
 
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
+				/*
+				 * TODO: how does this interact with PAHJ -- do I need to set
+				 * matchbit?
+				 */
+				/* In an antijoin, we never return a matched tuple */
+				if (node->js.jointype == JOIN_ANTI)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					continue;
 				}
-				else
-					InstrCountFiltered1(node, 1);
-				break;
-
-			case HJ_FILL_OUTER_TUPLE:
 
 				/*
-				 * The current outer tuple has run out of matches, so check
-				 * whether to emit a dummy outer-join tuple.  Whether we emit
-				 * one or not, the next state is NEED_NEW_OUTER.
+				 * If we only need to join to the first matching inner tuple,
+				 * then consider returning this one, but after that continue
+				 * with next outer tuple.
 				 */
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				if (node->js.single_match)
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
-				if (!node->hj_MatchedOuter &&
-					HJ_FILL_OUTER(node))
+				/*
+				 * Set the match bit for this outer tuple in the match status
+				 * file
+				 */
+				if (phj_batch->parallel_hashloop_fallback)
 				{
-					/*
-					 * Generate a fake join tuple with nulls for the inner
-					 * tuple, and return it if it passes the non-join quals.
-					 */
-					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					sts_set_outer_match_status(hashtable->batches[hashtable->curbatch].outer_tuples,
+											   econtext->ecxt_outertuple->tuplenum);
 
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
 				}
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				else
+					InstrCountFiltered2(node, 1);
 				break;
 
 			case HJ_FILL_INNER_TUPLES:
@@ -515,7 +1000,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					advance_from_probing = true;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
 					continue;
 				}
 
@@ -533,22 +1019,108 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 			case HJ_NEED_NEW_BATCH:
 
+				phj_batch = hashtable->batches[hashtable->curbatch].shared;
+
 				/*
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
-				if (parallel)
+				if (!ExecParallelHashJoinNewBatch(node))
+					return NULL;	/* end of parallel-aware join */
+
+				if (node->last_worker
+					&& HJ_FILL_OUTER(node) && phj_batch->parallel_hashloop_fallback)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-aware join */
+					node->last_worker = false;
+					node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT;
+					break;
 				}
-				else
+				if (node->hj_HashTable->curbatch == 0)
 				{
-					if (!ExecHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-oblivious join */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					break;
 				}
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				advance_from_probing = false;
+				node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+				/* FALL THRU */
+
+			case HJ_NEED_NEW_INNER_CHUNK:
+
+				if (hashtable->curbatch == -1 || hashtable->curbatch == 0)
+
+					/*
+					 * If we're not attached to a batch at all then we need to
+					 * go to HJ_NEED_NEW_BATCH. Also batch 0 doesn't have more
+					 * than 1 chunk.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				else if (!ExecParallelHashJoinNewChunk(node, advance_from_probing))
+					/* If there's no next chunk then go to the next batch */
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				else
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
+
+				outer_acc = hashtable->batches[hashtable->curbatch].outer_tuples;
+				sts_reinitialize(outer_acc);
+				sts_begin_parallel_scan(outer_acc);
+
+				node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER;
+				/* FALL THRU */
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+
+				Assert(node->combined_bitmap != NULL);
+
+				outer_acc = node->hj_HashTable->batches[node->hj_HashTable->curbatch].outer_tuples;
+
+				MinimalTuple tuple;
+
+				do
+				{
+					tupleMetadata metadata;
+
+					if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
+						break;
+
+					int			bytenum = metadata.tupleid / 8;
+					unsigned char bit = metadata.tupleid % 8;
+					unsigned char byte_to_check = 0;
+
+					/* seek to byte to check */
+					if (BufFileSeek(node->combined_bitmap, 0, bytenum, SEEK_SET))
+						ereport(ERROR,
+								(errcode_for_file_access(),
+								 errmsg("could not rewind shared outer temporary file: %m")));
+					/* read byte containing ntuple bit */
+					if (BufFileRead(node->combined_bitmap, &byte_to_check, 1) == 0)
+						ereport(ERROR,
+								(errcode_for_file_access(),
+								 errmsg("could not read byte in outer match status bitmap: %m.")));
+					/* if bit is set */
+					bool		match = ((byte_to_check) >> bit) & 1;
+
+					if (!match)
+						break;
+				} while (1);
+
+				if (tuple == NULL)
+				{
+					sts_end_parallel_scan(outer_acc);
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/* Emit the unmatched tuple */
+				ExecForceStoreMinimalTuple(tuple,
+										   econtext->ecxt_outertuple,
+										   false);
+				econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+
+				return ExecProject(node->js.ps.ps_ProjInfo);
+
+
 			default:
 				elog(ERROR, "unrecognized hashjoin state: %d",
 					 (int) node->hj_JoinState);
@@ -556,38 +1128,6 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	}
 }
 
-/* ----------------------------------------------------------------
- *		ExecHashJoin
- *
- *		Parallel-oblivious version.
- * ----------------------------------------------------------------
- */
-static TupleTableSlot *			/* return: a tuple or NULL */
-ExecHashJoin(PlanState *pstate)
-{
-	/*
-	 * On sufficiently smart compilers this should be inlined with the
-	 * parallel-aware branches removed.
-	 */
-	return ExecHashJoinImpl(pstate, false);
-}
-
-/* ----------------------------------------------------------------
- *		ExecParallelHashJoin
- *
- *		Parallel-aware version.
- * ----------------------------------------------------------------
- */
-static TupleTableSlot *			/* return: a tuple or NULL */
-ExecParallelHashJoin(PlanState *pstate)
-{
-	/*
-	 * On sufficiently smart compilers this should be inlined with the
-	 * parallel-oblivious branches removed.
-	 */
-	return ExecHashJoinImpl(pstate, true);
-}
-
 /* ----------------------------------------------------------------
  *		ExecInitHashJoin
  *
@@ -622,6 +1162,18 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->js.ps.ExecProcNode = ExecHashJoin;
 	hjstate->js.jointype = node->join.jointype;
 
+	hjstate->hashloop_fallback = false;
+	hjstate->hj_InnerPageOffset = 0L;
+	hjstate->hj_InnerFirstChunk = false;
+	hjstate->hj_OuterCurrentByte = 0;
+
+	hjstate->hj_OuterMatchStatusesFile = NULL;
+	hjstate->hj_OuterTupleCount = 0;
+	hjstate->hj_InnerExhausted = false;
+
+	hjstate->last_worker = false;
+	hjstate->combined_bitmap = NULL;
+
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -773,6 +1325,30 @@ ExecEndHashJoin(HashJoinState *node)
 	ExecEndNode(innerPlanState(node));
 }
 
+
+static TupleTableSlot *
+emitUnmatchedOuterTuple(ExprState *otherqual, ExprContext *econtext, HashJoinState *hjstate)
+{
+	if (hjstate->hj_MatchedOuter)
+		return NULL;
+
+	if (!HJ_FILL_OUTER(hjstate))
+		return NULL;
+
+	econtext->ecxt_innertuple = hjstate->hj_NullInnerTupleSlot;
+
+	/*
+	 * Generate a fake join tuple with nulls for the inner tuple, and return
+	 * it if it passes the non-join quals.
+	 */
+
+	if (otherqual == NULL || ExecQual(otherqual, econtext))
+		return ExecProject(hjstate->js.ps.ps_ProjInfo);
+
+	InstrCountFiltered2(hjstate, 1);
+	return NULL;
+}
+
 /*
  * ExecHashJoinOuterGetTuple
  *
@@ -900,13 +1476,20 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	{
 		MinimalTuple tuple;
 
+		tupleMetadata metadata;
+		int			tupleid;
+
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+									   &metadata);
 		if (tuple != NULL)
 		{
+			/* where is this hashvalue being used? */
+			*hashvalue = metadata.hashvalue;
+			tupleid = metadata.tupleid;
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+			hjstate->hj_OuterTupleSlot->tuplenum = tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
@@ -919,20 +1502,17 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 }
 
 /*
- * ExecHashJoinNewBatch
+ * ExecHashJoinAdvanceBatch
  *		switch to a new hashjoin batch
  *
  * Returns true if successful, false if there are no more batches.
  */
 static bool
-ExecHashJoinNewBatch(HashJoinState *hjstate)
+ExecHashJoinAdvanceBatch(HashJoinState *hjstate)
 {
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
@@ -1007,10 +1587,36 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		curbatch++;
 	}
 
+	hjstate->hj_InnerPageOffset = 0L;
+	hjstate->hj_InnerFirstChunk = true;
+	hjstate->hashloop_fallback = false; /* new batch, so start it off false */
+	if (hjstate->hj_OuterMatchStatusesFile != NULL)
+		BufFileClose(hjstate->hj_OuterMatchStatusesFile);
+	hjstate->hj_OuterMatchStatusesFile = NULL;
 	if (curbatch >= nbatch)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	return true;
+}
+
+/*
+ * Returns true if there are more chunks left, false otherwise
+ */
+static bool
+ExecHashJoinLoadInnerBatch(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *innerFile;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+
+	off_t		tup_start_offset;
+	off_t		chunk_start_offset;
+	off_t		tup_end_offset;
+	int64		current_saved_size;
+	int			current_fileno;
 
 	/*
 	 * Reload the hash table with the new inner batch (which could be empty)
@@ -1019,171 +1625,60 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 
 	innerFile = hashtable->innerBatchFile[curbatch];
 
+	/* Reset this even if the innerfile is not null */
+	hjstate->hj_InnerFirstChunk = hjstate->hj_InnerPageOffset == 0L;
+
 	if (innerFile != NULL)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		/* TODO: should fileno always be 0? */
+		if (BufFileSeek(innerFile, 0, hjstate->hj_InnerPageOffset, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not rewind hash-join temporary file: %m")));
 
+		chunk_start_offset = hjstate->hj_InnerPageOffset;
+		tup_end_offset = hjstate->hj_InnerPageOffset;
 		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
 												 innerFile,
 												 &hashvalue,
 												 hjstate->hj_HashTupleSlot)))
 		{
+			/* next tuple's start is last tuple's end */
+			tup_start_offset = tup_end_offset;
+			/* after we got the tuple, figure out what the offset is */
+			BufFileTell(innerFile, &current_fileno, &tup_end_offset);
+			current_saved_size = tup_end_offset - chunk_start_offset;
+			if (current_saved_size > work_mem)
+			{
+				hjstate->hj_InnerPageOffset = tup_start_offset;
+				hjstate->hashloop_fallback = true;
+				return true;
+			}
+			hjstate->hj_InnerPageOffset = tup_end_offset;
+
 			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
+			 * NOTE: some tuples may be sent to future batches. With current
+			 * hashloop patch, however, it is not possible for
+			 * hashtable->nbatch to be increased here
 			 */
 			ExecHashTableInsert(hashtable, slot, hashvalue);
 		}
 
+		/* this is the end of the file */
+		hjstate->hj_InnerPageOffset = 0L;
+
 		/*
-		 * after we build the hash table, the inner batch file is no longer
+		 * after we processed all chunks, the inner batch file is no longer
 		 * needed
 		 */
 		BufFileClose(innerFile);
 		hashtable->innerBatchFile[curbatch] = NULL;
 	}
 
-	/*
-	 * Rewind outer batch file (if present), so that we can start reading it.
-	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
-	{
-		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
-	}
-
-	return true;
-}
-
-/*
- * Choose a batch to work on, and attach to it.  Returns true if successful,
- * false if there are no more batches.
- */
-static bool
-ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
-{
-	HashJoinTable hashtable = hjstate->hj_HashTable;
-	int			start_batchno;
-	int			batchno;
-
-	/*
-	 * If we started up so late that the batch tracking array has been freed
-	 * already by ExecHashTableDetach(), then we are finished.  See also
-	 * ExecParallelHashEnsureBatchAccessors().
-	 */
-	if (hashtable->batches == NULL)
-		return false;
-
-	/*
-	 * If we were already attached to a batch, remember not to bother checking
-	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
-	 */
-	if (hashtable->curbatch >= 0)
-	{
-		hashtable->batches[hashtable->curbatch].done = true;
-		ExecHashTableDetachBatch(hashtable);
-	}
-
-	/*
-	 * Search for a batch that isn't done.  We use an atomic counter to start
-	 * our search at a different batch in every participant when there are
-	 * more batches than participants.
-	 */
-	batchno = start_batchno =
-		pg_atomic_fetch_add_u32(&hashtable->parallel_state->distributor, 1) %
-		hashtable->nbatch;
-	do
-	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
-		if (!hashtable->batches[batchno].done)
-		{
-			SharedTuplestoreAccessor *inner_tuples;
-			Barrier    *batch_barrier =
-			&hashtable->batches[batchno].shared->batch_barrier;
-
-			switch (BarrierAttach(batch_barrier))
-			{
-				case PHJ_BATCH_ELECTING:
-
-					/* One backend allocates the hash table. */
-					if (BarrierArriveAndWait(batch_barrier,
-											 WAIT_EVENT_HASH_BATCH_ELECTING))
-						ExecParallelHashTableAlloc(hashtable, batchno);
-					/* Fall through. */
-
-				case PHJ_BATCH_ALLOCATING:
-					/* Wait for allocation to complete. */
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_ALLOCATING);
-					/* Fall through. */
-
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOADING);
-					/* Fall through. */
-
-				case PHJ_BATCH_PROBING:
-
-					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
-					 */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
-					return true;
-
-				case PHJ_BATCH_DONE:
-
-					/*
-					 * Already done.  Detach and go around again (if any
-					 * remain).
-					 */
-					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
-					hashtable->curbatch = -1;
-					break;
-
-				default:
-					elog(ERROR, "unexpected batch phase %d",
-						 BarrierPhase(batch_barrier));
-			}
-		}
-		batchno = (batchno + 1) % hashtable->nbatch;
-	} while (batchno != start_batchno);
-
 	return false;
 }
 
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1377,6 +1872,8 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
+
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1394,8 +1891,11 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 
 			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
 									  &batchno);
-			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+			metadata.hashvalue = hashvalue;
+			SharedTuplestoreAccessor *accessor = hashtable->batches[batchno].outer_tuples;
+
+			metadata.tupleid = sts_increment_tuplenum(accessor);
+			sts_puttuple(accessor, &metadata, mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
@@ -1444,6 +1944,7 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	 * and space_allowed.
 	 */
 	pstate->nbatch = 0;
+	pstate->batch_increases = 0;
 	pstate->space_allowed = 0;
 	pstate->batches = InvalidDsaPointer;
 	pstate->old_batches = InvalidDsaPointer;
@@ -1483,7 +1984,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *cxt)
 	/*
 	 * It would be possible to reuse the shared hash table in single-batch
 	 * cases by resetting and then fast-forwarding build_barrier to
-	 * PHJ_BUILD_DONE and batch 0's batch_barrier to PHJ_BATCH_PROBING, but
+	 * PHJ_BUILD_DONE and batch 0's batch_barrier to PHJ_BATCH_CHUNKING, but
 	 * currently shared hash tables are already freed by now (by the last
 	 * participant to detach from the batch).  We could consider keeping it
 	 * around for single-batch joins.  We'd also need to adjust
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 51c486bebdb9..e582365e8409 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3767,6 +3767,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_LOADING:
 			event_name = "Hash/Batch/Loading";
 			break;
+		case WAIT_EVENT_HASH_BATCH_PROBING:
+			event_name = "Hash/Batch/Probing";
+			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATING:
 			event_name = "Hash/Build/Allocating";
 			break;
@@ -3779,6 +3782,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
 			event_name = "Hash/Build/HashingOuter";
 			break;
+		case WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES:
+			event_name = "Hash/Build/CreateOuterMatchStatusBitmapFiles";
+			break;
 		case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
 			event_name = "Hash/GrowBatches/Allocating";
 			break;
@@ -3803,6 +3809,21 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
 			event_name = "Hash/GrowBuckets/Reinserting";
 			break;
+		case WAIT_EVENT_HASH_CHUNK_ELECTING:
+			event_name = "Hash/Chunk/Electing";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_LOADING:
+			event_name = "Hash/Chunk/Loading";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_PROBING:
+			event_name = "Hash/Chunk/Probing";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_DONE:
+			event_name = "Hash/Chunk/Done";
+			break;
+		case WAIT_EVENT_HASH_ADVANCE_CHUNK:
+			event_name = "Hash/Chunk/Final";
+			break;
 		case WAIT_EVENT_LOGICAL_SYNC_DATA:
 			event_name = "LogicalSyncData";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 35e8f12e62da..cb49329d3fb1 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -269,6 +269,57 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
 	return file;
 }
 
+/*
+ * Open a shared file created by any backend if it exists, otherwise return NULL
+ */
+BufFile *
+BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name)
+{
+	BufFile    *file;
+	char		segment_name[MAXPGPATH];
+	Size		capacity = 16;
+	File	   *files;
+	int			nfiles = 0;
+
+	files = palloc(sizeof(File) * capacity);
+
+	/*
+	 * We don't know how many segments there are, so we'll probe the
+	 * filesystem to find out.
+	 */
+	for (;;)
+	{
+		/* See if we need to expand our file segment array. */
+		if (nfiles + 1 > capacity)
+		{
+			capacity *= 2;
+			files = repalloc(files, sizeof(File) * capacity);
+		}
+		/* Try to load a segment. */
+		SharedSegmentName(segment_name, name, nfiles);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		if (files[nfiles] <= 0)
+			break;
+		++nfiles;
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	/*
+	 * If we didn't find any files at all, then no BufFile exists with this
+	 * name.
+	 */
+	if (nfiles == 0)
+		return NULL;
+	file = makeBufFileCommon(nfiles);
+	file->files = files;
+	file->readOnly = true;		/* Can't write to files opened this way */
+	file->fileset = fileset;
+	file->name = pstrdup(name);
+
+	return file;
+}
+
 /*
  * Open a file that was previously created in another backend (or this one)
  * with BufFileCreateShared in the same SharedFileSet using the same name.
@@ -843,3 +894,17 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+BufFile *
+BufFileRewindIfExists(BufFile *bufFile)
+{
+	if (bufFile != NULL)
+	{
+		if (BufFileSeek(bufFile, 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
+		return bufFile;
+	}
+	return NULL;
+}
diff --git a/src/backend/storage/ipc/barrier.c b/src/backend/storage/ipc/barrier.c
index 3e200e02cc22..58455dda1cb8 100644
--- a/src/backend/storage/ipc/barrier.c
+++ b/src/backend/storage/ipc/barrier.c
@@ -195,6 +195,91 @@ BarrierArriveAndWait(Barrier *barrier, uint32 wait_event_info)
 	return elected;
 }
 
+/*
+ * Arrive at this barrier, wait for all other attached participants to arrive
+ * too and then return.  Sets the current phase to next_phase.  The caller must
+ * be attached.
+ *
+ * While waiting, pg_stat_activity shows a wait_event_type and wait_event
+ * controlled by the wait_event_info passed in, which should be a value from
+ * one of the WaitEventXXX enums defined in pgstat.h.
+ *
+ * Return true in one arbitrarily chosen participant.  Return false in all
+ * others.  The return code can be used to elect one participant to execute a
+ * phase of work that must be done serially while other participants wait.
+ */
+bool
+BarrierArriveExplicitAndWait(Barrier *barrier, int next_phase, uint32 wait_event_info)
+{
+	bool		release = false;
+	bool		elected;
+	int			start_phase;
+
+	SpinLockAcquire(&barrier->mutex);
+	start_phase = barrier->phase;
+	++barrier->arrived;
+	if (barrier->arrived == barrier->participants)
+	{
+		release = true;
+		barrier->arrived = 0;
+		barrier->phase = next_phase;
+		barrier->elected = next_phase;
+	}
+	SpinLockRelease(&barrier->mutex);
+
+	/*
+	 * If we were the last expected participant to arrive, we can release our
+	 * peers and return true to indicate that this backend has been elected to
+	 * perform any serial work.
+	 */
+	if (release)
+	{
+		ConditionVariableBroadcast(&barrier->condition_variable);
+
+		return true;
+	}
+
+	/*
+	 * Otherwise we have to wait for the last participant to arrive and
+	 * advance the phase.
+	 */
+	elected = false;
+	ConditionVariablePrepareToSleep(&barrier->condition_variable);
+	for (;;)
+	{
+		/*
+		 * We know that phase must either be start_phase, indicating that we
+		 * need to keep waiting, or next_phase, indicating that the last
+		 * participant that we were waiting for has either arrived or detached
+		 * so that the next phase has begun.  The phase cannot advance any
+		 * further than that without this backend's participation, because
+		 * this backend is attached.
+		 */
+		SpinLockAcquire(&barrier->mutex);
+		Assert(barrier->phase == start_phase || barrier->phase == next_phase);
+		release = barrier->phase == next_phase;
+		if (release && barrier->elected != next_phase)
+		{
+			/*
+			 * Usually the backend that arrives last and releases the other
+			 * backends is elected to return true (see above), so that it can
+			 * begin processing serial work while it has a CPU timeslice.
+			 * However, if the barrier advanced because someone detached, then
+			 * one of the backends that is awoken will need to be elected.
+			 */
+			barrier->elected = barrier->phase;
+			elected = true;
+		}
+		SpinLockRelease(&barrier->mutex);
+		if (release)
+			break;
+		ConditionVariableSleep(&barrier->condition_variable, wait_event_info);
+	}
+	ConditionVariableCancelSleep();
+
+	return elected;
+}
+
 /*
  * Arrive at this barrier, but detach rather than waiting.  Returns true if
  * the caller was the last to detach.
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index c3ab494a45ea..3cd2ec2e2eb6 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -60,6 +60,8 @@ typedef struct SharedTuplestoreParticipant
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;
+			  //TODO:does this belong elsewhere
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -92,10 +94,15 @@ struct SharedTuplestoreAccessor
 	BlockNumber write_page;		/* The next page to write to. */
 	char	   *write_pointer;	/* Current write pointer within chunk. */
 	char	   *write_end;		/* One past the end of the current chunk. */
+
+	/* Bitmap of matched outer tuples (currently only used for hashjoin). */
+	BufFile    *outer_match_status_file;
 };
 
 static void sts_filename(char *name, SharedTuplestoreAccessor *accessor,
 						 int participant);
+static void
+			sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant);
 
 /*
  * Return the amount of shared memory required to hold SharedTuplestore for a
@@ -137,6 +144,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -166,6 +174,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	accessor->sts = sts;
 	accessor->fileset = fileset;
 	accessor->context = CurrentMemoryContext;
+	accessor->outer_match_status_file = NULL;
 
 	return accessor;
 }
@@ -343,6 +352,7 @@ sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
 			sts_flush_chunk(accessor);
 		}
 
+		/* TODO: exercise this code with a test (over-sized tuple) */
 		/* It may still not be enough in the case of a gigantic tuple. */
 		if (accessor->write_pointer + size >= accessor->write_end)
 		{
@@ -621,6 +631,129 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+/*  TODO: fix signedness */
+int
+sts_increment_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+void
+sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor)
+{
+	uint32		tuplenum = pg_atomic_read_u32(&accessor->sts->ntuples);
+
+	/* don't make the outer match status file if there are no tuples */
+	if (tuplenum == 0)
+		return;
+
+	char		name[MAXPGPATH];
+
+	sts_bitmap_filename(name, accessor, accessor->participant);
+
+	accessor->outer_match_status_file = BufFileCreateShared(accessor->fileset, name);
+
+	/* TODO: check this math. tuplenumber will be too high. */
+	uint32		num_to_write = tuplenum / 8 + 1;
+
+	unsigned char byteToWrite = 0;
+
+	BufFileWrite(accessor->outer_match_status_file, &byteToWrite, num_to_write);
+
+	if (BufFileSeek(accessor->outer_match_status_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+void
+sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum)
+{
+	BufFile    *parallel_outer_matchstatuses = accessor->outer_match_status_file;
+	unsigned char current_outer_byte;
+
+	BufFileSeek(parallel_outer_matchstatuses, 0, tuplenum / 8, SEEK_SET);
+	BufFileRead(parallel_outer_matchstatuses, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (tuplenum % 8);
+
+	if (BufFileSeek(parallel_outer_matchstatuses, 0, -1, SEEK_CUR) != 0)
+		elog(ERROR, "there is a problem with outer match status file. pid %i.", MyProcPid);
+	BufFileWrite(parallel_outer_matchstatuses, &current_outer_byte, 1);
+}
+
+void
+sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor)
+{
+	BufFileClose(accessor->outer_match_status_file);
+}
+
+BufFile *
+sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor)
+{
+	/* TODO: this tries to close an outer match status file for */
+	/* each participant in the tuplestore. technically, only participants */
+	/* in the barrier could have outer match status files, however, */
+	/* all but one participant continue on and detach from the barrier */
+	/* so we won't have a reliable way to close only files for those attached */
+	/* to the barrier */
+	BufFile   **statuses = palloc(sizeof(BufFile *) * accessor->sts->nparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	int			statuses_length = 0;
+
+	for (int i = 0; i < accessor->sts->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+
+		sts_bitmap_filename(bitmap_filename, accessor, i);
+		BufFile    *file = BufFileOpenSharedIfExists(accessor->fileset, bitmap_filename);
+
+		if (file != NULL)
+			statuses[statuses_length++] = file;
+	}
+
+	BufFile    *combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)
+		//make it while not
+			EOF
+		{
+			unsigned char combined_byte = 0;
+
+			for (int i = 0; i < statuses_length; i++)
+			{
+				unsigned char read_byte;
+
+				BufFileRead(statuses[i], &read_byte, 1);
+				combined_byte |= read_byte;
+			}
+
+			BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+		}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	return combined_bitmap_file;
+}
+
+
+static void
+sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant)
+{
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->sts->name, participant);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/executor/adaptiveHashjoin.h b/src/include/executor/adaptiveHashjoin.h
new file mode 100644
index 000000000000..030a04c5c005
--- /dev/null
+++ b/src/include/executor/adaptiveHashjoin.h
@@ -0,0 +1,9 @@
+#ifndef ADAPTIVE_HASHJOIN_H
+#define ADAPTIVE_HASHJOIN_H
+
+
+extern bool ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing);
+extern bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+
+
+#endif							/* ADAPTIVE_HASHJOIN_H */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 79b634e8ed10..3e4f4bd5747a 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -148,11 +148,27 @@ typedef struct HashMemoryChunkData *HashMemoryChunk;
  * followed by variable-sized objects, they are arranged in contiguous memory
  * but not accessed directly as an array.
  */
+/*  TODO: maybe remove lock from ParallelHashJoinBatch and use pstate->lock */
+/*  and the PHJBatchAccessor to coordinate access to the PHJ batch similar to */
+/*  other users of that lock */
 typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
 
+	/* Parallel Adaptive Hash Join members */
+
+	/*
+	 * after finishing build phase, parallel_hashloop_fallback cannot change,
+	 * and does not require a lock to read
+	 */
+	bool		parallel_hashloop_fallback;
+	int			total_num_chunks;
+	int			current_chunk_num;
+	size_t		estimated_chunk_size;
+	Barrier		chunk_barrier;
+	LWLock		lock;
+
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
 	size_t		estimated_size; /* size of buckets + chunks while writing */
@@ -243,6 +259,8 @@ typedef struct ParallelHashJoinState
 	int			nparticipants;
 	size_t		space_allowed;
 	size_t		total_tuples;	/* total number of inner tuples */
+	int			batch_increases;	/* TODO: make this an atomic so I don't
+									 * need the lock to increment it? */
 	LWLock		lock;			/* lock protecting the above */
 
 	Barrier		build_barrier;	/* synchronization for the build phases */
@@ -263,10 +281,16 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
+#define PHJ_BATCH_CHUNKING				2
+#define PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING 3
 #define PHJ_BATCH_DONE					4
 
+#define PHJ_CHUNK_ELECTING				0
+#define PHJ_CHUNK_LOADING				1
+#define PHJ_CHUNK_PROBING				2
+#define PHJ_CHUNK_DONE					3
+#define PHJ_CHUNK_FINAL					4
+
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
 #define PHJ_GROW_BATCHES_ALLOCATING		1
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 1336fde6b4d5..dfc221e6a111 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -40,9 +40,8 @@ extern void ExecHashTableInsert(HashJoinTable hashtable,
 extern void ExecParallelHashTableInsert(HashJoinTable hashtable,
 										TupleTableSlot *slot,
 										uint32 hashvalue);
-extern void ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
-													TupleTableSlot *slot,
-													uint32 hashvalue);
+extern void
+			ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable, TupleTableSlot *slot, uint32 hashvalue);
 extern bool ExecHashGetHashValue(HashJoinTable hashtable,
 								 ExprContext *econtext,
 								 List *hashkeys,
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5abd5..9497b10972b3 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tuplenum;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,7 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
-
+	slot->tuplenum = 0;
 	return slot;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1f6f5bbc2075..b4f5f0357cb7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -14,6 +14,7 @@
 #ifndef EXECNODES_H
 #define EXECNODES_H
 
+#include <storage/buffile.h>
 #include "access/tupconvert.h"
 #include "executor/instrument.h"
 #include "fmgr.h"
@@ -1951,6 +1952,22 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+
+	/* hashloop fallback */
+	bool		hashloop_fallback;
+	/* hashloop fallback inner side */
+	bool		hj_InnerFirstChunk;
+	bool		hj_InnerExhausted;
+	off_t		hj_InnerPageOffset;
+
+	/* hashloop fallback outer side */
+	unsigned char hj_OuterCurrentByte;
+	BufFile    *hj_OuterMatchStatusesFile;	/* serial AHJ */
+	int64		hj_OuterTupleCount;
+
+	/* parallel hashloop fallback outer side */
+	bool		last_worker;
+	BufFile    *combined_bitmap;
 } HashJoinState;
 
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aecb6013f00d..340086a7e77c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -815,6 +815,7 @@ typedef enum
  * it is waiting for a notification from another process.
  * ----------
  */
+/*  TODO: add WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES? */
 typedef enum
 {
 	WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
@@ -827,10 +828,12 @@ typedef enum
 	WAIT_EVENT_HASH_BATCH_ALLOCATING,
 	WAIT_EVENT_HASH_BATCH_ELECTING,
 	WAIT_EVENT_HASH_BATCH_LOADING,
+	WAIT_EVENT_HASH_BATCH_PROBING,
 	WAIT_EVENT_HASH_BUILD_ALLOCATING,
 	WAIT_EVENT_HASH_BUILD_ELECTING,
 	WAIT_EVENT_HASH_BUILD_HASHING_INNER,
 	WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+	WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES,
 	WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
 	WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
@@ -839,6 +842,11 @@ typedef enum
 	WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+	WAIT_EVENT_HASH_CHUNK_ELECTING,
+	WAIT_EVENT_HASH_CHUNK_LOADING,
+	WAIT_EVENT_HASH_CHUNK_PROBING,
+	WAIT_EVENT_HASH_CHUNK_DONE,
+	WAIT_EVENT_HASH_ADVANCE_CHUNK,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
 	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
 	WAIT_EVENT_MQ_INTERNAL,
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index d71927cc2f7a..a3c867024c2b 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -36,6 +36,7 @@ typedef struct Barrier
 
 extern void BarrierInit(Barrier *barrier, int num_workers);
 extern bool BarrierArriveAndWait(Barrier *barrier, uint32 wait_event_info);
+extern bool BarrierArriveExplicitAndWait(Barrier *barrier, int next_phase, uint32 wait_event_info);
 extern bool BarrierArriveAndDetach(Barrier *barrier);
 extern int	BarrierAttach(Barrier *barrier);
 extern bool BarrierDetach(Barrier *barrier);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 60433f35b454..f790f7e12186 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,10 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
+extern BufFile *BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name);
 extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
 
+extern BufFile *BufFileRewindIfExists(BufFile *bufFile);
+
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f7836..793f660eb428 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -212,6 +212,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_LOCK_MANAGER,
 	LWTRANCHE_PREDICATE_LOCK_MANAGER,
 	LWTRANCHE_PARALLEL_HASH_JOIN,
+	LWTRANCHE_PARALLEL_HASH_JOIN_BATCH,
 	LWTRANCHE_PARALLEL_QUERY_DSA,
 	LWTRANCHE_SESSION_DSA,
 	LWTRANCHE_SESSION_RECORD_TABLE,
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9754504cc536..6152ac163da2 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,19 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+
+/*  TODO: conflicting types for tupleid with accessor->sts->ntuples (uint32) */
+/*  TODO: use a union for tupleid (uint32) (make this a uint64) and chunk number (int) */
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	int			tupleid;		/* tuple id on outer side and chunk number for
+								 * inner side */
+}			__attribute__((packed));
+
+/*  TODO: make sure I can get rid of packed now that using sizeof(struct) */
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -58,4 +71,13 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+
+extern int	sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
+
+extern void sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor);
+extern void sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum);
+extern void sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor);
+extern BufFile *sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor);
+
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/adaptive_hj.out b/src/test/regress/expected/adaptive_hj.out
new file mode 100644
index 000000000000..fe24acd2550e
--- /dev/null
+++ b/src/test/regress/expected/adaptive_hj.out
@@ -0,0 +1,1233 @@
+-- TODO: remove some of these tests and make the test file faster
+create schema adaptive_hj;
+set search_path=adaptive_hj;
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t1(a int);
+create table t2(b int);
+-- serial setup
+set work_mem=64;
+set enable_mergejoin to off;
+-- TODO: make this function general
+create or replace function explain_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Serial_Test_1 reset
+-- TODO: refactor into procedure or change to drop table
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_1 setup
+truncate table t1;
+insert into t1 values(1),(2);
+insert into t1 select i from generate_series(1,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+truncate table t2;
+insert into t2 values(2),(3),(11);
+insert into t2 select i from generate_series(2,10)i;
+insert into t2 select 2 from generate_series(2,7)i;
+-- Serial_Test_1.1
+-- TODO: automate the checking for expected number of chunks (explain option?)
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 falls back with 2 chunks with 2 unmatched tuples emitted at EOB 
+-- batch 3 falls back with 5 chunks with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=67 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=18 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=18 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+  1 |   
+  1 |   
+(67 rows)
+
+select * from t1, t2 where a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+(65 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+(66 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+  1 |   
+  1 |   
+(68 rows)
+
+-- Serial_Test_1.2 setup
+analyze t1; analyze t2;
+-- Serial_Test_1.2
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Right Join (actual rows=67 loops=1)
+         Hash Cond: (t2.b = t1.a)
+         ->  Seq Scan on t2 (actual rows=18 loops=1)
+         ->  Hash (actual rows=17 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t1 (actual rows=17 loops=1)
+(7 rows)
+
+-- Serial_Test_2 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_2 setup:
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+-- Serial_Test_2.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 does not fall back with 1 unmatched tuple
+-- batch 3 does not fall back with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=7 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=4 loops=1)
+         ->  Hash (actual rows=5 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=5 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 1 |  
+(7 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+(7 rows)
+
+-- TODO: check coverage for emitting ummatched inner tuples
+-- Serial_Test_2.1.a
+-- results checking for inner join
+select * from t1 left outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 1 |  
+(7 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+(6 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+(7 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+ 1 |  
+(8 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+(6 rows)
+
+-- Serial_Test_2.2
+analyze t1; analyze t2;
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Right Join (actual rows=7 loops=1)
+         Hash Cond: (t2.b = t1.a)
+         ->  Seq Scan on t2 (actual rows=5 loops=1)
+         ->  Hash (actual rows=4 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t1 (actual rows=4 loops=1)
+(7 rows)
+
+-- Serial_Test_3 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_3 setup:
+truncate table t1;
+insert into t1 values(1),(1);
+insert into t1 select 2 from generate_series(1,7)i;
+insert into t1 select i from generate_series(3,10)i;
+truncate table t2;
+insert into t2 select 2 from generate_series(1,7)i;
+insert into t2 values(3),(3);
+insert into t2 select i from generate_series(5,9)i;
+-- Serial_Test_3.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with 1 unmatched tuple
+-- batch 2 does not fall back with 2 unmatched tuples
+-- batch 3 falls back with 4 chunks with 1 unmatched tuple
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=60 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=14 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=14 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a  | b 
+----+---
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  3 | 3
+  3 | 3
+  5 | 5
+  6 | 6
+  7 | 7
+  8 | 8
+  9 | 9
+  1 |  
+  1 |  
+  4 |  
+ 10 |  
+(60 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a  | b 
+----+---
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  3 | 3
+  3 | 3
+  5 | 5
+  6 | 6
+  7 | 7
+  8 | 8
+  9 | 9
+  1 |  
+  1 |  
+  4 |  
+ 10 |  
+(60 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+-- Serial_Test_3.2 
+-- swap join order
+select * from t2 left outer join t1 on a = b order by a, b;
+ b | a 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t2, t1 where a = b order by a;
+ b | a 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t2 right outer join t1 on a = b order by b, a;
+ b | a  
+---+----
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 3 |  3
+ 3 |  3
+ 5 |  5
+ 6 |  6
+ 7 |  7
+ 8 |  8
+ 9 |  9
+   |  1
+   |  1
+   |  4
+   | 10
+(60 rows)
+
+select * from t2 full outer join t1 on a = b order by a, b;
+ b | a  
+---+----
+   |  1
+   |  1
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 3 |  3
+ 3 |  3
+   |  4
+ 5 |  5
+ 6 |  6
+ 7 |  7
+ 8 |  8
+ 9 |  9
+   | 10
+(60 rows)
+
+-- Serial_Test_3.3 setup
+analyze t1; analyze t2;
+-- Serial_Test_3.3
+-- doesn't spill
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=60 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=14 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=14 loops=1)
+(7 rows)
+
+-- Serial_Test_4 setup
+drop table t1;
+create table t1(b int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+drop table t2;
+create table t2(a int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+-- Serial_Test_4.1
+-- spills in 32 batches
+--batch 0 does not fall back with 1 unmatched outer tuple (15)
+--batch 1 falls back with 396 chunks.
+--batch 2 falls back with 402 chunks with 1 unmatched outer tuple (1)
+--batch 3 falls back with 389 chunks with 1 unmatched outer tuple (8)
+--batch 4 falls back with 409 chunks with no unmatched outer tuples
+--batch 5 falls back with 366 chunks with 1 unmatched outer tuple (4)
+--batch 6 falls back with 407 chunks with 1 unmatched outer tuple (11)
+--batch 7 falls back with 382 chunks with unmatched outer tuple (10)
+--batch 8 falls back with 413 chunks with no unmatched outer tuples
+--batch 9 falls back with 371 chunks with 1 unmatched outer tuple (3)
+--batch 10 falls back with 389 chunks with no unmatched outer tuples
+--batch 11 falls back with 408 chunks with no unmatched outer tuples
+--batch 12 falls back with 387 chunks with no unmatched outer tuples
+--batch 13 falls back with 402 chunks with 1 unmatched outer tuple (18) 
+--batch 14 falls back with 369 chunks with 1 unmatched outer tuple (9)
+--batch 15 falls back with 387 chunks with no unmatched outer tuples
+--batch 16 falls back with 365 chunks with no unmatched outer tuples
+--batch 17 falls back with 403 chunks with 2 unmatched outer tuples (14,19)
+--batch 18 falls back with 375 chunks with no unmatched outer tuples
+--batch 19 falls back with 384 chunks with no unmatched outer tuples
+--batch 20 falls back with 377 chunks with 1 unmatched outer tuple (12)
+--batch 22 falls back with 401 chunks with no unmatched outer tuples
+--batch 23 falls back with 396 chunks with no unmatched outer tuples
+--batch 24 falls back with 387 chunks with 1 unmatched outer tuple (5)
+--batch 25 falls back with 399 chunks with 1 unmatched outer tuple (7)
+--batch 26 falls back with 387 chunks.
+--batch 27 falls back with 442 chunks.
+--batch 28 falls back with 385 chunks with 1 unmatched outer tuple (17)
+--batch 29 falls back with 375 chunks.
+--batch 30 falls back with 404 chunks with 1 unmatched outer tuple (6)
+--batch 31 falls back with 396 chunks with 2 unmatched outer tuples (13,16)
+select * from explain_multi_batch();
+                                     explain_multi_batch                                      
+----------------------------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=18210 loops=1)
+         Hash Cond: (t1.b = t2.a)
+         ->  Seq Scan on t1 (actual rows=291 loops=1)
+         ->  Hash (actual rows=25081 loops=1)
+               Buckets: 2048 (originally 1024)  Batches: 32 (originally 1)  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=25081 loops=1)
+(7 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+-- used to give wrong results because there is a whole batch of outer which is
+-- empty and so the inner doesn't emit unmatched tuples with ROJ
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 43081
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b; 
+ count 
+-------
+ 43099
+(1 row)
+
+-- Test_6 non-negligible amount of data test case
+-- TODO: doesn't finish with my code when it is set to be serial
+-- it does finish when it is parallel -- the serial version is either simply too
+-- slow or has a bug -- I tried it with less data and it did finish, so it must
+-- just be really slow
+-- inner join shouldn't even need to make the unmatched files
+-- it finishes eventually if I decrease data amount
+--drop table simple;
+--create table simple as
+ -- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+--alter table simple set (parallel_workers = 2);
+--analyze simple;
+--
+--drop table extremely_skewed;
+--create table extremely_skewed (id int, t text);
+--alter table extremely_skewed set (autovacuum_enabled = 'false');
+--alter table extremely_skewed set (parallel_workers = 2);
+--analyze extremely_skewed;
+--insert into extremely_skewed
+--  select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+--  from generate_series(1, 20000);
+--update pg_class
+--  set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+--  where relname = 'extremely_skewed';
+--set work_mem=64;
+--set enable_mergejoin to off;
+--explain (analyze, costs off, timing off)
+  --select * from simple r join extremely_skewed s using (id);
+--select * from explain_multi_batch();
+drop table t1;
+drop table t2;
+drop function explain_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema adaptive_hj;
diff --git a/src/test/regress/expected/parallel_adaptive_hj.out b/src/test/regress/expected/parallel_adaptive_hj.out
new file mode 100644
index 000000000000..e5e7f9aa4f50
--- /dev/null
+++ b/src/test/regress/expected/parallel_adaptive_hj.out
@@ -0,0 +1,343 @@
+create schema parallel_adaptive_hj;
+set search_path=parallel_adaptive_hj;
+-- TODO: anti-semi-join and semi-join tests
+-- TODO: check if test2 and 3 are different at all
+-- TODO: add test for parallel-oblivious parallel hash join
+-- TODO: make this function general
+create or replace function explain_parallel_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+-- parallel setup
+set enable_nestloop to off;
+set enable_mergejoin to off;
+set  min_parallel_table_scan_size = 0;
+set  parallel_setup_cost = 0;
+set  enable_parallel_hash = on;
+set  enable_hashjoin = on;
+set  max_parallel_workers_per_gather = 1;
+set  work_mem = 64;
+-- Parallel_Test_1 setup
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+analyze t1;
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t2(b int);
+insert into t2 select i from generate_series(4,2500)i;
+insert into t2 select 2 from generate_series(1,10)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+-- Parallel_Test_1.1
+-- spills in 4 batches
+-- 1 resize of nbatches
+-- no batch falls back
+select * from explain_parallel_multi_batch();
+                                      explain_parallel_multi_batch                                       
+---------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 4 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+-- need an aggregate to exercise the code but still want to know if we are
+-- emitting the right unmatched outer tuples
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+-- Parallel_Test_1.1.a
+-- results checking for inner join
+-- doesn't fall back
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+   198
+(1 row)
+
+-- Parallel_Test_1.1.b
+-- results checking for right outer join
+-- doesn't exercise the fallback code but just checking results
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+  2687
+(1 row)
+
+-- Parallel_Test_1.1.c
+-- results checking for full outer join
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+  2689
+(1 row)
+
+-- Parallel_Test_1.2
+-- spill and doesn't have to resize nbatches
+analyze t2;
+select * from explain_parallel_multi_batch();
+                           explain_parallel_multi_batch                           
+----------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 2048  Batches: 4  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+-- Parallel_Test_1.3
+-- doesn't spill
+-- does resize nbuckets
+set work_mem = '4MB';
+select * from explain_parallel_multi_batch();
+                           explain_parallel_multi_batch                           
+----------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 4096  Batches: 1  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+set work_mem = 64;
+-- Parallel_Test_3
+-- big example
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=9105 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=146 loops=2)
+                     ->  Parallel Hash (actual rows=12540 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 16 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=12540 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+-- TODO: check what each of these is exercising -- chunk num, etc and write that
+-- down
+-- also, note that this example did reveal with ROJ that it wasn't working, so
+-- maybe keep that but it is not parallel
+-- make sure the plans make sense for the code we are writing
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 43081
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+ 43099
+(1 row)
+
+-- Parallel_Test_4
+-- spill and resize nbatches 2x
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,1000)i;
+insert into t2 select 2 from generate_series(1,4000)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+insert into t1 values(500);
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=38006 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=15 loops=2)
+                     ->  Parallel Hash (actual rows=2498 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 16 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2498 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 76011
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 76009
+(1 row)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 76997
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+ 76999
+(1 row)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 76011
+(1 row)
+
+-- Parallel_Test_5
+-- revealed race condition because two workers are working on a chunked batch
+-- only 2 unmatched tuples
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i%1111 from generate_series(200,10000)i;
+delete from t2 where b = 115;
+delete from t2 where b = 200;
+insert into t2 select 2 from generate_series(1,4000);
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 values(115);
+insert into t1 values(200);
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=363166 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=146 loops=2)
+                     ->  Parallel Hash (actual rows=6892 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 32 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=6892 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count  
+--------
+ 726331
+(1 row)
+
+-- without count(*), can't reproduce desired plan so can't rely on results
+select count(*) from t1 left outer join t2 on a = b;
+ count  
+--------
+ 726331
+(1 row)
+
+drop table if exists t1;
+drop table if exists t2;
+drop function explain_parallel_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema parallel_adaptive_hj;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d33a4e143dce..0afd6db491bd 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 adaptive_hj parallel_adaptive_hj
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/post_schedule b/src/test/regress/post_schedule
new file mode 100644
index 000000000000..7824ecf7bfab
--- /dev/null
+++ b/src/test/regress/post_schedule
@@ -0,0 +1,8 @@
+test: object_address
+test: tablesample
+test: groupingsets
+test: drop_operator
+test: password
+test: identity
+test: generated
+test: join_hash
diff --git a/src/test/regress/pre_schedule b/src/test/regress/pre_schedule
new file mode 100644
index 000000000000..4105b0fa0310
--- /dev/null
+++ b/src/test/regress/pre_schedule
@@ -0,0 +1,120 @@
+# src/test/regress/serial_schedule
+# This should probably be in an order similar to parallel_schedule.
+test: tablespace
+test: boolean
+test: char
+test: name
+test: varchar
+test: text
+test: int2
+test: int4
+test: int8
+test: oid
+test: float4
+test: float8
+test: bit
+test: numeric
+test: txid
+test: uuid
+test: enum
+test: money
+test: rangetypes
+test: pg_lsn
+test: regproc
+test: strings
+test: numerology
+test: point
+test: lseg
+test: line
+test: box
+test: path
+test: polygon
+test: circle
+test: date
+test: time
+test: timetz
+test: timestamp
+test: timestamptz
+test: interval
+test: inet
+test: macaddr
+test: macaddr8
+test: tstypes
+test: geometry
+test: horology
+test: regex
+test: oidjoins
+test: type_sanity
+test: opr_sanity
+test: misc_sanity
+test: comments
+test: expressions
+test: create_function_1
+test: create_type
+test: create_table
+test: create_function_2
+test: copy
+test: copyselect
+test: copydml
+test: insert
+test: insert_conflict
+test: create_misc
+test: create_operator
+test: create_procedure
+test: create_index
+test: create_index_spgist
+test: create_view
+test: index_including
+test: index_including_gist
+test: create_aggregate
+test: create_function_3
+test: create_cast
+test: constraints
+test: triggers
+test: select
+test: inherit
+test: typed_table
+test: vacuum
+test: drop_if_exists
+test: updatable_views
+test: roleattributes
+test: create_am
+test: hash_func
+test: errors
+test: sanity_check
+test: select_into
+test: select_distinct
+test: select_distinct_on
+test: select_implicit
+test: select_having
+test: subselect
+test: union
+test: case
+test: join
+test: adaptive_hj
+test: parallel_adaptive_hj
+test: aggregates
+test: transactions
+ignore: random
+test: random
+test: portals
+test: arrays
+test: btree_index
+test: hash_index
+test: update
+test: delete
+test: namespace
+test: prepared_xacts
+test: brin
+test: gin
+test: gist
+test: spgist
+test: privileges
+test: init_privs
+test: security_label
+test: collate
+test: matview
+test: lock
+test: replica_identity
+test: rowsecurity
+
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index f86f5c568252..0dc0967a93ac 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -91,6 +91,8 @@ test: subselect
 test: union
 test: case
 test: join
+test: adaptive_hj
+test: parallel_adaptive_hj
 test: aggregates
 test: transactions
 ignore: random
diff --git a/src/test/regress/sql/adaptive_hj.sql b/src/test/regress/sql/adaptive_hj.sql
new file mode 100644
index 000000000000..a5af798ea856
--- /dev/null
+++ b/src/test/regress/sql/adaptive_hj.sql
@@ -0,0 +1,240 @@
+-- TODO: remove some of these tests and make the test file faster
+create schema adaptive_hj;
+set search_path=adaptive_hj;
+drop table if exists t1;
+drop table if exists t2;
+create table t1(a int);
+create table t2(b int);
+
+-- serial setup
+set work_mem=64;
+set enable_mergejoin to off;
+-- TODO: make this function general
+create or replace function explain_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Serial_Test_1 reset
+-- TODO: refactor into procedure or change to drop table
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+-- Serial_Test_1 setup
+truncate table t1;
+insert into t1 values(1),(2);
+insert into t1 select i from generate_series(1,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+truncate table t2;
+insert into t2 values(2),(3),(11);
+insert into t2 select i from generate_series(2,10)i;
+insert into t2 select 2 from generate_series(2,7)i;
+
+-- Serial_Test_1.1
+-- TODO: automate the checking for expected number of chunks (explain option?)
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 falls back with 2 chunks with 2 unmatched tuples emitted at EOB 
+-- batch 3 falls back with 5 chunks with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+
+-- Serial_Test_1.2 setup
+analyze t1; analyze t2;
+
+-- Serial_Test_1.2
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+
+-- Serial_Test_2 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+-- Serial_Test_2 setup:
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+
+-- Serial_Test_2.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 does not fall back with 1 unmatched tuple
+-- batch 3 does not fall back with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1 right outer join t2 on a = b order by a, b;
+
+-- TODO: check coverage for emitting ummatched inner tuples
+-- Serial_Test_2.1.a
+-- results checking for inner join
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+
+-- Serial_Test_2.2
+analyze t1; analyze t2;
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+
+-- Serial_Test_3 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+
+-- Serial_Test_3 setup:
+truncate table t1;
+insert into t1 values(1),(1);
+insert into t1 select 2 from generate_series(1,7)i;
+insert into t1 select i from generate_series(3,10)i;
+truncate table t2;
+insert into t2 select 2 from generate_series(1,7)i;
+insert into t2 values(3),(3);
+insert into t2 select i from generate_series(5,9)i;
+
+-- Serial_Test_3.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with 1 unmatched tuple
+-- batch 2 does not fall back with 2 unmatched tuples
+-- batch 3 falls back with 4 chunks with 1 unmatched tuple
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+
+-- Serial_Test_3.2 
+-- swap join order
+select * from t2 left outer join t1 on a = b order by a, b;
+select * from t2, t1 where a = b order by a;
+select * from t2 right outer join t1 on a = b order by b, a;
+select * from t2 full outer join t1 on a = b order by a, b;
+
+-- Serial_Test_3.3 setup
+analyze t1; analyze t2;
+
+-- Serial_Test_3.3
+-- doesn't spill
+select * from explain_multi_batch();
+
+-- Serial_Test_4 setup
+drop table t1;
+create table t1(b int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+drop table t2;
+create table t2(a int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+-- Serial_Test_4.1
+-- spills in 32 batches
+--batch 0 does not fall back with 1 unmatched outer tuple (15)
+--batch 1 falls back with 396 chunks.
+--batch 2 falls back with 402 chunks with 1 unmatched outer tuple (1)
+--batch 3 falls back with 389 chunks with 1 unmatched outer tuple (8)
+--batch 4 falls back with 409 chunks with no unmatched outer tuples
+--batch 5 falls back with 366 chunks with 1 unmatched outer tuple (4)
+--batch 6 falls back with 407 chunks with 1 unmatched outer tuple (11)
+--batch 7 falls back with 382 chunks with unmatched outer tuple (10)
+--batch 8 falls back with 413 chunks with no unmatched outer tuples
+--batch 9 falls back with 371 chunks with 1 unmatched outer tuple (3)
+--batch 10 falls back with 389 chunks with no unmatched outer tuples
+--batch 11 falls back with 408 chunks with no unmatched outer tuples
+--batch 12 falls back with 387 chunks with no unmatched outer tuples
+--batch 13 falls back with 402 chunks with 1 unmatched outer tuple (18) 
+--batch 14 falls back with 369 chunks with 1 unmatched outer tuple (9)
+--batch 15 falls back with 387 chunks with no unmatched outer tuples
+--batch 16 falls back with 365 chunks with no unmatched outer tuples
+--batch 17 falls back with 403 chunks with 2 unmatched outer tuples (14,19)
+--batch 18 falls back with 375 chunks with no unmatched outer tuples
+--batch 19 falls back with 384 chunks with no unmatched outer tuples
+--batch 20 falls back with 377 chunks with 1 unmatched outer tuple (12)
+--batch 22 falls back with 401 chunks with no unmatched outer tuples
+--batch 23 falls back with 396 chunks with no unmatched outer tuples
+--batch 24 falls back with 387 chunks with 1 unmatched outer tuple (5)
+--batch 25 falls back with 399 chunks with 1 unmatched outer tuple (7)
+--batch 26 falls back with 387 chunks.
+--batch 27 falls back with 442 chunks.
+--batch 28 falls back with 385 chunks with 1 unmatched outer tuple (17)
+--batch 29 falls back with 375 chunks.
+--batch 30 falls back with 404 chunks with 1 unmatched outer tuple (6)
+--batch 31 falls back with 396 chunks with 2 unmatched outer tuples (13,16)
+select * from explain_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+select count(a) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+-- used to give wrong results because there is a whole batch of outer which is
+-- empty and so the inner doesn't emit unmatched tuples with ROJ
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b; 
+
+-- Test_6 non-negligible amount of data test case
+-- TODO: doesn't finish with my code when it is set to be serial
+-- it does finish when it is parallel -- the serial version is either simply too
+-- slow or has a bug -- I tried it with less data and it did finish, so it must
+-- just be really slow
+-- inner join shouldn't even need to make the unmatched files
+-- it finishes eventually if I decrease data amount
+
+--drop table simple;
+--create table simple as
+ -- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+--alter table simple set (parallel_workers = 2);
+--analyze simple;
+--
+--drop table extremely_skewed;
+--create table extremely_skewed (id int, t text);
+--alter table extremely_skewed set (autovacuum_enabled = 'false');
+--alter table extremely_skewed set (parallel_workers = 2);
+--analyze extremely_skewed;
+--insert into extremely_skewed
+--  select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+--  from generate_series(1, 20000);
+--update pg_class
+--  set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+--  where relname = 'extremely_skewed';
+
+--set work_mem=64;
+--set enable_mergejoin to off;
+--explain (analyze, costs off, timing off)
+  --select * from simple r join extremely_skewed s using (id);
+--select * from explain_multi_batch();
+
+drop table t1;
+drop table t2;
+drop function explain_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema adaptive_hj;
diff --git a/src/test/regress/sql/parallel_adaptive_hj.sql b/src/test/regress/sql/parallel_adaptive_hj.sql
new file mode 100644
index 000000000000..3071c5f82efa
--- /dev/null
+++ b/src/test/regress/sql/parallel_adaptive_hj.sql
@@ -0,0 +1,182 @@
+create schema parallel_adaptive_hj;
+set search_path=parallel_adaptive_hj;
+
+-- TODO: anti-semi-join and semi-join tests
+
+-- TODO: check if test2 and 3 are different at all
+
+-- TODO: add test for parallel-oblivious parallel hash join
+
+-- TODO: make this function general
+create or replace function explain_parallel_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- parallel setup
+set enable_nestloop to off;
+set enable_mergejoin to off;
+set  min_parallel_table_scan_size = 0;
+set  parallel_setup_cost = 0;
+set  enable_parallel_hash = on;
+set  enable_hashjoin = on;
+set  max_parallel_workers_per_gather = 1;
+set  work_mem = 64;
+
+-- Parallel_Test_1 setup
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+analyze t1;
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,2500)i;
+insert into t2 select 2 from generate_series(1,10)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+-- Parallel_Test_1.1
+-- spills in 4 batches
+-- 1 resize of nbatches
+-- no batch falls back
+select * from explain_parallel_multi_batch();
+-- need an aggregate to exercise the code but still want to know if we are
+-- emitting the right unmatched outer tuples
+select count(a) from t1 left outer join t2 on a = b;
+select count(*) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_1.1.a
+-- results checking for inner join
+-- doesn't fall back
+select count(*) from t1, t2 where a = b;
+-- Parallel_Test_1.1.b
+-- results checking for right outer join
+-- doesn't exercise the fallback code but just checking results
+select count(*) from t1 right outer join t2 on a = b;
+-- Parallel_Test_1.1.c
+-- results checking for full outer join
+select count(*) from t1 full outer join t2 on a = b;
+
+-- Parallel_Test_1.2
+-- spill and doesn't have to resize nbatches
+analyze t2;
+select * from explain_parallel_multi_batch();
+select count(a) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_1.3
+-- doesn't spill
+-- does resize nbuckets
+set work_mem = '4MB';
+select * from explain_parallel_multi_batch();
+select count(a) from t1 left outer join t2 on a = b;
+set work_mem = 64;
+
+
+-- Parallel_Test_3
+-- big example
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+
+-- TODO: check what each of these is exercising -- chunk num, etc and write that
+-- down
+-- also, note that this example did reveal with ROJ that it wasn't working, so
+-- maybe keep that but it is not parallel
+-- make sure the plans make sense for the code we are writing
+select count(*) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b;
+
+-- Parallel_Test_4
+-- spill and resize nbatches 2x
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,1000)i;
+insert into t2 select 2 from generate_series(1,4000)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+insert into t1 values(500);
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b;
+select count(a) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_5
+-- revealed race condition because two workers are working on a chunked batch
+-- only 2 unmatched tuples
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i%1111 from generate_series(200,10000)i;
+delete from t2 where b = 115;
+delete from t2 where b = 200;
+insert into t2 select 2 from generate_series(1,4000);
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 values(115);
+insert into t1 values(200);
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+
+-- without count(*), can't reproduce desired plan so can't rely on results
+select count(*) from t1 left outer join t2 on a = b;
+
+drop table if exists t1;
+drop table if exists t2;
+drop function explain_parallel_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema parallel_adaptive_hj;
-- 
2.25.0

v4-0003-Address-barrier-wait-deadlock-hazard.patchtext/x-patch; charset=US-ASCII; name=v4-0003-Address-barrier-wait-deadlock-hazard.patchDownload

From 26fcccf150b0a7c1f210b331a47fba5ae302130d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 11 Jan 2020 16:57:34 -0800
Subject: [PATCH v4 3/4] Address barrier wait deadlock hazard

Previously, in the chunk phase machine in
ExecParallelHashJoinNewChunk(), we reused one chunk barrier for all
chunks and looped through the phases then set the phase back to the
initial phase and explicitly jumped there.
Now, we initialize an array of chunk barriers, one per chunk, and use a
new chunk barrier for each chunk.
After finishing probing a chunk, upon re-entering
ExecParallelHashJoinNewChunk(), workers will wait on the chunk barrier
for all participants to arrive.
This is okay because the barrier is advanced to the final phase as part
of this wait (per the comment in nodeHashJoin.c about deadlock risk with
waiting on barriers after emitting tuples).
The last worker to arrive will increment the chunk number.
All workers detach from the chunk barrier they are attached to and
select the next chunk barrier.

The hashtable is now reset in the first phase of the chunk phase machine
PHJ_CHUNK_ELECTING. Note that this will cause an unnecessary hashtable
reset for the first chunk.

The loading and probing phases of the chunk phase machine stay the same.

If a worker joins in the PHJ_CHUNK_DONE phase, it will simply detach
from the chunk barrier and move on to the next chunk barrier in the
array of chunk barriers.

In order to mitigate the other cause of deadlock hazard (workers wait on
the batch barrier after emitting tuples), now, in
ExecParallelHashJoinNewBatch(), if we are attached to a batch barrier
and it is a fallback batch, all workers will detach from the batch
barrier and then end their scan of that batch.
The last worker to detach will combine the outer match status files,
then it will detach from the batch, clean up the hashtable, and end its
scan of the inner side.
Then it will return and proceed to emit unmatched outer tuples.
In PHJ_BATCH_ELECTING, the worker that ends up allocating the hashtable
will also initialize the chunk barriers.

Also, this commit moves combined_bitmap to batch from hjstate. It will
be moved into the SharedBits store once that API is added.

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
---
 src/backend/executor/adaptiveHashjoin.c | 357 +++++++++++++++---------
 src/backend/executor/nodeHash.c         |   1 +
 src/backend/executor/nodeHashjoin.c     | 107 ++++---
 src/backend/postmaster/pgstat.c         |   3 +
 src/include/executor/adaptiveHashjoin.h |   1 -
 src/include/executor/hashjoin.h         |  15 +-
 src/include/executor/nodeHash.h         |   1 +
 src/include/nodes/execnodes.h           |   1 -
 src/include/pgstat.h                    |   1 +
 9 files changed, 301 insertions(+), 186 deletions(-)

diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
index 64af2a24f346..45846a076916 100644
--- a/src/backend/executor/adaptiveHashjoin.c
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -24,7 +24,9 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 	ParallelHashJoinBatch *phj_batch;
 	SharedTuplestoreAccessor *outer_tuples;
 	SharedTuplestoreAccessor *inner_tuples;
+	Barrier    *barriers;
 	Barrier    *chunk_barrier;
+	Barrier    *old_chunk_barrier;
 
 	hashtable = hjstate->hj_HashTable;
 	batchno = hashtable->curbatch;
@@ -33,10 +35,11 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 	inner_tuples = hashtable->batches[batchno].inner_tuples;
 
 	/*
-	 * This chunk_barrier is initialized in the ELECTING phase when this
+	 * These chunk_barriers are initialized in the ELECTING phase when this
 	 * worker attached to the batch in ExecParallelHashJoinNewBatch()
 	 */
-	chunk_barrier = &hashtable->batches[batchno].shared->chunk_barrier;
+	barriers = dsa_get_address(hashtable->area, hashtable->batches[batchno].shared->chunk_barriers);
+	old_chunk_barrier = &(barriers[phj_batch->current_chunk - 1]);
 
 	/*
 	 * If this worker just came from probing (from HJ_SCAN_BUCKET) we need to
@@ -49,14 +52,16 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 		 * The current chunk number can't be incremented if *any* worker isn't
 		 * done yet (otherwise they might access the wrong data structure!)
 		 */
-		if (BarrierArriveAndWait(chunk_barrier,
+		if (BarrierArriveAndWait(old_chunk_barrier,
 								 WAIT_EVENT_HASH_CHUNK_PROBING))
 			phj_batch->current_chunk++;
-
+		BarrierDetach(old_chunk_barrier);
 		/* Once the barrier is advanced we'll be in the DONE phase */
 	}
-	else
-		BarrierAttach(chunk_barrier);
+	if (phj_batch->current_chunk > phj_batch->total_chunks)
+		return false;
+	chunk_barrier = &(barriers[phj_batch->current_chunk - 1]);
+	/* is this a race condition ? */
 
 	/*
 	 * The outer side is exhausted and either 1) the current chunk of the
@@ -64,105 +69,186 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 	 * chunk of the inner side is exhausted and it is time to advance the
 	 * batch
 	 */
-	switch (BarrierPhase(chunk_barrier))
+
+	for (;;)
 	{
-			/*
-			 * TODO: remove this phase and coordinate access to hashtable
-			 * above goto and after incrementing current_chunk
-			 */
-		case PHJ_CHUNK_ELECTING:
-	phj_chunk_electing:
-			BarrierArriveAndWait(chunk_barrier,
-								 WAIT_EVENT_HASH_CHUNK_ELECTING);
-			/* Fall through. */
+		switch (BarrierAttach(chunk_barrier))
+		{
+			case PHJ_CHUNK_ELECTING:
+				if (BarrierArriveAndWait(chunk_barrier,
+										 WAIT_EVENT_HASH_CHUNK_ELECTING))
+				{
+					/*
+					 * TODO: this will unnecessarily reset the hashtable for
+					 * the first chunk. fix this?
+					 */
+					/*
+					 * rewind/reset outer tuplestore and rewind outer match
+					 * status files
+					 */
+					sts_reinitialize(outer_tuples);
 
-		case PHJ_CHUNK_LOADING:
-			/* Start (or join in) loading the next chunk of inner tuples. */
-			sts_begin_parallel_scan(inner_tuples);
+					/*
+					 * reset inner's hashtable and recycle the existing bucket
+					 * array.
+					 */
+					dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
+					dsa_get_address(hashtable->area, phj_batch->buckets);
 
-			MinimalTuple tuple;
-			tupleMetadata metadata;
+					for (size_t i = 0; i < hashtable->nbuckets; ++i)
+						dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
 
-			while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
-			{
-				if (metadata.chunk != phj_batch->current_chunk)
-					continue;
+					/*
+					 * TODO: this will unfortunately rescan all inner tuples
+					 * in the batch for each chunk
+					 */
 
-				ExecForceStoreMinimalTuple(tuple,
-										   hjstate->hj_HashTupleSlot,
-										   false);
+					/*
+					 * should be able to save the block in the file which
+					 * starts the next chunk instead
+					 */
+					sts_reinitialize(inner_tuples);
+				}
+				/* Fall through. */
+			case PHJ_CHUNK_RESETTING:
+				BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_RESETTING);
+			case PHJ_CHUNK_LOADING:
+				/* Start (or join in) loading the next chunk of inner tuples. */
+				sts_begin_parallel_scan(inner_tuples);
+
+				MinimalTuple tuple;
+				tupleMetadata metadata;
+
+				while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+				{
+					if (metadata.chunk != phj_batch->current_chunk)
+						continue;
+
+					ExecForceStoreMinimalTuple(tuple,
+											   hjstate->hj_HashTupleSlot,
+											   false);
+
+					ExecParallelHashTableInsertCurrentBatch(
+															hashtable,
+															hjstate->hj_HashTupleSlot,
+															metadata.hashvalue);
+				}
+				sts_end_parallel_scan(inner_tuples);
+				BarrierArriveAndWait(chunk_barrier,
+									 WAIT_EVENT_HASH_CHUNK_LOADING);
+				/* Fall through. */
+
+			case PHJ_CHUNK_PROBING:
+				sts_begin_parallel_scan(outer_tuples);
+				return true;
 
-				ExecParallelHashTableInsertCurrentBatch(
-														hashtable,
-														hjstate->hj_HashTupleSlot,
-														metadata.hashvalue);
-			}
-			sts_end_parallel_scan(inner_tuples);
-			BarrierArriveAndWait(chunk_barrier,
-								 WAIT_EVENT_HASH_CHUNK_LOADING);
-			/* Fall through. */
+			case PHJ_CHUNK_DONE:
+				if (phj_batch->current_chunk > phj_batch->total_chunks)
+					return false;
+				/* TODO: exercise this somehow (ideally, in a test) */
+				BarrierDetach(chunk_barrier);
+				if (chunk_barrier < barriers + phj_batch->total_chunks)
+				{
+					++chunk_barrier;
+					continue;
+				}
+				else
+					return false;
 
-		case PHJ_CHUNK_PROBING:
-			sts_begin_parallel_scan(outer_tuples);
-			return true;
+			default:
+				elog(ERROR, "unexpected chunk phase %d. pid %i. batch %i.",
+					 BarrierPhase(chunk_barrier), MyProcPid, batchno);
+		}
+	}
 
-		case PHJ_CHUNK_DONE:
+	return false;
+}
 
-			BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_DONE);
 
-			if (phj_batch->current_chunk > phj_batch->total_chunks)
-			{
-				BarrierDetach(chunk_barrier);
-				return false;
-			}
+static void
+ExecHashTableLoopDetachBatchForChosen(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state != NULL &&
+		hashtable->curbatch >= 0)
+	{
+		int			curbatch = hashtable->curbatch;
+		ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
 
-			/*
-			 * Otherwise it is time for the next chunk. One worker should
-			 * reset the hashtable
-			 */
-			if (BarrierArriveExplicitAndWait(chunk_barrier, PHJ_CHUNK_ELECTING, WAIT_EVENT_HASH_ADVANCE_CHUNK))
-			{
-				/*
-				 * rewind/reset outer tuplestore and rewind outer match status
-				 * files
-				 */
-				sts_reinitialize(outer_tuples);
+		/* Make sure any temporary files are closed. */
+		sts_end_parallel_scan(hashtable->batches[curbatch].inner_tuples);
 
-				/*
-				 * reset inner's hashtable and recycle the existing bucket
-				 * array.
-				 */
-				dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
-				dsa_get_address(hashtable->area, phj_batch->buckets);
+		/* Detach from the batch we were last working on. */
 
-				for (size_t i = 0; i < hashtable->nbuckets; ++i)
-					dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+		/*
+		 * Technically we shouldn't access the barrier because we're no longer
+		 * attached, but since there is no way it's moving after this point it
+		 * seems safe to make the following assertion.
+		 */
+		Assert(BarrierPhase(&batch->batch_barrier) == PHJ_BATCH_DONE);
 
-				/*
-				 * TODO: this will unfortunately rescan all inner tuples in
-				 * the batch for each chunk
-				 */
+		/* Free shared chunks and buckets. */
+		while (DsaPointerIsValid(batch->chunks))
+		{
+			HashMemoryChunk chunk =
+			dsa_get_address(hashtable->area, batch->chunks);
+			dsa_pointer next = chunk->next.shared;
 
-				/*
-				 * should be able to save the block in the file which starts
-				 * the next chunk instead
-				 */
-				sts_reinitialize(inner_tuples);
-			}
-			goto phj_chunk_electing;
+			dsa_free(hashtable->area, batch->chunks);
+			batch->chunks = next;
+		}
+		if (DsaPointerIsValid(batch->buckets))
+		{
+			dsa_free(hashtable->area, batch->buckets);
+			batch->buckets = InvalidDsaPointer;
+		}
 
-		case PHJ_CHUNK_FINAL:
-			BarrierDetach(chunk_barrier);
-			return false;
+		/*
+		 * Free chunk barrier
+		 */
+		/* TODO: why is this NULL check needed? */
+		if (DsaPointerIsValid(batch->chunk_barriers))
+		{
+			dsa_free(hashtable->area, batch->chunk_barriers);
+			batch->chunk_barriers = InvalidDsaPointer;
+		}
 
-		default:
-			elog(ERROR, "unexpected chunk phase %d. pid %i. batch %i.",
-				 BarrierPhase(chunk_barrier), MyProcPid, batchno);
-	}
+		/*
+		 * Track the largest batch we've been attached to.  Though each
+		 * backend might see a different subset of batches, explain.c will
+		 * scan the results from all backends to find the largest value.
+		 */
+		hashtable->spacePeak =
+			Max(hashtable->spacePeak,
+				batch->size + sizeof(dsa_pointer_atomic) * hashtable->nbuckets);
 
-	return false;
+	}
 }
 
+static void
+ExecHashTableLoopDetachBatchForOthers(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state != NULL &&
+		hashtable->curbatch >= 0)
+	{
+		int			curbatch = hashtable->curbatch;
+		ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+
+		sts_end_parallel_scan(hashtable->batches[curbatch].inner_tuples);
+		sts_end_parallel_scan(hashtable->batches[curbatch].outer_tuples);
+
+		/*
+		 * Track the largest batch we've been attached to.  Though each
+		 * backend might see a different subset of batches, explain.c will
+		 * scan the results from all backends to find the largest value.
+		 */
+		hashtable->spacePeak =
+			Max(hashtable->spacePeak,
+				batch->size + sizeof(dsa_pointer_atomic) * hashtable->nbuckets);
+
+		/* Remember that we are not attached to a batch. */
+		hashtable->curbatch = -1;
+	}
+}
 
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
@@ -183,18 +269,6 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	if (hashtable->batches == NULL)
 		return false;
 
-	/*
-	 * For hashloop fallback only Only the elected worker who was chosen to
-	 * combine the outer match status bitmaps should reach here. This worker
-	 * must do some final cleanup and then detach from the batch
-	 */
-	if (hjstate->combined_bitmap != NULL)
-	{
-		BufFileClose(hjstate->combined_bitmap);
-		hjstate->combined_bitmap = NULL;
-		hashtable->batches[hashtable->curbatch].done = true;
-		ExecHashTableDetachBatch(hashtable);
-	}
 
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
@@ -211,41 +285,62 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		ParallelHashJoinBatchAccessor *accessor = hashtable->batches + hashtable->curbatch;
 		ParallelHashJoinBatch *batch = accessor->shared;
 
-		/*
-		 * End the parallel scan on the outer tuples before we arrive at the
-		 * next barrier so that the last worker to arrive at that barrier can
-		 * reinitialize the SharedTuplestore for another parallel scan.
-		 */
-
 		if (!batch->parallel_hashloop_fallback)
-			BarrierArriveAndWait(&batch->batch_barrier,
-								 WAIT_EVENT_HASH_BATCH_PROBING);
+		{
+			hashtable->batches[hashtable->curbatch].done = true;
+			ExecHashTableDetachBatch(hashtable);
+		}
+
+		else if (accessor->combined_bitmap != NULL)
+		{
+			BufFileClose(accessor->combined_bitmap);
+			accessor->combined_bitmap = NULL;
+			accessor->done = true;
+
+			/*
+			 * though we have already de-commissioned the shared area of the
+			 * hashtable the curbatch is backend-local and should still be
+			 * valid
+			 */
+			sts_end_parallel_scan(hashtable->batches[hashtable->curbatch].outer_tuples);
+			hashtable->curbatch = -1;
+		}
+
 		else
 		{
 			sts_close_outer_match_status_file(accessor->outer_tuples);
 
 			/*
 			 * If all workers (including this one) have finished probing the
-			 * batch, one worker is elected to Combine all the outer match
-			 * status files from the workers who were attached to this batch
-			 * Loop through the outer match status files from all workers that
-			 * were attached to this batch Combine them into one bitmap Use
-			 * the bitmap, loop through the outer batch file again, and emit
-			 * unmatched tuples
+			 * batch, one worker is elected to Loop through the outer match
+			 * status files from all workers that were attached to this batch
+			 * Combine them into one bitmap Use the bitmap, loop through the
+			 * outer batch file again, and emit unmatched tuples All workers
+			 * will detach from the batch barrier and the last worker will
+			 * clean up the hashtable. All workers except the last worker will
+			 * end their scans of the outer and inner side The last worker
+			 * will end its scan of the inner side
 			 */
-
-			if (BarrierArriveAndWait(&batch->batch_barrier,
-									 WAIT_EVENT_HASH_BATCH_PROBING))
+			if (BarrierArriveAndDetach(&batch->batch_barrier))
 			{
-				hjstate->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				/*
+				 * For hashloop fallback only Only the elected worker who was
+				 * chosen to combine the outer match status bitmaps should
+				 * reach here. This worker must do some final cleanup and then
+				 * detach from the batch
+				 */
+				accessor->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				ExecHashTableLoopDetachBatchForChosen(hashtable);
 				hjstate->last_worker = true;
 				return true;
 			}
+			/* the elected combining worker should not reach here */
+			else
+			{
+				hashtable->batches[hashtable->curbatch].done = true;
+				ExecHashTableLoopDetachBatchForOthers(hashtable);
+			}
 		}
-
-		/* the elected combining worker should not reach here */
-		hashtable->batches[hashtable->curbatch].done = true;
-		ExecHashTableDetachBatch(hashtable);
 	}
 
 	/*
@@ -272,11 +367,16 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 											 WAIT_EVENT_HASH_BATCH_ELECTING))
 					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
-						Barrier    *chunk_barrier =
-						&hashtable->batches[batchno].shared->chunk_barrier;
+						ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+						phj_batch->chunk_barriers = dsa_allocate(hashtable->area, phj_batch->total_chunks * sizeof(Barrier));
+						Barrier    *barriers = dsa_get_address(hashtable->area, phj_batch->chunk_barriers);
 
-						BarrierInit(chunk_barrier, 0);
-						hashtable->batches[batchno].shared->current_chunk = 1;
+						for (int i = 0; i < phj_batch->total_chunks; i++)
+						{
+							BarrierInit(&(barriers[i]), 0);
+						}
+						phj_batch->current_chunk = 1;
 					}
 					/* Fall through. */
 
@@ -314,17 +414,6 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 
 					return true;
 
-				case PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING:
-
-					/*
-					 * The batch isn't done but this worker can't contribute
-					 * anything to it so it might as well be done from this
-					 * worker's perspective. (Only one worker can do work in
-					 * this phase).
-					 */
-
-					/* Fall through. */
-
 				case PHJ_BATCH_DONE:
 
 					/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index cb2f95ac0a76..afdc31a3b30c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -3173,6 +3173,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 		accessor->shared = shared;
 		accessor->preallocated = 0;
 		accessor->done = false;
+		accessor->combined_bitmap = NULL;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 6a8efc0765a4..a454cba54543 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1068,57 +1068,79 @@ ExecParallelHashJoin(PlanState *pstate)
 				/* FALL THRU */
 
 			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+				{
+					ParallelHashJoinBatchAccessor *batch_accessor =
+					&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
 
-				Assert(node->combined_bitmap != NULL);
-
-				outer_acc = node->hj_HashTable->batches[node->hj_HashTable->curbatch].outer_tuples;
+					Assert(batch_accessor->combined_bitmap != NULL);
 
-				MinimalTuple tuple;
+					/*
+					 * TODO: there should be a way to know the current batch
+					 * for the purposes of getting
+					 */
 
-				do
-				{
-					tupleMetadata metadata;
+					/*
+					 * the outer tuplestore without needing curbatch from the
+					 * hashtable so we can detach
+					 */
+					/* from the batch (ExecHashTableDetachBatch) */
+					outer_acc =
+						batch_accessor->outer_tuples;
+					MinimalTuple tuple;
 
-					if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
-						break;
+					do
+					{
+						tupleMetadata metadata;
+
+						if ((tuple =
+							 sts_parallel_scan_next(outer_acc, &metadata)) ==
+							NULL)
+							break;
+
+						uint32		bytenum = metadata.tupleid / 8;
+						unsigned char bit = metadata.tupleid % 8;
+						unsigned char byte_to_check = 0;
+
+						/* seek to byte to check */
+						if (BufFileSeek(batch_accessor->combined_bitmap,
+										0,
+										bytenum,
+										SEEK_SET))
+							ereport(ERROR,
+									(errcode_for_file_access(),
+									 errmsg(
+											"could not rewind shared outer temporary file: %m")));
+						/* read byte containing ntuple bit */
+						if (BufFileRead(batch_accessor->combined_bitmap, &byte_to_check, 1) ==
+							0)
+							ereport(ERROR,
+									(errcode_for_file_access(),
+									 errmsg(
+											"could not read byte in outer match status bitmap: %m.")));
+						/* if bit is set */
+						bool		match = ((byte_to_check) >> bit) & 1;
+
+						if (!match)
+							break;
+					}
+					while (1);
 
-					uint32		bytenum = metadata.tupleid / 8;
-					unsigned char bit = metadata.tupleid % 8;
-					unsigned char byte_to_check = 0;
-
-					/* seek to byte to check */
-					if (BufFileSeek(node->combined_bitmap, 0, bytenum, SEEK_SET))
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not rewind shared outer temporary file: %m")));
-					/* read byte containing ntuple bit */
-					if (BufFileRead(node->combined_bitmap, &byte_to_check, 1) == 0)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not read byte in outer match status bitmap: %m.")));
-					/* if bit is set */
-					bool		match = ((byte_to_check) >> bit) & 1;
-
-					if (!match)
+					if (tuple == NULL)
+					{
+						sts_end_parallel_scan(outer_acc);
+						node->hj_JoinState = HJ_NEED_NEW_BATCH;
 						break;
-				} while (1);
-
-				if (tuple == NULL)
-				{
-					sts_end_parallel_scan(outer_acc);
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					break;
-				}
-
-				/* Emit the unmatched tuple */
-				ExecForceStoreMinimalTuple(tuple,
-										   econtext->ecxt_outertuple,
-										   false);
-				econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					}
 
-				return ExecProject(node->js.ps.ps_ProjInfo);
+					/* Emit the unmatched tuple */
+					ExecForceStoreMinimalTuple(tuple,
+											   econtext->ecxt_outertuple,
+											   false);
+					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
 
+					return ExecProject(node->js.ps.ps_ProjInfo);
 
+				}
 			default:
 				elog(ERROR, "unrecognized hashjoin state: %d",
 					 (int) node->hj_JoinState);
@@ -1170,7 +1192,6 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_InnerExhausted = false;
 
 	hjstate->last_worker = false;
-	hjstate->combined_bitmap = NULL;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e582365e8409..887e3fa75022 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3812,6 +3812,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_CHUNK_ELECTING:
 			event_name = "Hash/Chunk/Electing";
 			break;
+		case WAIT_EVENT_HASH_CHUNK_RESETTING:
+			event_name = "Hash/Chunk/Resetting";
+			break;
 		case WAIT_EVENT_HASH_CHUNK_LOADING:
 			event_name = "Hash/Chunk/Loading";
 			break;
diff --git a/src/include/executor/adaptiveHashjoin.h b/src/include/executor/adaptiveHashjoin.h
index 030a04c5c005..135aed0b199c 100644
--- a/src/include/executor/adaptiveHashjoin.h
+++ b/src/include/executor/adaptiveHashjoin.h
@@ -5,5 +5,4 @@
 extern bool ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing);
 extern bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
 
-
 #endif							/* ADAPTIVE_HASHJOIN_H */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index e5a00f84e321..b2cc12dc19be 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -166,7 +166,7 @@ typedef struct ParallelHashJoinBatch
 	int			total_chunks;
 	int			current_chunk;
 	size_t		estimated_chunk_size;
-	Barrier		chunk_barrier;
+	dsa_pointer chunk_barriers;
 	LWLock		lock;
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
@@ -221,6 +221,7 @@ typedef struct ParallelHashJoinBatchAccessor
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
 	bool		done;			/* flag to remember that a batch is done */
+	BufFile    *combined_bitmap;	/* for Adaptive Hashjoin only  */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
 } ParallelHashJoinBatchAccessor;
@@ -282,14 +283,14 @@ typedef struct ParallelHashJoinState
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
 #define PHJ_BATCH_CHUNKING				2
-#define PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING 3
-#define PHJ_BATCH_DONE					4
+#define PHJ_BATCH_DONE					3
 
 #define PHJ_CHUNK_ELECTING				0
-#define PHJ_CHUNK_LOADING				1
-#define PHJ_CHUNK_PROBING				2
-#define PHJ_CHUNK_DONE					3
-#define PHJ_CHUNK_FINAL					4
+#define PHJ_CHUNK_RESETTING				1
+#define PHJ_CHUNK_LOADING				2
+#define PHJ_CHUNK_PROBING				3
+#define PHJ_CHUNK_DONE					4
+#define PHJ_CHUNK_FINAL					5
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index dfc221e6a111..f6d5b477085e 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -31,6 +31,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b4f5f0357cb7..21e682334b15 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1967,7 +1967,6 @@ typedef struct HashJoinState
 
 	/* parallel hashloop fallback outer side */
 	bool		last_worker;
-	BufFile    *combined_bitmap;
 } HashJoinState;
 
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 340086a7e77c..dd2e8bd655d5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -843,6 +843,7 @@ typedef enum
 	WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
 	WAIT_EVENT_HASH_CHUNK_ELECTING,
+	WAIT_EVENT_HASH_CHUNK_RESETTING,
 	WAIT_EVENT_HASH_CHUNK_LOADING,
 	WAIT_EVENT_HASH_CHUNK_PROBING,
 	WAIT_EVENT_HASH_CHUNK_DONE,
-- 
2.25.0

#46

Melanie Plageman

melanieplageman@gmail.com

almost 6 years ago

In reply to: Melanie Plageman (#45)

5 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

I've implemented avoiding rescanning all inner tuples for each stripe
in the attached patch:
v5-0005-Avoid-rescanning-inner-tuples-per-stripe.patch

Patchset is rebased--and I had my first merge conflicts as I contend
with maintaining this long-running branch with large differences
between it and current hashjoin. I think I'll need to reconsider the
changes I've made if I want to make it maintainable.

As for patch 0005, not rescanning inner tuples for every stripe,
basically, instead of reinitializing the SharedTuplestore for the
inner side for each stripe (I'm using "stripe" from now on, but I
haven't done any retroactive renaming yet) during fallback, each
participant's read_page is set to the beginning of the
SharedTuplestoreChunk which contains the end of one stripe and the
beginning of another.

Previously all inner tuples were scanned and only tuples from the
current stripe were loaded.

Each SharedTuplestoreAccessor now has a variable "start_page", which
is initialized when it is assigned its read_page (which will always be
the beginning of a SharedTuplestoreChunk).

While loading tuples into the hashtable, if a tuple is from a past
stripe, the worker skips it (that will happen when a stripe straddles
two SharedTuplestoreChunks). If a tuple is from the future, the worker
backs that SharedTuplestoreChunk out and sets the shared read_page (in
the shared SharedTuplestoreParticipant) back to its start_page.

There are a couple mechanisms to provide for synchronization that
address specific race conditions/synchronization points -- those
scenarios are laid out in the commit message.

The first is a rule that a worker can only set read_page to a
start_page which is less than the current value of read_page.

The second is a "rewound" flag in the SharedTuplestoreParticipant. It
indicates if this participant has been rewound during loading of the
current stripe. If it has, a worker cannot be assigned a
SharedTuplestoreChunk. This flag is reset between stripes.

In this patch, Hashjoin makes an unacceptable intrusion into the
SharedTuplestore API. I am looking for feedback on how to solve this.

Basically, because the SharedTuplestore does not know about stripes or
about HashJoin, the logic to decide if a tuple should be loaded into a
hashtable or not is in the stripe phase machine where tuples are loaded
into the hashtable.

So, to ensure that workers have read from all participant files before
assuming all tuples from a stripe are loaded, I have duplicated the
logic from sts_parallel_scan_next() which has workers get the next
participant file and added it into in the body of the tuple loading
loop in the stripe phase machine (see sts_ready_for_next_stripe() and
sts_seen_all_participants()).

This clearly needs to be fixed and it is arguable that there are other
intrusions into the SharedTuplestore API in these patches.

One option is to write each stripe for each participant to a different
file, preserving the idea that a worker is done with a read_file when it
is at EOF.

Outside of addressing the relationship between SharedTuplestore,
stripes, and Hashjoin, I have re-prioritized the next steps for the
patch as follows:

Next Steps:
1) Rename "chunk" to "stripe"
1) refine fallback logic
3) refactor code to make it easier to keep it rebased
4) EXPLAIN ANALYZE instrumentation to show stripes probed by workers
5) anti/semi-join support

1)
The chunk/stripe thing is becoming extremely confusing.

2)
I re-prioritized refining the fallback logic because the premature
disabling of growth in serial hashjoin is making the join_hash test so
slow that it is slowing down iteration speed for me.

3)
I am wondering if Thomas Munro's idea to template-ize Hashjoin [1]/messages/by-id/CA+hUKGJjs6H77u+PL3ovMSowFZ8nib9Z+nHGNF6YNmw6osUU+A@mail.gmail.com
would make maintaining the diff easier, harder, or no different. The
code I've added made the main hashjoin state machine incredibly long,
so I broke it up into Parallel Hashjoin and Serial Hashjoin to make it
more manageable. This, of course, lends itself to difficult rebasing
(luckily only one small commit has been made to nodeHashjoin.c). If
the template-ization were to happen sooner, I could refactor my code
so that there were at least the same function names and the diffs
would be more clear.

4)
It is important that I have some way of knowing if I'm even exercising
code that I'm adding that involves multiple workers probing the same
stripes. As I make changes to the code, even though it will not
necessarily be deterministic, I can change the tests if I am no longer
able to get any of the concurrent behavior I'm looking for.

5)
Seems like it's time

[1]: /messages/by-id/CA+hUKGJjs6H77u+PL3ovMSowFZ8nib9Z+nHGNF6YNmw6osUU+A@mail.gmail.com
/messages/by-id/CA+hUKGJjs6H77u+PL3ovMSowFZ8nib9Z+nHGNF6YNmw6osUU+A@mail.gmail.com

--
Melanie Plageman

Attachments:

v5-0002-Fixup-tupleMetadata-struct-issues.patchapplication/octet-stream; name=v5-0002-Fixup-tupleMetadata-struct-issues.patchDownload

From 905fb5bee42e8bd7e8d64d9988f6dcd4aad05a15 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 7 Jan 2020 16:28:32 -0800
Subject: [PATCH v5 2/5] Fixup tupleMetadata struct issues

Remove __attribute__((packed)) from tupleMetadata. It is not needed
since I am using sizeof(struct tupleMetadata).

Change tupleMetadata members to include a union with an anonymous union
containing tupleid/chunk number.
tupleMetadata's tupleid member will be the tupleid in the outer side and
the chunk number in the inner side. Use a union for this since they will
be different types. Also, fix the signedness and type issues in code
using it. For now, this uses a 32bit int for tuples as I use an atomic
and 64bit atomic operations are not supported on all architecture/OS
combinations. It remains a TODO to make this variable backend local and
combine it to reduce the amount of synchronization needed.
Additionally, the tupleid/chunk number member should not be included for
non-fallback batches, as it bloats the tuplestore.

Also, this patch contains assorted updates to variable names/TODOs.
---
 src/backend/executor/adaptiveHashjoin.c   | 10 +++----
 src/backend/executor/nodeHash.c           | 30 +++++++++++++++------
 src/backend/executor/nodeHashjoin.c       | 25 ++++++++++-------
 src/backend/utils/sort/sharedtuplestore.c | 33 ++++++++++++-----------
 src/include/executor/hashjoin.h           |  4 +--
 src/include/utils/sharedtuplestore.h      | 16 +++++------
 6 files changed, 70 insertions(+), 48 deletions(-)

diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
index dff5b38d38..64af2a24f3 100644
--- a/src/backend/executor/adaptiveHashjoin.c
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -51,7 +51,7 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 		 */
 		if (BarrierArriveAndWait(chunk_barrier,
 								 WAIT_EVENT_HASH_CHUNK_PROBING))
-			phj_batch->current_chunk_num++;
+			phj_batch->current_chunk++;
 
 		/* Once the barrier is advanced we'll be in the DONE phase */
 	}
@@ -68,7 +68,7 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 	{
 			/*
 			 * TODO: remove this phase and coordinate access to hashtable
-			 * above goto and after incrementing current_chunk_num
+			 * above goto and after incrementing current_chunk
 			 */
 		case PHJ_CHUNK_ELECTING:
 	phj_chunk_electing:
@@ -85,7 +85,7 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 
 			while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
 			{
-				if (metadata.tupleid != phj_batch->current_chunk_num)
+				if (metadata.chunk != phj_batch->current_chunk)
 					continue;
 
 				ExecForceStoreMinimalTuple(tuple,
@@ -110,7 +110,7 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 
 			BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_DONE);
 
-			if (phj_batch->current_chunk_num > phj_batch->total_num_chunks)
+			if (phj_batch->current_chunk > phj_batch->total_chunks)
 			{
 				BarrierDetach(chunk_barrier);
 				return false;
@@ -276,7 +276,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 						&hashtable->batches[batchno].shared->chunk_barrier;
 
 						BarrierInit(chunk_barrier, 0);
-						hashtable->batches[batchno].shared->current_chunk_num = 1;
+						hashtable->batches[batchno].shared->current_chunk = 1;
 					}
 					/* Fall through. */
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index c5420b169e..cb2f95ac0a 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1362,7 +1362,7 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 				/* TODO: should I check batch estimated size here at all? */
 				if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > hashtable->parallel_state->space_allowed))
 				{
-					phj_batch->total_num_chunks++;
+					phj_batch->total_chunks++;
 					phj_batch->estimated_chunk_size = tuple_size;
 				}
 				else
@@ -1371,10 +1371,15 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 				tupleMetadata metadata;
 
 				metadata.hashvalue = hashTuple->hashvalue;
-				metadata.tupleid = phj_batch->total_num_chunks;
+				metadata.chunk = phj_batch->total_chunks;
 				LWLockRelease(&phj_batch->lock);
 
 				hashtable->batches[batchno].estimated_size += tuple_size;
+
+				/*
+				 * TODO: only put the chunk num if it is a fallback batch
+				 * (avoid bloating the metadata written to the file)
+				 */
 				sts_puttuple(hashtable->batches[batchno].inner_tuples,
 							 &metadata, tuple);
 			}
@@ -1451,14 +1456,19 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 			/* TODO: should I check batch estimated size here at all? */
 			if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
 			{
-				phj_batch->total_num_chunks++;
+				phj_batch->total_chunks++;
 				phj_batch->estimated_chunk_size = tuple_size;
 			}
 			else
 				phj_batch->estimated_chunk_size += tuple_size;
-			metadata.tupleid = phj_batch->total_num_chunks;
+			metadata.chunk = phj_batch->total_chunks;
 			LWLockRelease(&phj_batch->lock);
 			/* Store the tuple its new batch. */
+
+			/*
+			 * TODO: only put the chunk num if it is a fallback batch (avoid
+			 * bloating the metadata written to the file)
+			 */
 			sts_puttuple(hashtable->batches[batchno].inner_tuples,
 						 &metadata, tuple);
 
@@ -1821,7 +1831,7 @@ retry:
 		 */
 		if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
 		{
-			phj_batch->total_num_chunks++;
+			phj_batch->total_chunks++;
 			phj_batch->estimated_chunk_size = tuple_size;
 		}
 		else
@@ -1830,9 +1840,13 @@ retry:
 		tupleMetadata metadata;
 
 		metadata.hashvalue = hashvalue;
-		metadata.tupleid = phj_batch->total_num_chunks;
+		metadata.chunk = phj_batch->total_chunks;
 		LWLockRelease(&phj_batch->lock);
 
+		/*
+		 * TODO: only put the chunk num if it is a fallback batch (avoid
+		 * bloating the metadata written to the file)
+		 */
 		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata,
 					 tuple);
 	}
@@ -3043,8 +3057,8 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 		shared->parallel_hashloop_fallback = false;
 		LWLockInitialize(&shared->lock,
 						 LWTRANCHE_PARALLEL_HASH_JOIN_BATCH);
-		shared->current_chunk_num = 0;
-		shared->total_num_chunks = 1;
+		shared->current_chunk = 0;
+		shared->total_chunks = 1;
 		shared->estimated_chunk_size = 0;
 
 		/*
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 565b0c289f..91c0170e40 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -435,9 +435,8 @@ ExecHashJoin(PlanState *pstate)
 				{
 					/*
 					 * The current outer tuple has run out of matches, so
-					 * check whether to emit a dummy outer-join tuple.
-					 * Whether we emit one or not, the next state is
-					 * NEED_NEW_OUTER.
+					 * check whether to emit a dummy outer-join tuple. Whether
+					 * we emit one or not, the next state is NEED_NEW_OUTER.
 					 */
 					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 					if (!node->hashloop_fallback || node->hj_HashTable->curbatch == 0)
@@ -907,7 +906,7 @@ ExecParallelHashJoin(PlanState *pstate)
 
 				ParallelHashJoinBatch *phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
 
-				if (!phj_batch->parallel_hashloop_fallback || phj_batch->current_chunk_num == 1)
+				if (!phj_batch->parallel_hashloop_fallback || phj_batch->current_chunk == 1)
 					node->hj_MatchedOuter = false;
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -924,9 +923,8 @@ ExecParallelHashJoin(PlanState *pstate)
 				{
 					/*
 					 * The current outer tuple has run out of matches, so
-					 * check whether to emit a dummy outer-join tuple.
-					 * Whether we emit one or not, the next state is
-					 * NEED_NEW_OUTER.
+					 * check whether to emit a dummy outer-join tuple. Whether
+					 * we emit one or not, the next state is NEED_NEW_OUTER.
 					 */
 					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 					if (!phj_batch->parallel_hashloop_fallback)
@@ -1096,7 +1094,7 @@ ExecParallelHashJoin(PlanState *pstate)
 					if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
 						break;
 
-					int			bytenum = metadata.tupleid / 8;
+					uint32		bytenum = metadata.tupleid / 8;
 					unsigned char bit = metadata.tupleid % 8;
 					unsigned char byte_to_check = 0;
 
@@ -1489,7 +1487,7 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 		MinimalTuple tuple;
 
 		tupleMetadata metadata;
-		int			tupleid;
+		uint32		tupleid;
 
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
 									   &metadata);
@@ -1906,7 +1904,16 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 			metadata.hashvalue = hashvalue;
 			SharedTuplestoreAccessor *accessor = hashtable->batches[batchno].outer_tuples;
 
+			/*
+			 * TODO: add a comment that this means the order is not
+			 * deterministic so don't count on it
+			 */
 			metadata.tupleid = sts_increment_tuplenum(accessor);
+
+			/*
+			 * TODO: only add the tupleid when it is a fallback batch to avoid
+			 * bloating of the sharedtuplestore
+			 */
 			sts_puttuple(accessor, &metadata, mintup);
 
 			if (shouldFree)
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 3cd2ec2e2e..0e5e9db820 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -57,11 +57,15 @@ typedef struct SharedTuplestoreParticipant
 } SharedTuplestoreParticipant;
 
 /* The control object that lives in shared memory. */
+/*  TODO: ntuples atomic 32 bit int is iffy. Didn't use 64bit because wasn't sure */
+/*  about 64bit atomic ints portability */
+/*  Seems like it would be possible to reduce the amount of synchronization instead */
+/*  potentially using worker number to unique-ify the tuple number */
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
 	pg_atomic_uint32 ntuples;
-			  //TODO:does this belong elsewhere
+	/* TODO:does this belong elsewhere */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -631,8 +635,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
-/*  TODO: fix signedness */
-int
+uint32
 sts_increment_tuplenum(SharedTuplestoreAccessor *accessor)
 {
 	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
@@ -719,22 +722,22 @@ sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor)
 	BufFile    *combined_bitmap_file = BufFileCreateTemp(false);
 
 	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)
-		//make it while not
-			EOF
-		{
-			unsigned char combined_byte = 0;
-
-			for (int i = 0; i < statuses_length; i++)
-			{
-				unsigned char read_byte;
+		/* make it while not */
+		EOF
+	{
+		unsigned char combined_byte = 0;
 
-				BufFileRead(statuses[i], &read_byte, 1);
-				combined_byte |= read_byte;
-			}
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
 
-			BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
 		}
 
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
 	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
 		ereport(ERROR,
 				(errcode_for_file_access(),
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3e4f4bd574..e5a00f84e3 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -163,8 +163,8 @@ typedef struct ParallelHashJoinBatch
 	 * and does not require a lock to read
 	 */
 	bool		parallel_hashloop_fallback;
-	int			total_num_chunks;
-	int			current_chunk_num;
+	int			total_chunks;
+	int			current_chunk;
 	size_t		estimated_chunk_size;
 	Barrier		chunk_barrier;
 	LWLock		lock;
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 6152ac163d..8b2433e5c4 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -24,17 +24,15 @@ struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
 struct tupleMetadata;
 typedef struct tupleMetadata tupleMetadata;
-
-/*  TODO: conflicting types for tupleid with accessor->sts->ntuples (uint32) */
-/*  TODO: use a union for tupleid (uint32) (make this a uint64) and chunk number (int) */
 struct tupleMetadata
 {
 	uint32		hashvalue;
-	int			tupleid;		/* tuple id on outer side and chunk number for
-								 * inner side */
-}			__attribute__((packed));
-
-/*  TODO: make sure I can get rid of packed now that using sizeof(struct) */
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			chunk;		/* chunk number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -72,7 +70,7 @@ extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
 
-extern int	sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
 
 extern void sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor);
 extern void sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum);
-- 
2.20.1 (Apple Git-117)

v5-0003-Address-barrier-wait-deadlock-hazard.patchapplication/octet-stream; name=v5-0003-Address-barrier-wait-deadlock-hazard.patchDownload

From 44ab42cb839660b94a26eced5f5f9366a65c7289 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 11 Jan 2020 16:57:34 -0800
Subject: [PATCH v5 3/5] Address barrier wait deadlock hazard

Previously, in the chunk phase machine in
ExecParallelHashJoinNewChunk(), we reused one chunk barrier for all
chunks and looped through the phases then set the phase back to the
initial phase and explicitly jumped there.
Now, we initialize an array of chunk barriers, one per chunk, and use a
new chunk barrier for each chunk.
After finishing probing a chunk, upon re-entering
ExecParallelHashJoinNewChunk(), workers will wait on the chunk barrier
for all participants to arrive.
This is okay because the barrier is advanced to the final phase as part
of this wait (per the comment in nodeHashJoin.c about deadlock risk with
waiting on barriers after emitting tuples).
The last worker to arrive will increment the chunk number.
All workers detach from the chunk barrier they are attached to and
select the next chunk barrier.

The hashtable is now reset in the first phase of the chunk phase machine
PHJ_CHUNK_ELECTING. Note that this will cause an unnecessary hashtable
reset for the first chunk.

The loading and probing phases of the chunk phase machine stay the same.

If a worker joins in the PHJ_CHUNK_DONE phase, it will simply detach
from the chunk barrier and move on to the next chunk barrier in the
array of chunk barriers.

In order to mitigate the other cause of deadlock hazard (workers wait on
the batch barrier after emitting tuples), now, in
ExecParallelHashJoinNewBatch(), if we are attached to a batch barrier
and it is a fallback batch, all workers will detach from the batch
barrier and then end their scan of that batch.
The last worker to detach will combine the outer match status files,
then it will detach from the batch, clean up the hashtable, and end its
scan of the inner side.
Then it will return and proceed to emit unmatched outer tuples.
In PHJ_BATCH_ELECTING, the worker that ends up allocating the hashtable
will also initialize the chunk barriers.

Also, this commit moves combined_bitmap to batch from hjstate. It will
be moved into the SharedBits store once that API is added.

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
---
 src/backend/executor/adaptiveHashjoin.c | 368 +++++++++++++++---------
 src/backend/executor/nodeHash.c         |  40 +--
 src/backend/executor/nodeHashjoin.c     | 116 +++++---
 src/backend/postmaster/pgstat.c         |   3 +
 src/include/executor/adaptiveHashjoin.h |   1 -
 src/include/executor/hashjoin.h         |  15 +-
 src/include/executor/nodeHash.h         |   1 +
 src/include/nodes/execnodes.h           |   1 -
 src/include/pgstat.h                    |   1 +
 9 files changed, 331 insertions(+), 215 deletions(-)

diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
index 64af2a24f3..20678ad9ff 100644
--- a/src/backend/executor/adaptiveHashjoin.c
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -24,7 +24,9 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 	ParallelHashJoinBatch *phj_batch;
 	SharedTuplestoreAccessor *outer_tuples;
 	SharedTuplestoreAccessor *inner_tuples;
+	Barrier    *barriers;
 	Barrier    *chunk_barrier;
+	Barrier    *old_chunk_barrier;
 
 	hashtable = hjstate->hj_HashTable;
 	batchno = hashtable->curbatch;
@@ -33,10 +35,13 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 	inner_tuples = hashtable->batches[batchno].inner_tuples;
 
 	/*
-	 * This chunk_barrier is initialized in the ELECTING phase when this
+	 * These chunk_barriers are initialized in the ELECTING phase when this
 	 * worker attached to the batch in ExecParallelHashJoinNewBatch()
 	 */
-	chunk_barrier = &hashtable->batches[batchno].shared->chunk_barrier;
+	barriers = dsa_get_address(hashtable->area, hashtable->batches[batchno].shared->chunk_barriers);
+	LWLockAcquire(&phj_batch->lock, LW_SHARED);
+	old_chunk_barrier = &(barriers[phj_batch->current_chunk - 1]);
+	LWLockRelease(&phj_batch->lock);
 
 	/*
 	 * If this worker just came from probing (from HJ_SCAN_BUCKET) we need to
@@ -49,14 +54,21 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 		 * The current chunk number can't be incremented if *any* worker isn't
 		 * done yet (otherwise they might access the wrong data structure!)
 		 */
-		if (BarrierArriveAndWait(chunk_barrier,
+		if (BarrierArriveAndWait(old_chunk_barrier,
 								 WAIT_EVENT_HASH_CHUNK_PROBING))
 			phj_batch->current_chunk++;
-
+		BarrierDetach(old_chunk_barrier);
 		/* Once the barrier is advanced we'll be in the DONE phase */
 	}
-	else
-		BarrierAttach(chunk_barrier);
+	/* TODO: definitely seems like a race condition around value of current_chunk */
+	LWLockAcquire(&phj_batch->lock, LW_SHARED);
+	if (phj_batch->current_chunk > phj_batch->total_chunks)
+	{
+		LWLockRelease(&phj_batch->lock);
+		return false;
+	}
+	chunk_barrier = &(barriers[phj_batch->current_chunk - 1]);
+	LWLockRelease(&phj_batch->lock);
 
 	/*
 	 * The outer side is exhausted and either 1) the current chunk of the
@@ -64,105 +76,190 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 	 * chunk of the inner side is exhausted and it is time to advance the
 	 * batch
 	 */
-	switch (BarrierPhase(chunk_barrier))
+
+	for (;;)
 	{
-			/*
-			 * TODO: remove this phase and coordinate access to hashtable
-			 * above goto and after incrementing current_chunk
-			 */
-		case PHJ_CHUNK_ELECTING:
-	phj_chunk_electing:
-			BarrierArriveAndWait(chunk_barrier,
-								 WAIT_EVENT_HASH_CHUNK_ELECTING);
-			/* Fall through. */
+		switch (BarrierAttach(chunk_barrier))
+		{
+			case PHJ_CHUNK_ELECTING:
+				if (BarrierArriveAndWait(chunk_barrier,
+										 WAIT_EVENT_HASH_CHUNK_ELECTING))
+				{
 
-		case PHJ_CHUNK_LOADING:
-			/* Start (or join in) loading the next chunk of inner tuples. */
-			sts_begin_parallel_scan(inner_tuples);
+					sts_reinitialize(outer_tuples);
 
-			MinimalTuple tuple;
-			tupleMetadata metadata;
+					/*
+					 * reset inner's hashtable and recycle the existing bucket
+					 * array.
+					 * TODO: this will unnecessarily reset the hashtable for the
+					 * first stripe. fix this?
+					 */
+					dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
+					dsa_get_address(hashtable->area, phj_batch->buckets);
 
-			while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
-			{
-				if (metadata.chunk != phj_batch->current_chunk)
-					continue;
+					for (size_t i = 0; i < hashtable->nbuckets; ++i)
+						dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
 
-				ExecForceStoreMinimalTuple(tuple,
-										   hjstate->hj_HashTupleSlot,
-										   false);
+					/*
+					 * TODO: this will unfortunately rescan all inner tuples
+					 * in the batch for each chunk
+					 */
 
-				ExecParallelHashTableInsertCurrentBatch(
-														hashtable,
-														hjstate->hj_HashTupleSlot,
-														metadata.hashvalue);
-			}
-			sts_end_parallel_scan(inner_tuples);
-			BarrierArriveAndWait(chunk_barrier,
-								 WAIT_EVENT_HASH_CHUNK_LOADING);
-			/* Fall through. */
+					/*
+					 * should be able to save the block in the file which
+					 * starts the next chunk instead
+					 */
+					sts_reinitialize(inner_tuples);
+				}
+				/* Fall through. */
+			case PHJ_CHUNK_RESETTING:
+				BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_RESETTING);
+			case PHJ_CHUNK_LOADING:
+				/* Start (or join in) loading the next chunk of inner tuples. */
+				sts_begin_parallel_scan(inner_tuples);
+
+				MinimalTuple tuple;
+				tupleMetadata metadata;
+
+				while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+				{
+					if (metadata.chunk != phj_batch->current_chunk)
+						continue;
+
+					ExecForceStoreMinimalTuple(tuple,
+											   hjstate->hj_HashTupleSlot,
+											   false);
+
+					ExecParallelHashTableInsertCurrentBatch(
+															hashtable,
+															hjstate->hj_HashTupleSlot,
+															metadata.hashvalue);
+				}
+				sts_end_parallel_scan(inner_tuples);
+				BarrierArriveAndWait(chunk_barrier,
+									 WAIT_EVENT_HASH_CHUNK_LOADING);
+				/* Fall through. */
+
+			case PHJ_CHUNK_PROBING:
+				/*
+				 * TODO: Is it a race condition where a worker enters here
+				 * and starts probing before the hashtable is fully loaded?
+				 */
+				sts_begin_parallel_scan(outer_tuples);
+				return true;
 
-		case PHJ_CHUNK_PROBING:
-			sts_begin_parallel_scan(outer_tuples);
-			return true;
+			case PHJ_CHUNK_DONE:
+				LWLockAcquire(&phj_batch->lock, LW_SHARED);
+				if (phj_batch->current_chunk > phj_batch->total_chunks)
+				{
+					LWLockRelease(&phj_batch->lock);
+					return false;
+				}
+				LWLockRelease(&phj_batch->lock);
+				/* TODO: exercise this somehow (ideally, in a test) */
+				BarrierDetach(chunk_barrier);
+				if (chunk_barrier < barriers + phj_batch->total_chunks)
+				{
+					++chunk_barrier;
+					continue;
+				}
+				else
+					return false;
 
-		case PHJ_CHUNK_DONE:
+			default:
+				elog(ERROR, "unexpected chunk phase %d. pid %i. batch %i.",
+					 BarrierPhase(chunk_barrier), MyProcPid, batchno);
+		}
+	}
 
-			BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_DONE);
+	return false;
+}
 
-			if (phj_batch->current_chunk > phj_batch->total_chunks)
-			{
-				BarrierDetach(chunk_barrier);
-				return false;
-			}
 
-			/*
-			 * Otherwise it is time for the next chunk. One worker should
-			 * reset the hashtable
-			 */
-			if (BarrierArriveExplicitAndWait(chunk_barrier, PHJ_CHUNK_ELECTING, WAIT_EVENT_HASH_ADVANCE_CHUNK))
-			{
-				/*
-				 * rewind/reset outer tuplestore and rewind outer match status
-				 * files
-				 */
-				sts_reinitialize(outer_tuples);
+static void
+ExecHashTableLoopDetachBatchForChosen(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state != NULL &&
+		hashtable->curbatch >= 0)
+	{
+		int			curbatch = hashtable->curbatch;
+		ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
 
-				/*
-				 * reset inner's hashtable and recycle the existing bucket
-				 * array.
-				 */
-				dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
-				dsa_get_address(hashtable->area, phj_batch->buckets);
+		/* Make sure any temporary files are closed. */
+		sts_end_parallel_scan(hashtable->batches[curbatch].inner_tuples);
 
-				for (size_t i = 0; i < hashtable->nbuckets; ++i)
-					dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+		/* Detach from the batch we were last working on. */
 
-				/*
-				 * TODO: this will unfortunately rescan all inner tuples in
-				 * the batch for each chunk
-				 */
+		/*
+		 * Technically we shouldn't access the barrier because we're no longer
+		 * attached, but since there is no way it's moving after this point it
+		 * seems safe to make the following assertion.
+		 */
+		Assert(BarrierPhase(&batch->batch_barrier) == PHJ_BATCH_DONE);
 
-				/*
-				 * should be able to save the block in the file which starts
-				 * the next chunk instead
-				 */
-				sts_reinitialize(inner_tuples);
-			}
-			goto phj_chunk_electing;
+		/* Free shared chunks and buckets. */
+		while (DsaPointerIsValid(batch->chunks))
+		{
+			HashMemoryChunk chunk =
+			dsa_get_address(hashtable->area, batch->chunks);
+			dsa_pointer next = chunk->next.shared;
 
-		case PHJ_CHUNK_FINAL:
-			BarrierDetach(chunk_barrier);
-			return false;
+			dsa_free(hashtable->area, batch->chunks);
+			batch->chunks = next;
+		}
+		if (DsaPointerIsValid(batch->buckets))
+		{
+			dsa_free(hashtable->area, batch->buckets);
+			batch->buckets = InvalidDsaPointer;
+		}
 
-		default:
-			elog(ERROR, "unexpected chunk phase %d. pid %i. batch %i.",
-				 BarrierPhase(chunk_barrier), MyProcPid, batchno);
-	}
+		/*
+		 * Free chunk barrier
+		 */
+		/* TODO: why is this NULL check needed? */
+		if (DsaPointerIsValid(batch->chunk_barriers))
+		{
+			dsa_free(hashtable->area, batch->chunk_barriers);
+			batch->chunk_barriers = InvalidDsaPointer;
+		}
 
-	return false;
+		/*
+		 * Track the largest batch we've been attached to.  Though each
+		 * backend might see a different subset of batches, explain.c will
+		 * scan the results from all backends to find the largest value.
+		 */
+		hashtable->spacePeak =
+			Max(hashtable->spacePeak,
+				batch->size + sizeof(dsa_pointer_atomic) * hashtable->nbuckets);
+
+	}
 }
 
+static void
+ExecHashTableLoopDetachBatchForOthers(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state != NULL &&
+		hashtable->curbatch >= 0)
+	{
+		int			curbatch = hashtable->curbatch;
+		ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+
+		sts_end_parallel_scan(hashtable->batches[curbatch].inner_tuples);
+		sts_end_parallel_scan(hashtable->batches[curbatch].outer_tuples);
+
+		/*
+		 * Track the largest batch we've been attached to.  Though each
+		 * backend might see a different subset of batches, explain.c will
+		 * scan the results from all backends to find the largest value.
+		 */
+		hashtable->spacePeak =
+			Max(hashtable->spacePeak,
+				batch->size + sizeof(dsa_pointer_atomic) * hashtable->nbuckets);
+
+		/* Remember that we are not attached to a batch. */
+		hashtable->curbatch = -1;
+	}
+}
 
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
@@ -183,18 +280,6 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	if (hashtable->batches == NULL)
 		return false;
 
-	/*
-	 * For hashloop fallback only Only the elected worker who was chosen to
-	 * combine the outer match status bitmaps should reach here. This worker
-	 * must do some final cleanup and then detach from the batch
-	 */
-	if (hjstate->combined_bitmap != NULL)
-	{
-		BufFileClose(hjstate->combined_bitmap);
-		hjstate->combined_bitmap = NULL;
-		hashtable->batches[hashtable->curbatch].done = true;
-		ExecHashTableDetachBatch(hashtable);
-	}
 
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
@@ -211,41 +296,62 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		ParallelHashJoinBatchAccessor *accessor = hashtable->batches + hashtable->curbatch;
 		ParallelHashJoinBatch *batch = accessor->shared;
 
-		/*
-		 * End the parallel scan on the outer tuples before we arrive at the
-		 * next barrier so that the last worker to arrive at that barrier can
-		 * reinitialize the SharedTuplestore for another parallel scan.
-		 */
-
 		if (!batch->parallel_hashloop_fallback)
-			BarrierArriveAndWait(&batch->batch_barrier,
-								 WAIT_EVENT_HASH_BATCH_PROBING);
+		{
+			hashtable->batches[hashtable->curbatch].done = true;
+			ExecHashTableDetachBatch(hashtable);
+		}
+
+		else if (accessor->combined_bitmap != NULL)
+		{
+			BufFileClose(accessor->combined_bitmap);
+			accessor->combined_bitmap = NULL;
+			accessor->done = true;
+
+			/*
+			 * though we have already de-commissioned the shared area of the
+			 * hashtable the curbatch is backend-local and should still be
+			 * valid
+			 */
+			sts_end_parallel_scan(hashtable->batches[hashtable->curbatch].outer_tuples);
+			hashtable->curbatch = -1;
+		}
+
 		else
 		{
 			sts_close_outer_match_status_file(accessor->outer_tuples);
 
 			/*
 			 * If all workers (including this one) have finished probing the
-			 * batch, one worker is elected to Combine all the outer match
-			 * status files from the workers who were attached to this batch
-			 * Loop through the outer match status files from all workers that
-			 * were attached to this batch Combine them into one bitmap Use
-			 * the bitmap, loop through the outer batch file again, and emit
-			 * unmatched tuples
+			 * batch, one worker is elected to Loop through the outer match
+			 * status files from all workers that were attached to this batch
+			 * Combine them into one bitmap Use the bitmap, loop through the
+			 * outer batch file again, and emit unmatched tuples All workers
+			 * will detach from the batch barrier and the last worker will
+			 * clean up the hashtable. All workers except the last worker will
+			 * end their scans of the outer and inner side The last worker
+			 * will end its scan of the inner side
 			 */
-
-			if (BarrierArriveAndWait(&batch->batch_barrier,
-									 WAIT_EVENT_HASH_BATCH_PROBING))
+			if (BarrierArriveAndDetach(&batch->batch_barrier))
 			{
-				hjstate->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				/*
+				 * For hashloop fallback only Only the elected worker who was
+				 * chosen to combine the outer match status bitmaps should
+				 * reach here. This worker must do some final cleanup and then
+				 * detach from the batch
+				 */
+				accessor->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				ExecHashTableLoopDetachBatchForChosen(hashtable);
 				hjstate->last_worker = true;
 				return true;
 			}
+			/* the elected combining worker should not reach here */
+			else
+			{
+				hashtable->batches[hashtable->curbatch].done = true;
+				ExecHashTableLoopDetachBatchForOthers(hashtable);
+			}
 		}
-
-		/* the elected combining worker should not reach here */
-		hashtable->batches[hashtable->curbatch].done = true;
-		ExecHashTableDetachBatch(hashtable);
 	}
 
 	/*
@@ -272,11 +378,16 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 											 WAIT_EVENT_HASH_BATCH_ELECTING))
 					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
-						Barrier    *chunk_barrier =
-						&hashtable->batches[batchno].shared->chunk_barrier;
+						ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+						phj_batch->chunk_barriers = dsa_allocate(hashtable->area, phj_batch->total_chunks * sizeof(Barrier));
+						Barrier    *barriers = dsa_get_address(hashtable->area, phj_batch->chunk_barriers);
 
-						BarrierInit(chunk_barrier, 0);
-						hashtable->batches[batchno].shared->current_chunk = 1;
+						for (int i = 0; i < phj_batch->total_chunks; i++)
+						{
+							BarrierInit(&(barriers[i]), 0);
+						}
+						phj_batch->current_chunk = 1;
 					}
 					/* Fall through. */
 
@@ -314,17 +425,6 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 
 					return true;
 
-				case PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING:
-
-					/*
-					 * The batch isn't done but this worker can't contribute
-					 * anything to it so it might as well be done from this
-					 * worker's perspective. (Only one worker can do work in
-					 * this phase).
-					 */
-
-					/* Fall through. */
-
 				case PHJ_BATCH_DONE:
 
 					/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index cb2f95ac0a..86f3aaff82 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1225,7 +1225,19 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				ExecParallelHashEnsureBatchAccessors(hashtable);
 				ExecParallelHashTableSetCurrentBatch(hashtable, 0);
 
-				LWLockAcquire(&pstate->lock, LW_EXCLUSIVE);
+				/*
+				 * Currently adaptive hashjoin keeps track of the global
+				 * (HashJoin global) number of increases to nbatches.
+				 * If the number of increases exceeds a fixed amount,
+				 * any subsequent batch exceeding the space_allowed
+				 * (one which triggers disabling growth in nbatches)
+				 * afterward will have parallel_hashloop_fallback set
+				 *
+				 * Note that stripe numbers were already being added to tuples
+				 * going into the STS, so, even batches that do not fall back
+				 * might have more than one stripe.
+				 */
+				LWLockAcquire(&pstate->lock, LW_SHARED);
 				if (pstate->batch_increases >= 2)
 					excessive_batch_num_increases = true;
 				LWLockRelease(&pstate->lock);
@@ -1242,33 +1254,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 
 						space_exhausted = true;
 
-						/*
-						 * only once we've increased the number of batches
-						 * overall many times should we start setting
-						 */
-
-						/*
-						 * some batches to use the fallback strategy. Those
-						 * that are still too big will have this option set
-						 */
-
 						/*
 						 * we better not repartition again (growth should be
-						 * disabled), so that we don't overwrite this value
-						 */
-
-						/*
-						 * this tells us if we have set fallback to true or
-						 * not and how many chunks -- useful for seeing how
-						 * many chunks
-						 */
-
-						/*
-						 * we can get to before setting it to true (since we
-						 * still mark chunks (work_mem sized chunks)) in
-						 * batches even if we don't fall back
+						 * disabled), so that we don't overwrite this flag
 						 */
-						/* same for below but opposite */
 						if (excessive_batch_num_increases == true)
 							batch->parallel_hashloop_fallback = true;
 
@@ -3173,6 +3162,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 		accessor->shared = shared;
 		accessor->preallocated = 0;
 		accessor->done = false;
+		accessor->combined_bitmap = NULL;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 91c0170e40..ca09012d17 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -905,9 +905,10 @@ ExecParallelHashJoin(PlanState *pstate)
 				 */
 
 				ParallelHashJoinBatch *phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
-
+				LWLockAcquire(&phj_batch->lock, LW_SHARED);
 				if (!phj_batch->parallel_hashloop_fallback || phj_batch->current_chunk == 1)
 					node->hj_MatchedOuter = false;
+				LWLockRelease(&phj_batch->lock);
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
 				/* FALL THRU */
@@ -1029,13 +1030,12 @@ ExecParallelHashJoin(PlanState *pstate)
 
 			case HJ_NEED_NEW_BATCH:
 
-				phj_batch = hashtable->batches[hashtable->curbatch].shared;
-
 				/*
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
 				if (!ExecParallelHashJoinNewBatch(node))
 					return NULL;	/* end of parallel-aware join */
+				phj_batch = hashtable->batches[hashtable->curbatch].shared;
 
 				if (node->last_worker
 					&& HJ_FILL_OUTER(node) && phj_batch->parallel_hashloop_fallback)
@@ -1073,6 +1073,11 @@ ExecParallelHashJoin(PlanState *pstate)
 			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
 
 				outer_acc = hashtable->batches[hashtable->curbatch].outer_tuples;
+				/*
+				 * This should only ever be called by one worker.
+				 * It is not protected by a barrier explicitly here. However,
+				 * more than one worker should never enter this state for a batch
+				 */
 				sts_reinitialize(outer_acc);
 				sts_begin_parallel_scan(outer_acc);
 
@@ -1080,57 +1085,75 @@ ExecParallelHashJoin(PlanState *pstate)
 				/* FALL THRU */
 
 			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
-
-				Assert(node->combined_bitmap != NULL);
-
-				outer_acc = node->hj_HashTable->batches[node->hj_HashTable->curbatch].outer_tuples;
-
-				MinimalTuple tuple;
-
-				do
 				{
-					tupleMetadata metadata;
+					ParallelHashJoinBatchAccessor *batch_accessor =
+					&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
 
-					if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
-						break;
+					Assert(batch_accessor->combined_bitmap != NULL);
 
-					uint32		bytenum = metadata.tupleid / 8;
-					unsigned char bit = metadata.tupleid % 8;
-					unsigned char byte_to_check = 0;
-
-					/* seek to byte to check */
-					if (BufFileSeek(node->combined_bitmap, 0, bytenum, SEEK_SET))
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not rewind shared outer temporary file: %m")));
-					/* read byte containing ntuple bit */
-					if (BufFileRead(node->combined_bitmap, &byte_to_check, 1) == 0)
-						ereport(ERROR,
-								(errcode_for_file_access(),
-								 errmsg("could not read byte in outer match status bitmap: %m.")));
-					/* if bit is set */
-					bool		match = ((byte_to_check) >> bit) & 1;
-
-					if (!match)
-						break;
-				} while (1);
+					/*
+					 * TODO: there should be a way to know the current batch
+					 * for the purposes of getting the outer tuplestore without
+					 * needing curbatch from the hashtable so we can detach
+					 * from the batch (ExecHashTableDetachBatch)
+					 */
+					outer_acc =
+						batch_accessor->outer_tuples;
+					MinimalTuple tuple;
 
-				if (tuple == NULL)
-				{
-					sts_end_parallel_scan(outer_acc);
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					break;
-				}
+					do
+					{
+						tupleMetadata metadata;
+
+						if ((tuple =
+							 sts_parallel_scan_next(outer_acc, &metadata)) ==
+							NULL)
+							break;
+
+						uint32		bytenum = metadata.tupleid / 8;
+						unsigned char bit = metadata.tupleid % 8;
+						unsigned char byte_to_check = 0;
+
+						/* seek to byte to check */
+						if (BufFileSeek(batch_accessor->combined_bitmap,
+										0,
+										bytenum,
+										SEEK_SET))
+							ereport(ERROR,
+									(errcode_for_file_access(),
+									 errmsg(
+											"could not rewind shared outer temporary file: %m")));
+						/* read byte containing ntuple bit */
+						if (BufFileRead(batch_accessor->combined_bitmap, &byte_to_check, 1) ==
+							0)
+							ereport(ERROR,
+									(errcode_for_file_access(),
+									 errmsg(
+											"could not read byte in outer match status bitmap: %m.")));
+						/* if bit is set */
+						bool		match = ((byte_to_check) >> bit) & 1;
+
+						if (!match)
+							break;
+					}
+					while (1);
 
-				/* Emit the unmatched tuple */
-				ExecForceStoreMinimalTuple(tuple,
-										   econtext->ecxt_outertuple,
-										   false);
-				econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					if (tuple == NULL)
+					{
+						sts_end_parallel_scan(outer_acc);
+						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						break;
+					}
 
-				return ExecProject(node->js.ps.ps_ProjInfo);
+					/* Emit the unmatched tuple */
+					ExecForceStoreMinimalTuple(tuple,
+											   econtext->ecxt_outertuple,
+											   false);
+					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
 
+					return ExecProject(node->js.ps.ps_ProjInfo);
 
+				}
 			default:
 				elog(ERROR, "unrecognized hashjoin state: %d",
 					 (int) node->hj_JoinState);
@@ -1182,7 +1205,6 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_InnerExhausted = false;
 
 	hjstate->last_worker = false;
-	hjstate->combined_bitmap = NULL;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index eeddf0009c..8aced92b31 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3812,6 +3812,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_CHUNK_ELECTING:
 			event_name = "Hash/Chunk/Electing";
 			break;
+		case WAIT_EVENT_HASH_CHUNK_RESETTING:
+			event_name = "Hash/Chunk/Resetting";
+			break;
 		case WAIT_EVENT_HASH_CHUNK_LOADING:
 			event_name = "Hash/Chunk/Loading";
 			break;
diff --git a/src/include/executor/adaptiveHashjoin.h b/src/include/executor/adaptiveHashjoin.h
index 030a04c5c0..135aed0b19 100644
--- a/src/include/executor/adaptiveHashjoin.h
+++ b/src/include/executor/adaptiveHashjoin.h
@@ -5,5 +5,4 @@
 extern bool ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing);
 extern bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
 
-
 #endif							/* ADAPTIVE_HASHJOIN_H */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index e5a00f84e3..b2cc12dc19 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -166,7 +166,7 @@ typedef struct ParallelHashJoinBatch
 	int			total_chunks;
 	int			current_chunk;
 	size_t		estimated_chunk_size;
-	Barrier		chunk_barrier;
+	dsa_pointer chunk_barriers;
 	LWLock		lock;
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
@@ -221,6 +221,7 @@ typedef struct ParallelHashJoinBatchAccessor
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
 	bool		done;			/* flag to remember that a batch is done */
+	BufFile    *combined_bitmap;	/* for Adaptive Hashjoin only  */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
 } ParallelHashJoinBatchAccessor;
@@ -282,14 +283,14 @@ typedef struct ParallelHashJoinState
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
 #define PHJ_BATCH_CHUNKING				2
-#define PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING 3
-#define PHJ_BATCH_DONE					4
+#define PHJ_BATCH_DONE					3
 
 #define PHJ_CHUNK_ELECTING				0
-#define PHJ_CHUNK_LOADING				1
-#define PHJ_CHUNK_PROBING				2
-#define PHJ_CHUNK_DONE					3
-#define PHJ_CHUNK_FINAL					4
+#define PHJ_CHUNK_RESETTING				1
+#define PHJ_CHUNK_LOADING				2
+#define PHJ_CHUNK_PROBING				3
+#define PHJ_CHUNK_DONE					4
+#define PHJ_CHUNK_FINAL					5
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index dfc221e6a1..f6d5b47708 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -31,6 +31,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 93fe6dddb2..32f0dd8cfe 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1968,7 +1968,6 @@ typedef struct HashJoinState
 
 	/* parallel hashloop fallback outer side */
 	bool		last_worker;
-	BufFile    *combined_bitmap;
 } HashJoinState;
 
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 340086a7e7..dd2e8bd655 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -843,6 +843,7 @@ typedef enum
 	WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
 	WAIT_EVENT_HASH_CHUNK_ELECTING,
+	WAIT_EVENT_HASH_CHUNK_RESETTING,
 	WAIT_EVENT_HASH_CHUNK_LOADING,
 	WAIT_EVENT_HASH_CHUNK_PROBING,
 	WAIT_EVENT_HASH_CHUNK_DONE,
-- 
2.20.1 (Apple Git-117)

v5-0004-Add-SharedBits-API.patchapplication/octet-stream; name=v5-0004-Add-SharedBits-API.patchDownload

From bff9e7bb6f6613c37cf5fe1b12b9186e62493c23 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Jan 2020 11:17:49 -0800
Subject: [PATCH v5 4/5] Add SharedBits API

Add SharedBits API--a way for workers to collaboratively make a bitmap.
The SharedBits store is currently meant for each backend to write to its
own bitmap file in one phase and for a single worker to combine all of
the bitmaps into a combined bitmap in another phase. In other words, it
supports parallel write but not parallel scan (and not concurrent
read/write). This could be modified in the future.

Also, the SharedBits uses a SharedFileset which uses BufFiles. This is
not the ideal API for the bitmap. The access pattern is small sequential
writes and random reads. It would also be nice to maintain the fixed
size buffer but have an API that let us write an arbitrary number of
bytes to it in bufsize chunks without incurring additional function call
overhead.

This commit also moves the outer match status file and combined_bitmap
into a new SharedBits store.

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
---
 src/backend/executor/adaptiveHashjoin.c   |  18 +-
 src/backend/executor/nodeHash.c           |   8 +-
 src/backend/executor/nodeHashjoin.c       | 183 +++++++-------
 src/backend/storage/file/buffile.c        |  51 ----
 src/backend/utils/sort/Makefile           |   1 +
 src/backend/utils/sort/sharedbits.c       | 276 ++++++++++++++++++++++
 src/backend/utils/sort/sharedtuplestore.c | 122 +---------
 src/include/executor/hashjoin.h           |  13 +-
 src/include/storage/buffile.h             |   1 -
 src/include/utils/sharedbits.h            |  40 ++++
 src/include/utils/sharedtuplestore.h      |   7 +-
 11 files changed, 430 insertions(+), 290 deletions(-)
 create mode 100644 src/backend/utils/sort/sharedbits.c
 create mode 100644 src/include/utils/sharedbits.h

diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
index 20678ad9ff..696bfc1c79 100644
--- a/src/backend/executor/adaptiveHashjoin.c
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -13,9 +13,6 @@
 
 #include "executor/adaptiveHashjoin.h"
 
-
-
-
 bool
 ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 {
@@ -302,10 +299,9 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 			ExecHashTableDetachBatch(hashtable);
 		}
 
-		else if (accessor->combined_bitmap != NULL)
+		else if (sb_combined_exists(accessor->sba))
 		{
-			BufFileClose(accessor->combined_bitmap);
-			accessor->combined_bitmap = NULL;
+			sb_end_read(accessor->sba);
 			accessor->done = true;
 
 			/*
@@ -319,7 +315,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 
 		else
 		{
-			sts_close_outer_match_status_file(accessor->outer_tuples);
+			sb_end_write(accessor->sba);
 
 			/*
 			 * If all workers (including this one) have finished probing the
@@ -340,7 +336,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 				 * reach here. This worker must do some final cleanup and then
 				 * detach from the batch
 				 */
-				accessor->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				sb_combine(accessor->sba);
 				ExecHashTableLoopDetachBatchForChosen(hashtable);
 				hjstate->last_worker = true;
 				return true;
@@ -421,7 +417,11 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * to by this worker and readable by any worker
 					 */
 					if (hashtable->batches[batchno].shared->parallel_hashloop_fallback)
-						sts_make_outer_match_status_file(hashtable->batches[batchno].outer_tuples);
+					{
+						ParallelHashJoinBatchAccessor *accessor = hashtable->batches + hashtable->curbatch;
+
+						sb_initialize_accessor(accessor->sba, sts_get_tuplenum(accessor->outer_tuples));
+					}
 
 					return true;
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 86f3aaff82..0e41dffe47 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -3041,7 +3041,9 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
 
 		shared->parallel_hashloop_fallback = false;
 		LWLockInitialize(&shared->lock,
@@ -3087,6 +3089,9 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->sbfileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3158,11 +3163,11 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
 		accessor->done = false;
-		accessor->combined_bitmap = NULL;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
@@ -3172,6 +3177,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->sbfileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index ca09012d17..acebe93b21 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -914,93 +914,98 @@ ExecParallelHashJoin(PlanState *pstate)
 				/* FALL THRU */
 
 			case HJ_SCAN_BUCKET:
-
-				/*
-				 * Scan the selected hash bucket for matches to current outer
-				 */
-				phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
-
-				if (!ExecParallelScanHashBucket(node, econtext))
 				{
 					/*
-					 * The current outer tuple has run out of matches, so
-					 * check whether to emit a dummy outer-join tuple. Whether
-					 * we emit one or not, the next state is NEED_NEW_OUTER.
+					 * Scan the selected hash bucket for matches to current
+					 * outer
 					 */
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
-					if (!phj_batch->parallel_hashloop_fallback)
+					ParallelHashJoinBatchAccessor *accessor =
+					&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+					phj_batch = accessor->shared;
+
+					if (!ExecParallelScanHashBucket(node, econtext))
 					{
-						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+						/*
+						 * The current outer tuple has run out of matches, so
+						 * check whether to emit a dummy outer-join tuple.
+						 * Whether we emit one or not, the next state is
+						 * NEED_NEW_OUTER.
+						 */
+						node->hj_JoinState = HJ_NEED_NEW_OUTER;
+						if (!phj_batch->parallel_hashloop_fallback)
+						{
+							TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
 
-						if (slot != NULL)
-							return slot;
+							if (slot != NULL)
+								return slot;
+						}
+						continue;
 					}
-					continue;
-				}
 
-				/*
-				 * We've got a match, but still need to test non-hashed quals.
-				 * ExecScanHashBucket already set up all the state needed to
-				 * call ExecQual.
-				 *
-				 * If we pass the qual, then save state for next call and have
-				 * ExecProject form the projection, store it in the tuple
-				 * table, and return the slot.
-				 *
-				 * Only the joinquals determine tuple match status, but all
-				 * quals must pass to actually return the tuple.
-				 */
-				if (joinqual != NULL && !ExecQual(joinqual, econtext))
-				{
-					InstrCountFiltered1(node, 1);
-					break;
-				}
+					/*
+					 * We've got a match, but still need to test non-hashed
+					 * quals. ExecScanHashBucket already set up all the state
+					 * needed to call ExecQual.
+					 *
+					 * If we pass the qual, then save state for next call and
+					 * have ExecProject form the projection, store it in the
+					 * tuple table, and return the slot.
+					 *
+					 * Only the joinquals determine tuple match status, but
+					 * all quals must pass to actually return the tuple.
+					 */
+					if (joinqual != NULL && !ExecQual(joinqual, econtext))
+					{
+						InstrCountFiltered1(node, 1);
+						break;
+					}
 
-				node->hj_MatchedOuter = true;
-				/*
-				 * Full/right outer joins are currently not supported
-				 * for parallel joins, so we don't need to set the
-				 * match bit.  Experiments show that it's worth
-				 * avoiding the shared memory traffic on large
-				 * systems.
-				 */
-				Assert(!HJ_FILL_INNER(node));
+					node->hj_MatchedOuter = true;
+					/*
+					 * Full/right outer joins are currently not supported
+					 * for parallel joins, so we don't need to set the
+					 * match bit.  Experiments show that it's worth
+					 * avoiding the shared memory traffic on large
+					 * systems.
+					 */
+					Assert(!HJ_FILL_INNER(node));
 
-				/*
-				 * TODO: how does this interact with PAHJ -- do I need to set
-				 * matchbit?
-				 */
-				/* In an antijoin, we never return a matched tuple */
-				if (node->js.jointype == JOIN_ANTI)
-				{
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
-					continue;
-				}
+					/*
+					 * TODO: how does this interact with PAHJ -- do I need to
+					 * set matchbit?
+					 */
+					/* In an antijoin, we never return a matched tuple */
+					if (node->js.jointype == JOIN_ANTI)
+					{
+						node->hj_JoinState = HJ_NEED_NEW_OUTER;
+						continue;
+					}
 
-				/*
-				 * If we only need to join to the first matching inner tuple,
-				 * then consider returning this one, but after that continue
-				 * with next outer tuple.
-				 */
-				if (node->js.single_match)
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					/*
+					 * If we only need to join to the first matching inner
+					 * tuple, then consider returning this one, but after that
+					 * continue with next outer tuple.
+					 */
+					if (node->js.single_match)
+						node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
-				/*
-				 * Set the match bit for this outer tuple in the match status
-				 * file
-				 */
-				if (phj_batch->parallel_hashloop_fallback)
-				{
-					sts_set_outer_match_status(hashtable->batches[hashtable->curbatch].outer_tuples,
-											   econtext->ecxt_outertuple->tuplenum);
+					/*
+					 * Set the match bit for this outer tuple in the match
+					 * status file
+					 */
+					if (phj_batch->parallel_hashloop_fallback)
+					{
+						sb_setbit(accessor->sba,
+								  econtext->ecxt_outertuple->tuplenum);
 
+					}
+					if (otherqual == NULL || ExecQual(otherqual, econtext))
+						return ExecProject(node->js.ps.ps_ProjInfo);
+					else
+						InstrCountFiltered2(node, 1);
+					break;
 				}
-				if (otherqual == NULL || ExecQual(otherqual, econtext))
-					return ExecProject(node->js.ps.ps_ProjInfo);
-				else
-					InstrCountFiltered2(node, 1);
-				break;
-
 			case HJ_FILL_INNER_TUPLES:
 
 				/*
@@ -1089,8 +1094,6 @@ ExecParallelHashJoin(PlanState *pstate)
 					ParallelHashJoinBatchAccessor *batch_accessor =
 					&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
 
-					Assert(batch_accessor->combined_bitmap != NULL);
-
 					/*
 					 * TODO: there should be a way to know the current batch
 					 * for the purposes of getting the outer tuplestore without
@@ -1105,33 +1108,10 @@ ExecParallelHashJoin(PlanState *pstate)
 					{
 						tupleMetadata metadata;
 
-						if ((tuple =
-							 sts_parallel_scan_next(outer_acc, &metadata)) ==
-							NULL)
+						if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
 							break;
 
-						uint32		bytenum = metadata.tupleid / 8;
-						unsigned char bit = metadata.tupleid % 8;
-						unsigned char byte_to_check = 0;
-
-						/* seek to byte to check */
-						if (BufFileSeek(batch_accessor->combined_bitmap,
-										0,
-										bytenum,
-										SEEK_SET))
-							ereport(ERROR,
-									(errcode_for_file_access(),
-									 errmsg(
-											"could not rewind shared outer temporary file: %m")));
-						/* read byte containing ntuple bit */
-						if (BufFileRead(batch_accessor->combined_bitmap, &byte_to_check, 1) ==
-							0)
-							ereport(ERROR,
-									(errcode_for_file_access(),
-									 errmsg(
-											"could not read byte in outer match status bitmap: %m.")));
-						/* if bit is set */
-						bool		match = ((byte_to_check) >> bit) & 1;
+						bool		match = sb_checkbit(batch_accessor->sba, metadata.tupleid);
 
 						if (!match)
 							break;
@@ -2003,6 +1983,7 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 
 	/* Set up the space we'll use for shared temporary files. */
 	SharedFileSetInit(&pstate->fileset, pcxt->seg);
+	SharedFileSetInit(&pstate->sbfileset, pcxt->seg);
 
 	/* Initialize the shared state in the hash node. */
 	hashNode = (HashState *) innerPlanState(state);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index cb49329d3f..f0e920b416 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -269,57 +269,6 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
 	return file;
 }
 
-/*
- * Open a shared file created by any backend if it exists, otherwise return NULL
- */
-BufFile *
-BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name)
-{
-	BufFile    *file;
-	char		segment_name[MAXPGPATH];
-	Size		capacity = 16;
-	File	   *files;
-	int			nfiles = 0;
-
-	files = palloc(sizeof(File) * capacity);
-
-	/*
-	 * We don't know how many segments there are, so we'll probe the
-	 * filesystem to find out.
-	 */
-	for (;;)
-	{
-		/* See if we need to expand our file segment array. */
-		if (nfiles + 1 > capacity)
-		{
-			capacity *= 2;
-			files = repalloc(files, sizeof(File) * capacity);
-		}
-		/* Try to load a segment. */
-		SharedSegmentName(segment_name, name, nfiles);
-		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
-		if (files[nfiles] <= 0)
-			break;
-		++nfiles;
-
-		CHECK_FOR_INTERRUPTS();
-	}
-
-	/*
-	 * If we didn't find any files at all, then no BufFile exists with this
-	 * name.
-	 */
-	if (nfiles == 0)
-		return NULL;
-	file = makeBufFileCommon(nfiles);
-	file->files = files;
-	file->readOnly = true;		/* Can't write to files opened this way */
-	file->fileset = fileset;
-	file->name = pstrdup(name);
-
-	return file;
-}
-
 /*
  * Open a file that was previously created in another backend (or this one)
  * with BufFileCreateShared in the same SharedFileSet using the same name.
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261..f11fe85aeb 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	logtape.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedbits.c b/src/backend/utils/sort/sharedbits.c
new file mode 100644
index 0000000000..9d04d6b236
--- /dev/null
+++ b/src/backend/utils/sort/sharedbits.c
@@ -0,0 +1,276 @@
+#include "postgres.h"
+#include "storage/buffile.h"
+#include "utils/sharedbits.h"
+
+/*  TODO: put a comment about not currently supporting parallel scan of the SharedBits */
+
+/* Per-participant shared state */
+struct SharedBitsParticipant
+{
+	bool		present;
+	bool		writing;
+};
+
+/* Shared control object */
+struct SharedBits
+{
+	int			nparticipants;	/* Number of participants that can write. */
+	int64		nbits;
+	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
+
+	SharedBitsParticipant participants[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* backend-local state */
+struct SharedBitsAccessor
+{
+	int			participant;
+	SharedBits *bits;
+	SharedFileSet *fileset;
+	BufFile    *write_file;
+	BufFile    *combined;
+};
+
+SharedBitsAccessor *
+sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset)
+{
+	SharedBitsAccessor *accessor = palloc0(sizeof(SharedBitsAccessor));
+
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+SharedBitsAccessor *
+sb_initialize(SharedBits *sbits,
+			  int participants,
+			  int my_participant_number,
+			  SharedFileSet *fileset,
+			  char *name)
+{
+	SharedBitsAccessor *accessor;
+
+	sbits->nparticipants = participants;
+	strcpy(sbits->name, name);
+	sbits->nbits = 0;			/* TODO: maybe delete this */
+
+	accessor = palloc0(sizeof(SharedBitsAccessor));
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+/*  TODO: is "initialize_accessor" a clear enough API for this? (making the file)? */
+void
+sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits)
+{
+	char		name[MAXPGPATH];
+
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, accessor->participant);
+
+	accessor->write_file =
+		BufFileCreateShared(accessor->fileset, name);
+
+	accessor->bits->participants[accessor->participant].present = true;
+	/* TODO: check this math. tuplenumber will be too high. */
+	uint32		num_to_write = nbits / 8 + 1;
+
+	/*
+	 * TODO: add tests that could exercise a problem with junk being written
+	 * to bitmap
+	 */
+
+	/*
+	 * TODO: is there a better way to write the bytes to the file without
+	 * calling
+	 */
+
+	/*
+	 * BufFileWrite() like this? palloc()ing an undetermined number of bytes
+	 * feels
+	 */
+
+	/*
+	 * like it is against the spirit of this patch to begin with, but the many
+	 * function
+	 */
+	/* calls seem expensive */
+	for (int i = 0; i < num_to_write; i++)
+	{
+		unsigned char byteToWrite = 0;
+
+		BufFileWrite(accessor->write_file, &byteToWrite, 1);
+	}
+
+	if (BufFileSeek(accessor->write_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+size_t
+sb_estimate(int participants)
+{
+	return offsetof(SharedBits, participants) + participants * sizeof(SharedBitsParticipant);
+}
+
+
+void
+sb_setbit(SharedBitsAccessor *accessor, uint64 bit)
+{
+	Assert(accessor->write_file);
+	SharedBitsParticipant *const participant =
+	&accessor->bits->participants[accessor->participant];
+
+	if (!participant->writing)
+		participant->writing = true;
+	unsigned char current_outer_byte;
+
+	BufFileSeek(accessor->write_file, 0, bit / 8, SEEK_SET);
+	BufFileRead(accessor->write_file, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (bit % 8);
+
+	/* TODO: don't seek back one but instead seek explicitly to that byte */
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_CUR);
+	BufFileWrite(accessor->write_file, &current_outer_byte, 1);
+}
+
+bool
+sb_checkbit(SharedBitsAccessor *accessor, uint32 n)
+{
+	Assert(accessor->combined);
+	uint32		bytenum = n / 8;
+	unsigned char bit = n % 8;
+	unsigned char byte_to_check = 0;
+
+	/* seek to byte to check */
+	if (BufFileSeek(accessor->combined,
+					0,
+					bytenum,
+					SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not rewind shared outer temporary file: %m")));
+	/* read byte containing ntuple bit */
+	if (BufFileRead(accessor->combined, &byte_to_check, 1) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	/* if bit is set */
+	bool		match = ((byte_to_check) >> bit) & 1;
+
+	return match;
+}
+
+BufFile *
+sb_combine(SharedBitsAccessor *accessor)
+{
+	/* TODO: this tries to close an outer match status file for */
+	/* each participant in the tuplestore. technically, only participants */
+	/* in the barrier could have outer match status files, however, */
+	/* all but one participant continue on and detach from the barrier */
+	/* so we won't have a reliable way to close only files for those attached */
+	/* to the barrier */
+	int			nbparticipants = 0;
+
+	for (int l = 0; l < accessor->bits->nparticipants; l++)
+	{
+		SharedBitsParticipant participant = accessor->bits->participants[l];
+
+		if (participant.present)
+		{
+			Assert(!participant.writing);
+			nbparticipants++;
+		}
+	}
+	BufFile   **statuses = palloc(sizeof(BufFile *) * nbparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	int			statuses_length = 0;
+
+	for (int i = 0; i < accessor->bits->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+
+		/* TODO: make a function that will do this */
+		snprintf(bitmap_filename, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, i);
+
+		if (!accessor->bits->participants[i].present)
+			continue;
+		BufFile    *file = BufFileOpenShared(accessor->fileset, bitmap_filename);
+
+		Assert(file);
+
+		statuses[statuses_length++] = file;
+	}
+
+	BufFile    *combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)	/* make it while not EOF */
+	{
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	accessor->combined = combined_bitmap_file;
+	return combined_bitmap_file;
+}
+
+/*  TODO: this is an API leak. We should be able to use something in the hashjoin state */
+/*  to indicate that the worker is the elected worker */
+/*  We tried using last_worker, but the problem is that last_worker can be false when */
+/*  there is a combined file (meaning this is the last worker), so, clearly, something needs */
+/*  to change about the flag. it is not expressing what it was meant to express. */
+bool
+sb_combined_exists(SharedBitsAccessor *accessor)
+{
+	return accessor->combined != NULL;
+}
+
+void
+sb_end_write(SharedBitsAccessor *sba)
+{
+	SharedBitsParticipant
+			   *const participant = &sba->bits->participants[sba->participant];
+
+	participant->writing = false;
+	BufFileClose(sba->write_file);
+	sba->write_file = NULL;
+}
+
+void
+sb_end_read(SharedBitsAccessor *accessor)
+{
+	BufFileClose(accessor->combined);
+	accessor->combined = NULL;
+}
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 0e5e9db820..045b8eca80 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -98,15 +98,10 @@ struct SharedTuplestoreAccessor
 	BlockNumber write_page;		/* The next page to write to. */
 	char	   *write_pointer;	/* Current write pointer within chunk. */
 	char	   *write_end;		/* One past the end of the current chunk. */
-
-	/* Bitmap of matched outer tuples (currently only used for hashjoin). */
-	BufFile    *outer_match_status_file;
 };
 
 static void sts_filename(char *name, SharedTuplestoreAccessor *accessor,
 						 int participant);
-static void
-			sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant);
 
 /*
  * Return the amount of shared memory required to hold SharedTuplestore for a
@@ -178,7 +173,6 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	accessor->sts = sts;
 	accessor->fileset = fileset;
 	accessor->context = CurrentMemoryContext;
-	accessor->outer_match_status_file = NULL;
 
 	return accessor;
 }
@@ -641,120 +635,10 @@ sts_increment_tuplenum(SharedTuplestoreAccessor *accessor)
 	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
 }
 
-void
-sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor)
-{
-	uint32		tuplenum = pg_atomic_read_u32(&accessor->sts->ntuples);
-
-	/* don't make the outer match status file if there are no tuples */
-	if (tuplenum == 0)
-		return;
-
-	char		name[MAXPGPATH];
-
-	sts_bitmap_filename(name, accessor, accessor->participant);
-
-	accessor->outer_match_status_file = BufFileCreateShared(accessor->fileset, name);
-
-	/* TODO: check this math. tuplenumber will be too high. */
-	uint32		num_to_write = tuplenum / 8 + 1;
-
-	unsigned char byteToWrite = 0;
-
-	BufFileWrite(accessor->outer_match_status_file, &byteToWrite, num_to_write);
-
-	if (BufFileSeek(accessor->outer_match_status_file, 0, 0L, SEEK_SET))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rewind hash-join temporary file: %m")));
-}
-
-void
-sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum)
-{
-	BufFile    *parallel_outer_matchstatuses = accessor->outer_match_status_file;
-	unsigned char current_outer_byte;
-
-	BufFileSeek(parallel_outer_matchstatuses, 0, tuplenum / 8, SEEK_SET);
-	BufFileRead(parallel_outer_matchstatuses, &current_outer_byte, 1);
-
-	current_outer_byte |= 1U << (tuplenum % 8);
-
-	if (BufFileSeek(parallel_outer_matchstatuses, 0, -1, SEEK_CUR) != 0)
-		elog(ERROR, "there is a problem with outer match status file. pid %i.", MyProcPid);
-	BufFileWrite(parallel_outer_matchstatuses, &current_outer_byte, 1);
-}
-
-void
-sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor)
-{
-	BufFileClose(accessor->outer_match_status_file);
-}
-
-BufFile *
-sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor)
-{
-	/* TODO: this tries to close an outer match status file for */
-	/* each participant in the tuplestore. technically, only participants */
-	/* in the barrier could have outer match status files, however, */
-	/* all but one participant continue on and detach from the barrier */
-	/* so we won't have a reliable way to close only files for those attached */
-	/* to the barrier */
-	BufFile   **statuses = palloc(sizeof(BufFile *) * accessor->sts->nparticipants);
-
-	/*
-	 * Open the bitmap shared BufFile from each participant. TODO: explain why
-	 * file can be NULLs
-	 */
-	int			statuses_length = 0;
-
-	for (int i = 0; i < accessor->sts->nparticipants; i++)
-	{
-		char		bitmap_filename[MAXPGPATH];
-
-		sts_bitmap_filename(bitmap_filename, accessor, i);
-		BufFile    *file = BufFileOpenSharedIfExists(accessor->fileset, bitmap_filename);
-
-		if (file != NULL)
-			statuses[statuses_length++] = file;
-	}
-
-	BufFile    *combined_bitmap_file = BufFileCreateTemp(false);
-
-	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)
-		/* make it while not */
-		EOF
-	{
-		unsigned char combined_byte = 0;
-
-		for (int i = 0; i < statuses_length; i++)
-		{
-			unsigned char read_byte;
-
-			BufFileRead(statuses[i], &read_byte, 1);
-			combined_byte |= read_byte;
-		}
-
-		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
-	}
-
-	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rewind hash-join temporary file: %m")));
-
-	for (int i = 0; i < statuses_length; i++)
-		BufFileClose(statuses[i]);
-	pfree(statuses);
-
-	return combined_bitmap_file;
-}
-
-
-static void
-sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant)
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
 {
-	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->sts->name, participant);
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
 }
 
 /*
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index b2cc12dc19..164a97ef96 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -193,10 +194,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -221,9 +229,9 @@ typedef struct ParallelHashJoinBatchAccessor
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
 	bool		done;			/* flag to remember that a batch is done */
-	BufFile    *combined_bitmap;	/* for Adaptive Hashjoin only  */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -270,6 +278,7 @@ typedef struct ParallelHashJoinState
 	pg_atomic_uint32 distributor;	/* counter for load balancing */
 
 	SharedFileSet fileset;		/* space for shared temporary files */
+	SharedFileSet sbfileset;
 } ParallelHashJoinState;
 
 /* The phases for building batches, used by build_barrier. */
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index f790f7e121..82c0f83611 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,6 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
-extern BufFile *BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name);
 extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
 
diff --git a/src/include/utils/sharedbits.h b/src/include/utils/sharedbits.h
new file mode 100644
index 0000000000..a554a59a38
--- /dev/null
+++ b/src/include/utils/sharedbits.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * sharedbits.h
+ *	  Simple mechanism for sharing bits between backends.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/sharedbits.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHAREDBITS_H
+#define SHAREDBITS_H
+
+#include "storage/sharedfileset.h"
+
+struct SharedBits;
+typedef struct SharedBits SharedBits;
+
+struct SharedBitsParticipant;
+typedef struct SharedBitsParticipant SharedBitsParticipant;
+
+struct SharedBitsAccessor;
+typedef struct SharedBitsAccessor SharedBitsAccessor;
+
+extern SharedBitsAccessor *sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset);
+extern SharedBitsAccessor *sb_initialize(SharedBits *sbits, int participants, int my_participant_number, SharedFileSet *fileset, char *name);
+extern void sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits);
+extern size_t sb_estimate(int participants);
+
+extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bit);
+extern bool sb_checkbit(SharedBitsAccessor *accessor, uint32 n);
+extern BufFile *sb_combine(SharedBitsAccessor *accessor);
+extern bool sb_combined_exists(SharedBitsAccessor *accessor);
+
+extern void sb_end_write(SharedBitsAccessor *sba);
+extern void sb_end_read(SharedBitsAccessor *accessor);
+
+#endif							/* SHAREDBITS_H */
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 8b2433e5c4..5e78f4bb15 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -71,11 +71,6 @@ extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 
 
 extern uint32 sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
-
-extern void sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor);
-extern void sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum);
-extern void sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor);
-extern BufFile *sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor);
-
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
 
 #endif							/* SHAREDTUPLESTORE_H */
-- 
2.20.1 (Apple Git-117)

v5-0005-Avoid-rescanning-inner-tuples-per-stripe.patchapplication/octet-stream; name=v5-0005-Avoid-rescanning-inner-tuples-per-stripe.patchDownload

From 46b05c9ac73ab661332a77fd4bd90238cb652836 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 11 Feb 2020 11:11:38 -0800
Subject: [PATCH v5 5/5] Avoid rescanning inner tuples per stripe

Instead of reinitializing the SharedTuplestore for the inner side for
each stripe during fallback, each participant's read_page is set to the
begininng of the SharedTuplestoreChunk which contains the end of one
stripe and the beginning of another.

Previously all inner tuples were scanned and only tuples from the
current stripe were loaded.

Each SharedTuplestoreAccessor now has a variable start_page, which is
initialized when it is assigned its read_page (which will always be the
beginning of a SharedTuplestoreChunk).

While loading tuples into the hashtable, if a tuple is from a past
stripe, the worker skips it (that will happen when a stripe straddles
two SharedTuplestoreChunks). If a tuple is from the future, the worker
backs that SharedTuplestoreChunk out and sets the shared read_page (in
the shared SharedTuplestoreParticipant) back to its start_page.

There are a couple mechanisms to provide for synchronization that
address specific race conditions/synchronization points:

Scenario 1:
- given a batch which has multiple stripes and has chosen the fallback
  strategy, two workers have each started reading from a single participant file
- worker0 is assigned pages 0-4
- worker1 is assigned pages 4-8
- stripe 1 starts on page 0 and ends on page 3
- worker0 sees that a tuple on page 3 is from stripe 2, so it proceeds
  to back out the read_page for this participant from 8 to its
  start_page of 0
- read_page is now 0
- worker1 was descheduled or distracted and starts reading a bit later.
  It sees that the very first tuple on page 4 is from a future
  stripe, so it wants to back out read_page to its start_page: 4
- If worker1 was allowed to do this, read_page would incorrectly be 4
  and tuples from stripe 2 on pages 3 and 4 would not be loaded into the
  hashtable
To handle this, a worker can only set read_page to a start_page which
is less than the current value of read_page

Scenario 2:
- given a batch which has multiple stripes and has chosen the fallback
  strategy, worker0 reads from participant file 0 and worker1 reads from
  participant file 1
- worker0 is assigned pages 0-4 in file1
- stripe 1 starts on page 0 and ends on page 3
- the current stripe is stripe 1
- worker0 sees on page3 that stripe 2 starts, so it backs out the
  read_page to 0
- worker1 finishes participant file 1 and proceeds to read from
  participant file 0
- worker1 opens the file and goes to get read_page
- read_page is 0, so worker1 loads tuples from stripe 1 in pages 0-3
- now both workers have loaded the same tuples into the hashtable
To handle this scenario, the participant has a rewound flag, which
indicates if this participant has been rewound during loading of the
current stripe. If it has, a worker cannot be assigned a
SharedTuplestoreChunk. This flag is reset later.

In this patch, Hashjoin makes an unacceptable intrusion into the
SharedTuplestore API. I am looking for feedback on how to solve this.
Basically, because the SharedTuplestore does not know about stripes or
about HashJoin, the logic to decide if a tuple should be loaded into a
hashtable or not is in the stripe phase machine where tuples are loaded
into the hashtable. So, to ensure that workers have read from all
participant files before assuming all tuples from a stripe are
loaded, I have duplicated the logic from sts_parallel_scan_next() which
has workers try the next participant in the body of the tuple loading
loop in the stripe phase machine (see sts_ready_for_next_stripe() and
sts_seen_all_participants()).

This clearly needs to be fixed and it is arguable that there are other
intrusions into the SharedTuplestore API in these patches.

One option is to write each stripe for each participant to a different
file, preserving the idea that a worker is done with a read_file when it
is at EOF.
---
 src/backend/executor/adaptiveHashjoin.c   |  37 +++++---
 src/backend/utils/sort/sharedtuplestore.c | 108 +++++++++++++++++++++-
 src/include/utils/sharedtuplestore.h      |   9 ++
 3 files changed, 140 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
index 696bfc1c79..bc0eb46b90 100644
--- a/src/backend/executor/adaptiveHashjoin.c
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -85,6 +85,9 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 
 					sts_reinitialize(outer_tuples);
 
+					/* set the rewound flag back to false to prepare for the next stripe */
+					sts_reset_rewound(inner_tuples);
+
 					/*
 					 * reset inner's hashtable and recycle the existing bucket
 					 * array.
@@ -96,33 +99,37 @@ ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
 
 					for (size_t i = 0; i < hashtable->nbuckets; ++i)
 						dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
-
-					/*
-					 * TODO: this will unfortunately rescan all inner tuples
-					 * in the batch for each chunk
-					 */
-
-					/*
-					 * should be able to save the block in the file which
-					 * starts the next chunk instead
-					 */
-					sts_reinitialize(inner_tuples);
 				}
 				/* Fall through. */
 			case PHJ_CHUNK_RESETTING:
 				BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_RESETTING);
 			case PHJ_CHUNK_LOADING:
 				/* Start (or join in) loading the next chunk of inner tuples. */
-				sts_begin_parallel_scan(inner_tuples);
+				sts_resume_parallel_scan(inner_tuples);
 
 				MinimalTuple tuple;
 				tupleMetadata metadata;
 
 				while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
 				{
-					if (metadata.chunk != phj_batch->current_chunk)
+					int current_stripe;
+					LWLockAcquire(&phj_batch->lock, LW_SHARED);
+					current_stripe = phj_batch->current_chunk;
+					LWLockRelease(&phj_batch->lock);
+
+					/* tuple from past. skip */
+					if (metadata.chunk < current_stripe)
 						continue;
+					/* tuple from future. time to back out read_page. end of stripe */
+					else if (metadata.chunk > current_stripe)
+					{
+						sts_backout_chunk(inner_tuples);
+						if (sts_seen_all_participants(inner_tuples))
+							break;
 
+						sts_ready_for_next_stripe(inner_tuples);
+						continue;
+					}
 					ExecForceStoreMinimalTuple(tuple,
 											   hjstate->hj_HashTupleSlot,
 											   false);
@@ -384,6 +391,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 							BarrierInit(&(barriers[i]), 0);
 						}
 						phj_batch->current_chunk = 1;
+						/* one worker needs to 0 out the read_pages of all the participants in the new batch */
+						sts_reinitialize(hashtable->batches[batchno].inner_tuples);
 					}
 					/* Fall through. */
 
@@ -410,6 +419,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					if (batchno == 0)
 						sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
 
+					sts_begin_parallel_scan(hashtable->batches[batchno].inner_tuples);
+
 					/*
 					 * Create an outer match status file for this batch for
 					 * this worker This file must be accessible to the other
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 045b8eca80..a45f86bdd2 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -52,6 +52,7 @@ typedef struct SharedTuplestoreParticipant
 {
 	LWLock		lock;
 	BlockNumber read_page;		/* Page number for next read. */
+	bool    rewound;
 	BlockNumber npages;			/* Number of pages written. */
 	bool		writing;		/* Used only for assertions. */
 } SharedTuplestoreParticipant;
@@ -91,6 +92,7 @@ struct SharedTuplestoreAccessor
 	char	   *read_buffer;	/* A buffer for loading tuples. */
 	size_t		read_buffer_size;
 	BlockNumber read_next_page; /* Lowest block we'll consider reading. */
+	BlockNumber start_page; /* page to reset p->read_page to if back out required */
 
 	/* State for writing. */
 	SharedTuplestoreChunk *write_chunk; /* Buffer for writing. */
@@ -103,6 +105,21 @@ struct SharedTuplestoreAccessor
 static void sts_filename(char *name, SharedTuplestoreAccessor *accessor,
 						 int participant);
 
+bool
+sts_seen_all_participants(SharedTuplestoreAccessor *accessor)
+{
+	accessor->read_participant = (accessor->read_participant + 1) %
+		accessor->sts->nparticipants;
+	return accessor->read_participant == accessor->participant;
+}
+void
+sts_ready_for_next_stripe(SharedTuplestoreAccessor *accessor)
+{
+	accessor->read_next_page = 0;
+	BufFileClose(accessor->read_file);
+	accessor->read_file = NULL;
+}
+
 /*
  * Return the amount of shared memory required to hold SharedTuplestore for a
  * given number of participants.
@@ -166,6 +183,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 						 LWTRANCHE_SHARED_TUPLESTORE);
 		sts->participants[i].read_page = 0;
 		sts->participants[i].writing = false;
+		sts->participants[i].rewound = false;
 	}
 
 	accessor = palloc0(sizeof(SharedTuplestoreAccessor));
@@ -284,6 +302,47 @@ sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor)
 	accessor->read_participant = accessor->participant;
 	accessor->read_file = NULL;
 	accessor->read_next_page = 0;
+	/*
+	 * As long as all code paths go through the Stripe Phase Machine and the
+	 * Batch Phase Machine, it is not required to zero out start_page here.
+	 * Do it anyway, for now.
+	 */
+	accessor->start_page = 0;
+}
+
+void
+sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor)
+{
+	int			i PG_USED_FOR_ASSERTS_ONLY;
+
+	/* End any existing scan that was in progress. */
+	sts_end_parallel_scan(accessor);
+
+	/*
+	 * Any backend that might have written into this shared tuplestore must
+	 * have called sts_end_write(), so that all buffers are flushed and the
+	 * files have stopped growing.
+	 */
+	for (i = 0; i < accessor->sts->nparticipants; ++i)
+		Assert(!accessor->sts->participants[i].writing);
+
+	/*
+	 * We will start out reading the file that THIS backend wrote.  There may
+	 * be some caching locality advantage to that.
+	 */
+	/*
+	 * TODO: does this still apply in the multi-stripe case?
+	 * It seems like if a participant file is exhausted for the current stripe
+	 * it might be better to remember that
+	 */
+	accessor->read_participant = accessor->participant;
+	accessor->read_file = NULL;
+	SharedTuplestoreParticipant *p = &accessor->sts->participants[accessor->read_participant];
+
+	LWLockAcquire(&p->lock, LW_SHARED);
+	accessor->start_page = accessor->sts->participants[accessor->read_participant].read_page;
+	LWLockRelease(&p->lock);
+	accessor->read_next_page = 0;
 }
 
 /*
@@ -302,6 +361,30 @@ sts_end_parallel_scan(SharedTuplestoreAccessor *accessor)
 		BufFileClose(accessor->read_file);
 		accessor->read_file = NULL;
 	}
+	/* It is probably not required to zero out start_page here */
+	accessor->start_page = 0;
+}
+
+void
+sts_backout_chunk(SharedTuplestoreAccessor *accessor)
+{
+	SharedTuplestoreParticipant *p = &accessor->sts->participants[accessor->read_participant];
+
+	LWLockAcquire(&p->lock, LW_EXCLUSIVE);
+	/*
+	 * Only set the read_page back to the start of the sts_chunk this worker was
+	 * reading if some other worker has not already done so. It could be the case
+	 * that this worker saw a tuple from a future stripe and another worker did
+	 * also in its stschunk and it already set read_page to its start_page
+	 * If so, we want to set read_page to the lowest value to ensure that we
+	 * read all tuples from the stripe (don't miss tuples)
+	 */
+	if (accessor->start_page < p->read_page)
+	{
+		p->read_page = accessor->start_page;
+		p->rewound = true;
+	}
+	LWLockRelease(&p->lock);
 }
 
 /*
@@ -526,6 +609,17 @@ sts_read_tuple(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return tuple;
 }
 
+void
+sts_reset_rewound(SharedTuplestoreAccessor *accessor)
+{
+	SharedTuplestoreParticipant *p;
+	for (int i = 0; i < accessor->sts->nparticipants; ++i)
+	{
+		p = &accessor->sts->participants[i];
+		p->rewound = false;
+	}
+}
+
 /*
  * Get the next tuple in the current parallel scan.
  */
@@ -539,7 +633,12 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	for (;;)
 	{
 		/* Can we read more tuples from the current chunk? */
-		if (accessor->read_ntuples < accessor->read_ntuples_available)
+		/*
+		 * Added a check for accessor->read_file being present here, as it
+		 * became relevant for adaptive hashjoin. Not sure if this has
+		 * other consequences for correctness
+		 */
+		if (accessor->read_ntuples < accessor->read_ntuples_available && accessor->read_file)
 			return sts_read_tuple(accessor, meta_data);
 
 		/* Find the location of a new chunk to read. */
@@ -552,11 +651,18 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 		eof = p->read_page >= p->npages;
 		if (!eof)
 		{
+			if (p->rewound == true)
+			{
+				LWLockRelease(&p->lock);
+				return NULL;
+			}
 			/* Claim the next chunk. */
 			read_page = p->read_page;
 			/* Advance the read head for the next reader. */
 			p->read_page += STS_CHUNK_PAGES;
 			accessor->read_next_page = p->read_page;
+			/* initialize start_page to the read_page this participant will start reading from */
+			accessor->start_page = read_page;
 		}
 		LWLockRelease(&p->lock);
 
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 5e78f4bb15..fadf0232d0 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -60,8 +60,16 @@ extern void sts_reinitialize(SharedTuplestoreAccessor *accessor);
 
 extern void sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor);
 
+extern void sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor);
+
 extern void sts_end_parallel_scan(SharedTuplestoreAccessor *accessor);
 
+extern void sts_backout_chunk(SharedTuplestoreAccessor *accessor);
+
+extern bool sts_seen_all_participants(SharedTuplestoreAccessor *accessor);
+
+extern void sts_ready_for_next_stripe(SharedTuplestoreAccessor *accessor);
+
 extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 						 void *meta_data,
 						 MinimalTuple tuple);
@@ -69,6 +77,7 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+extern void sts_reset_rewound(SharedTuplestoreAccessor *accessor);
 
 extern uint32 sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
 extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
-- 
2.20.1 (Apple Git-117)

v5-0001-Implement-Adaptive-Hashjoin.patchapplication/octet-stream; name=v5-0001-Implement-Adaptive-Hashjoin.patchDownload

From 7e5d82b65006523702221bb1c7c9e1d079781cc9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 29 Dec 2019 18:56:42 -0800
Subject: [PATCH v5 1/5] Implement Adaptive Hashjoin

Serial Hashloop Fallback:

"Chunk" the inner file into arbitrary partitions of work_mem size
offset along tuple bounds while loading the batch into the hashtable.

Note that this makes it impossible to increase nbatches during the
loading of batches after initial hashtable creation.

In preparation for doing this chunking, separate "advance batch" and
"load batch".

Implement outer tuple batch rewinding per chunk of inner batch. Would
be a simple rewind and replay of outer side for each chunk of inner if
it weren't for LOJ. Because we need to wait to emit NULL-extended
tuples for LOJ until after all chunks of inner have been processed.

To do this without incurring additional memory pressure, use a
temporary BufFile to capture the match status of each outer side
tuple. Use one bit per tuple to represent the match status, and, since
for parallel-oblivious hashjoin the outer side tuples are encountered
in a deterministic order, synchronizing the outer tuples match status
file with the outer tuples in the batch file to decide which ones to
emit NULL-extended is easy and can be done with a simple counter.

For non-hashloop fallback scenario (including batch 0), this file is
not created and unmatched outer tuples should be emitted as they are
encountered.

Parallel Hashloop Fallback:

During initial allocation of the hashtable, each time the number of
batches is increased, a new variable in the ParallelHashJoinState,
batch_increases, is incremented.

In PHJ_GROW_BATCHES_DECIDING, if pstate->batch_increases >= 2,
parallel_hashloop_fallback will be enabled for qualifying batches.
From then on, if a batch is still too large to fit into the
space_allowed, then parallel_hashloop_fallback is set on that batch.
It will not be allowed to divide further and, during execution, the
fallback strategy will be used.

For a batch which has parallel_hashloop_fallback set, tuples inserted
into the the batch's inner and outer batch files will have an
additional piece of metadata (other than the hashvalue). For the inner
side, this additional metadata is the chunk number, For the outer
side, this additional metadata is the tuple identifier--needed when
rescanning the outer side batch file for each chunk of the inner.

During execution of a parallel hashjoin batch which needs to fall
back, the worker will create an "outer match status file" which
contains a bitmap tracking which outer tuples have matched an inner
tuple. All bits in the worker's outer match status file are initially
unset. During probing, the worker will set the corresponding bit (the
bit at the index of the tuple identifier) in the outer match status
bitmap for an outer tuple which matches any inner tuple.

Workers probing a fallback batch will wait until all workers have
finished probing before moving on so that an elected worker can read
and combine the outer match status files into a single bitmap and use
it to emit unmatched outer tuples after all chunks of the inner side
have been processed.
---
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/adaptiveHashjoin.c       |  349 +++++
 src/backend/executor/nodeHash.c               |  127 +-
 src/backend/executor/nodeHashjoin.c           | 1202 +++++++++++-----
 src/backend/postmaster/pgstat.c               |   21 +
 src/backend/storage/file/buffile.c            |   65 +
 src/backend/storage/ipc/barrier.c             |   85 ++
 src/backend/utils/sort/sharedtuplestore.c     |  133 ++
 src/include/executor/adaptiveHashjoin.h       |    9 +
 src/include/executor/hashjoin.h               |   28 +-
 src/include/executor/nodeHash.h               |    5 +-
 src/include/executor/tuptable.h               |    3 +-
 src/include/nodes/execnodes.h                 |   17 +
 src/include/pgstat.h                          |    8 +
 src/include/storage/barrier.h                 |    1 +
 src/include/storage/buffile.h                 |    3 +
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/sharedtuplestore.h          |   22 +
 src/test/regress/expected/adaptive_hj.out     | 1233 +++++++++++++++++
 .../regress/expected/parallel_adaptive_hj.out |  343 +++++
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/post_schedule                |    8 +
 src/test/regress/pre_schedule                 |  120 ++
 src/test/regress/serial_schedule              |    2 +
 src/test/regress/sql/adaptive_hj.sql          |  240 ++++
 src/test/regress/sql/parallel_adaptive_hj.sql |  182 +++
 26 files changed, 3841 insertions(+), 369 deletions(-)
 create mode 100644 src/backend/executor/adaptiveHashjoin.c
 create mode 100644 src/include/executor/adaptiveHashjoin.h
 create mode 100644 src/test/regress/expected/adaptive_hj.out
 create mode 100644 src/test/regress/expected/parallel_adaptive_hj.out
 create mode 100644 src/test/regress/post_schedule
 create mode 100644 src/test/regress/pre_schedule
 create mode 100644 src/test/regress/sql/adaptive_hj.sql
 create mode 100644 src/test/regress/sql/parallel_adaptive_hj.sql

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..54799d7644 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	adaptiveHashjoin.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/adaptiveHashjoin.c b/src/backend/executor/adaptiveHashjoin.c
new file mode 100644
index 0000000000..dff5b38d38
--- /dev/null
+++ b/src/backend/executor/adaptiveHashjoin.c
@@ -0,0 +1,349 @@
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "executor/hashjoin.h"
+#include "executor/nodeHash.h"
+#include "executor/nodeHashjoin.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/sharedtuplestore.h"
+
+#include "executor/adaptiveHashjoin.h"
+
+
+
+
+bool
+ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing)
+{
+	HashJoinTable hashtable;
+	int			batchno;
+	ParallelHashJoinBatch *phj_batch;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+	Barrier    *chunk_barrier;
+
+	hashtable = hjstate->hj_HashTable;
+	batchno = hashtable->curbatch;
+	phj_batch = hashtable->batches[batchno].shared;
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	/*
+	 * This chunk_barrier is initialized in the ELECTING phase when this
+	 * worker attached to the batch in ExecParallelHashJoinNewBatch()
+	 */
+	chunk_barrier = &hashtable->batches[batchno].shared->chunk_barrier;
+
+	/*
+	 * If this worker just came from probing (from HJ_SCAN_BUCKET) we need to
+	 * advance the chunk number here. Otherwise this worker isn't attached yet
+	 * to the chunk barrier.
+	 */
+	if (advance_from_probing)
+	{
+		/*
+		 * The current chunk number can't be incremented if *any* worker isn't
+		 * done yet (otherwise they might access the wrong data structure!)
+		 */
+		if (BarrierArriveAndWait(chunk_barrier,
+								 WAIT_EVENT_HASH_CHUNK_PROBING))
+			phj_batch->current_chunk_num++;
+
+		/* Once the barrier is advanced we'll be in the DONE phase */
+	}
+	else
+		BarrierAttach(chunk_barrier);
+
+	/*
+	 * The outer side is exhausted and either 1) the current chunk of the
+	 * inner side is exhausted and it is time to advance the chunk 2) the last
+	 * chunk of the inner side is exhausted and it is time to advance the
+	 * batch
+	 */
+	switch (BarrierPhase(chunk_barrier))
+	{
+			/*
+			 * TODO: remove this phase and coordinate access to hashtable
+			 * above goto and after incrementing current_chunk_num
+			 */
+		case PHJ_CHUNK_ELECTING:
+	phj_chunk_electing:
+			BarrierArriveAndWait(chunk_barrier,
+								 WAIT_EVENT_HASH_CHUNK_ELECTING);
+			/* Fall through. */
+
+		case PHJ_CHUNK_LOADING:
+			/* Start (or join in) loading the next chunk of inner tuples. */
+			sts_begin_parallel_scan(inner_tuples);
+
+			MinimalTuple tuple;
+			tupleMetadata metadata;
+
+			while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+			{
+				if (metadata.tupleid != phj_batch->current_chunk_num)
+					continue;
+
+				ExecForceStoreMinimalTuple(tuple,
+										   hjstate->hj_HashTupleSlot,
+										   false);
+
+				ExecParallelHashTableInsertCurrentBatch(
+														hashtable,
+														hjstate->hj_HashTupleSlot,
+														metadata.hashvalue);
+			}
+			sts_end_parallel_scan(inner_tuples);
+			BarrierArriveAndWait(chunk_barrier,
+								 WAIT_EVENT_HASH_CHUNK_LOADING);
+			/* Fall through. */
+
+		case PHJ_CHUNK_PROBING:
+			sts_begin_parallel_scan(outer_tuples);
+			return true;
+
+		case PHJ_CHUNK_DONE:
+
+			BarrierArriveAndWait(chunk_barrier, WAIT_EVENT_HASH_CHUNK_DONE);
+
+			if (phj_batch->current_chunk_num > phj_batch->total_num_chunks)
+			{
+				BarrierDetach(chunk_barrier);
+				return false;
+			}
+
+			/*
+			 * Otherwise it is time for the next chunk. One worker should
+			 * reset the hashtable
+			 */
+			if (BarrierArriveExplicitAndWait(chunk_barrier, PHJ_CHUNK_ELECTING, WAIT_EVENT_HASH_ADVANCE_CHUNK))
+			{
+				/*
+				 * rewind/reset outer tuplestore and rewind outer match status
+				 * files
+				 */
+				sts_reinitialize(outer_tuples);
+
+				/*
+				 * reset inner's hashtable and recycle the existing bucket
+				 * array.
+				 */
+				dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
+				dsa_get_address(hashtable->area, phj_batch->buckets);
+
+				for (size_t i = 0; i < hashtable->nbuckets; ++i)
+					dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+
+				/*
+				 * TODO: this will unfortunately rescan all inner tuples in
+				 * the batch for each chunk
+				 */
+
+				/*
+				 * should be able to save the block in the file which starts
+				 * the next chunk instead
+				 */
+				sts_reinitialize(inner_tuples);
+			}
+			goto phj_chunk_electing;
+
+		case PHJ_CHUNK_FINAL:
+			BarrierDetach(chunk_barrier);
+			return false;
+
+		default:
+			elog(ERROR, "unexpected chunk phase %d. pid %i. batch %i.",
+				 BarrierPhase(chunk_barrier), MyProcPid, batchno);
+	}
+
+	return false;
+}
+
+
+/*
+ * Choose a batch to work on, and attach to it.  Returns true if successful,
+ * false if there are no more batches.
+ */
+bool
+ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			start_batchno;
+	int			batchno;
+
+	/*
+	 * If we started up so late that the batch tracking array has been freed
+	 * already by ExecHashTableDetach(), then we are finished.  See also
+	 * ExecParallelHashEnsureBatchAccessors().
+	 */
+	if (hashtable->batches == NULL)
+		return false;
+
+	/*
+	 * For hashloop fallback only Only the elected worker who was chosen to
+	 * combine the outer match status bitmaps should reach here. This worker
+	 * must do some final cleanup and then detach from the batch
+	 */
+	if (hjstate->combined_bitmap != NULL)
+	{
+		BufFileClose(hjstate->combined_bitmap);
+		hjstate->combined_bitmap = NULL;
+		hashtable->batches[hashtable->curbatch].done = true;
+		ExecHashTableDetachBatch(hashtable);
+	}
+
+	/*
+	 * If we were already attached to a batch, remember not to bother checking
+	 * it again, and detach from it (possibly freeing the hash table if we are
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_CHUNKING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_CHUNKING
+	 * case). The PHJ_BATCH_CHUNKING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
+	 */
+	if (hashtable->curbatch >= 0)
+	{
+		ParallelHashJoinBatchAccessor *accessor = hashtable->batches + hashtable->curbatch;
+		ParallelHashJoinBatch *batch = accessor->shared;
+
+		/*
+		 * End the parallel scan on the outer tuples before we arrive at the
+		 * next barrier so that the last worker to arrive at that barrier can
+		 * reinitialize the SharedTuplestore for another parallel scan.
+		 */
+
+		if (!batch->parallel_hashloop_fallback)
+			BarrierArriveAndWait(&batch->batch_barrier,
+								 WAIT_EVENT_HASH_BATCH_PROBING);
+		else
+		{
+			sts_close_outer_match_status_file(accessor->outer_tuples);
+
+			/*
+			 * If all workers (including this one) have finished probing the
+			 * batch, one worker is elected to Combine all the outer match
+			 * status files from the workers who were attached to this batch
+			 * Loop through the outer match status files from all workers that
+			 * were attached to this batch Combine them into one bitmap Use
+			 * the bitmap, loop through the outer batch file again, and emit
+			 * unmatched tuples
+			 */
+
+			if (BarrierArriveAndWait(&batch->batch_barrier,
+									 WAIT_EVENT_HASH_BATCH_PROBING))
+			{
+				hjstate->combined_bitmap = sts_combine_outer_match_status_files(accessor->outer_tuples);
+				hjstate->last_worker = true;
+				return true;
+			}
+		}
+
+		/* the elected combining worker should not reach here */
+		hashtable->batches[hashtable->curbatch].done = true;
+		ExecHashTableDetachBatch(hashtable);
+	}
+
+	/*
+	 * Search for a batch that isn't done.  We use an atomic counter to start
+	 * our search at a different batch in every participant when there are
+	 * more batches than participants.
+	 */
+	batchno = start_batchno =
+		pg_atomic_fetch_add_u32(&hashtable->parallel_state->distributor, 1) %
+		hashtable->nbatch;
+
+	do
+	{
+		if (!hashtable->batches[batchno].done)
+		{
+			Barrier    *batch_barrier =
+			&hashtable->batches[batchno].shared->batch_barrier;
+
+			switch (BarrierAttach(batch_barrier))
+			{
+				case PHJ_BATCH_ELECTING:
+					/* One backend allocates the hash table. */
+					if (BarrierArriveAndWait(batch_barrier,
+											 WAIT_EVENT_HASH_BATCH_ELECTING))
+					{
+						ExecParallelHashTableAlloc(hashtable, batchno);
+						Barrier    *chunk_barrier =
+						&hashtable->batches[batchno].shared->chunk_barrier;
+
+						BarrierInit(chunk_barrier, 0);
+						hashtable->batches[batchno].shared->current_chunk_num = 1;
+					}
+					/* Fall through. */
+
+				case PHJ_BATCH_ALLOCATING:
+					/* Wait for allocation to complete. */
+					BarrierArriveAndWait(batch_barrier,
+										 WAIT_EVENT_HASH_BATCH_ALLOCATING);
+					/* Fall through. */
+
+				case PHJ_BATCH_CHUNKING:
+
+					/*
+					 * This batch is ready to probe.  Return control to
+					 * caller. We stay attached to batch_barrier so that the
+					 * hash table stays alive until everyone's finished
+					 * probing it, but no participant is allowed to wait at
+					 * this barrier again (or else a deadlock could occur).
+					 * All attached participants must eventually call
+					 * BarrierArriveAndDetach() so that the final phase
+					 * PHJ_BATCH_DONE can be reached.
+					 */
+					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
+
+					if (batchno == 0)
+						sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
+
+					/*
+					 * Create an outer match status file for this batch for
+					 * this worker This file must be accessible to the other
+					 * workers But *only* written to by this worker. Written
+					 * to by this worker and readable by any worker
+					 */
+					if (hashtable->batches[batchno].shared->parallel_hashloop_fallback)
+						sts_make_outer_match_status_file(hashtable->batches[batchno].outer_tuples);
+
+					return true;
+
+				case PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING:
+
+					/*
+					 * The batch isn't done but this worker can't contribute
+					 * anything to it so it might as well be done from this
+					 * worker's perspective. (Only one worker can do work in
+					 * this phase).
+					 */
+
+					/* Fall through. */
+
+				case PHJ_BATCH_DONE:
+
+					/*
+					 * Already done. Detach and go around again (if any
+					 * remain).
+					 */
+					BarrierDetach(batch_barrier);
+
+					hashtable->batches[batchno].done = true;
+					hashtable->curbatch = -1;
+					break;
+
+				default:
+					elog(ERROR, "unexpected batch phase %d. pid %i. batchno %i.",
+						 BarrierPhase(batch_barrier), MyProcPid, batchno);
+			}
+		}
+		batchno = (batchno + 1) % hashtable->nbatch;
+	} while (batchno != start_batchno);
+
+	return false;
+}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index b6d5084908..c5420b169e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -588,7 +588,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		 * Attach to the build barrier.  The corresponding detach operation is
 		 * in ExecHashTableDetach.  Note that we won't attach to the
 		 * batch_barrier for batch 0 yet.  We'll attach later and start it out
-		 * in PHJ_BATCH_PROBING phase, because batch 0 is allocated up front
+		 * in PHJ_BATCH_CHUNKING phase, because batch 0 is allocated up front
 		 * and then loaded while hashing (the standard hybrid hash join
 		 * algorithm), and we'll coordinate that using build_barrier.
 		 */
@@ -1061,6 +1061,9 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			i;
 
 	Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
+	LWLockAcquire(&pstate->lock, LW_EXCLUSIVE);
+	pstate->batch_increases++;
+	LWLockRelease(&pstate->lock);
 
 	/*
 	 * It's unlikely, but we need to be prepared for new participants to show
@@ -1216,11 +1219,17 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 			{
 				bool		space_exhausted = false;
 				bool		extreme_skew_detected = false;
+				bool		excessive_batch_num_increases = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
 				ExecParallelHashTableSetCurrentBatch(hashtable, 0);
 
+				LWLockAcquire(&pstate->lock, LW_EXCLUSIVE);
+				if (pstate->batch_increases >= 2)
+					excessive_batch_num_increases = true;
+				LWLockRelease(&pstate->lock);
+
 				/* Are any of the new generation of batches exhausted? */
 				for (i = 0; i < hashtable->nbatch; ++i)
 				{
@@ -1233,6 +1242,36 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 
 						space_exhausted = true;
 
+						/*
+						 * only once we've increased the number of batches
+						 * overall many times should we start setting
+						 */
+
+						/*
+						 * some batches to use the fallback strategy. Those
+						 * that are still too big will have this option set
+						 */
+
+						/*
+						 * we better not repartition again (growth should be
+						 * disabled), so that we don't overwrite this value
+						 */
+
+						/*
+						 * this tells us if we have set fallback to true or
+						 * not and how many chunks -- useful for seeing how
+						 * many chunks
+						 */
+
+						/*
+						 * we can get to before setting it to true (since we
+						 * still mark chunks (work_mem sized chunks)) in
+						 * batches even if we don't fall back
+						 */
+						/* same for below but opposite */
+						if (excessive_batch_num_increases == true)
+							batch->parallel_hashloop_fallback = true;
+
 						/*
 						 * Did this batch receive ALL of the tuples from its
 						 * parent batch?  That would indicate that further
@@ -1248,6 +1287,8 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* Don't keep growing if it's not helping or we'd overflow. */
 				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
+				else if (excessive_batch_num_increases && space_exhausted)
+					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
 				else
@@ -1315,9 +1356,27 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
 				/* It belongs in a later batch. */
+				ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+				LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+				/* TODO: should I check batch estimated size here at all? */
+				if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > hashtable->parallel_state->space_allowed))
+				{
+					phj_batch->total_num_chunks++;
+					phj_batch->estimated_chunk_size = tuple_size;
+				}
+				else
+					phj_batch->estimated_chunk_size += tuple_size;
+
+				tupleMetadata metadata;
+
+				metadata.hashvalue = hashTuple->hashvalue;
+				metadata.tupleid = phj_batch->total_num_chunks;
+				LWLockRelease(&phj_batch->lock);
+
 				hashtable->batches[batchno].estimated_size += tuple_size;
 				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+							 &metadata, tuple);
 			}
 
 			/* Count this tuple. */
@@ -1369,12 +1428,15 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+		tupleMetadata metadata;
+
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &metadata)))
 		{
 			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
 
+			hashvalue = metadata.hashvalue;
 			/* Decide which partition it goes to in the new generation. */
 			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
 									  &batchno);
@@ -1383,10 +1445,27 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 			++hashtable->batches[batchno].ntuples;
 			++hashtable->batches[i].old_ntuples;
 
+			ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+			LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+			/* TODO: should I check batch estimated size here at all? */
+			if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
+			{
+				phj_batch->total_num_chunks++;
+				phj_batch->estimated_chunk_size = tuple_size;
+			}
+			else
+				phj_batch->estimated_chunk_size += tuple_size;
+			metadata.tupleid = phj_batch->total_num_chunks;
+			LWLockRelease(&phj_batch->lock);
 			/* Store the tuple its new batch. */
 			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+						 &metadata, tuple);
 
+			/*
+			 * TODO: should I zero out metadata here to make sure old values
+			 * aren't reused?
+			 */
 			CHECK_FOR_INTERRUPTS();
 		}
 		sts_end_parallel_scan(old_inner_tuples[i]);
@@ -1719,6 +1798,7 @@ retry:
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
 		Assert(batchno > 0);
+		ParallelHashJoinState *pstate = hashtable->parallel_state;
 
 		/* Try to preallocate space in the batch if necessary. */
 		if (hashtable->batches[batchno].preallocated < tuple_size)
@@ -1729,7 +1809,31 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
+		ParallelHashJoinBatch *phj_batch = hashtable->batches[batchno].shared;
+
+		LWLockAcquire(&phj_batch->lock, LW_EXCLUSIVE);
+
+		/* TODO: should batch estimated size be considered here? */
+
+		/*
+		 * TODO: should this be done in
+		 * ExecParallelHashTableInsertCurrentBatch instead?
+		 */
+		if (phj_batch->parallel_hashloop_fallback == true && (phj_batch->estimated_chunk_size + tuple_size > pstate->space_allowed))
+		{
+			phj_batch->total_num_chunks++;
+			phj_batch->estimated_chunk_size = tuple_size;
+		}
+		else
+			phj_batch->estimated_chunk_size += tuple_size;
+
+		tupleMetadata metadata;
+
+		metadata.hashvalue = hashvalue;
+		metadata.tupleid = phj_batch->total_num_chunks;
+		LWLockRelease(&phj_batch->lock);
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata,
 					 tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
@@ -2936,6 +3040,13 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
 		char		name[MAXPGPATH];
 
+		shared->parallel_hashloop_fallback = false;
+		LWLockInitialize(&shared->lock,
+						 LWTRANCHE_PARALLEL_HASH_JOIN_BATCH);
+		shared->current_chunk_num = 0;
+		shared->total_num_chunks = 1;
+		shared->estimated_chunk_size = 0;
+
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
@@ -2945,7 +3056,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 		{
 			/* Batch 0 doesn't need to be loaded. */
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_CHUNKING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
 		}
@@ -2959,7 +3070,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2969,7 +3080,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index c901a80923..565b0c289f 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -81,11 +81,11 @@
  *  PHJ_BATCH_ELECTING       -- initial state
  *  PHJ_BATCH_ALLOCATING     -- one allocates buckets
  *  PHJ_BATCH_LOADING        -- all load the hash table from disk
- *  PHJ_BATCH_PROBING        -- all probe
+ *  PHJ_BATCH_CHUNKING       -- all probe
  *  PHJ_BATCH_DONE           -- end
  *
  * Batch 0 is a special case, because it starts out in phase
- * PHJ_BATCH_PROBING; populating batch 0's hash table is done during
+ * PHJ_BATCH_CHUNKING; populating batch 0's hash table is done during
  * PHJ_BUILD_HASHING_INNER so we can skip loading.
  *
  * Initially we try to plan for a single-batch hash join using the combined
@@ -98,7 +98,7 @@
  * already arrived.  Practically, that means that we never return a tuple
  * while attached to a barrier, unless the barrier has reached its final
  * state.  In the slightly special case of the per-batch barrier, we return
- * tuples while in PHJ_BATCH_PROBING phase, but that's OK because we use
+ * tuples while in PHJ_BATCH_CHUNKING phase, but that's OK because we use
  * BarrierArriveAndDetach() to advance it to PHJ_BATCH_DONE without waiting.
  *
  *-------------------------------------------------------------------------
@@ -117,6 +117,8 @@
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
+#include "executor/adaptiveHashjoin.h"
+
 
 /*
  * States of the ExecHashJoin state machine
@@ -124,9 +126,11 @@
 #define HJ_BUILD_HASHTABLE		1
 #define HJ_NEED_NEW_OUTER		2
 #define HJ_SCAN_BUCKET			3
-#define HJ_FILL_OUTER_TUPLE		4
-#define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_FILL_INNER_TUPLES    4
+#define HJ_NEED_NEW_BATCH		5
+#define HJ_NEED_NEW_INNER_CHUNK 6
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT 7
+#define HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER 8
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +147,15 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
-static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
-static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+
+static bool ExecHashJoinAdvanceBatch(HashJoinState *hjstate);
+static bool ExecHashJoinLoadInnerBatch(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
 
+static TupleTableSlot *emitUnmatchedOuterTuple(ExprState *otherqual,
+											   ExprContext *econtext,
+											   HashJoinState *hjstate);
+
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -161,8 +170,15 @@ static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
  *			  the other one is "outer".
  * ----------------------------------------------------------------
  */
-static pg_attribute_always_inline TupleTableSlot *
-ExecHashJoinImpl(PlanState *pstate, bool parallel)
+
+/* ----------------------------------------------------------------
+ *		ExecHashJoin
+ *
+ *		Parallel-oblivious version.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *			/* return: a tuple or NULL */
+ExecHashJoin(PlanState *pstate)
 {
 	HashJoinState *node = castNode(HashJoinState, pstate);
 	PlanState  *outerNode;
@@ -174,7 +190,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	TupleTableSlot *outerTupleSlot;
 	uint32		hashvalue;
 	int			batchno;
-	ParallelHashJoinState *parallel_state;
+
+	BufFile    *outerFileForAdaptiveRead;
 
 	/*
 	 * get information from HashJoin node
@@ -185,7 +202,6 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	outerNode = outerPlanState(node);
 	hashtable = node->hj_HashTable;
 	econtext = node->js.ps.ps_ExprContext;
-	parallel_state = hashNode->parallel_state;
 
 	/*
 	 * Reset per-tuple memory context to free any expression evaluation
@@ -243,18 +259,6 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					/* no chance to not build the hash table */
 					node->hj_FirstOuterTupleSlot = NULL;
 				}
-				else if (parallel)
-				{
-					/*
-					 * The empty-outer optimization is not implemented for
-					 * shared hash tables, because no one participant can
-					 * determine that there are no outer tuples, and it's not
-					 * yet clear that it's worth the synchronization overhead
-					 * of reaching consensus to figure that out.  So we have
-					 * to build the hash table.
-					 */
-					node->hj_FirstOuterTupleSlot = NULL;
-				}
 				else if (HJ_FILL_OUTER(node) ||
 						 (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
 						  !node->hj_OuterNotEmpty))
@@ -271,17 +275,533 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				else
 					node->hj_FirstOuterTupleSlot = NULL;
 
-				/*
-				 * Create the hash table.  If using Parallel Hash, then
-				 * whoever gets here first will create the hash table and any
-				 * later arrivals will merely attach to it.
-				 */
+				/* Create the hash table. */
 				hashtable = ExecHashTableCreate(hashNode,
 												node->hj_HashOperators,
 												node->hj_Collations,
 												HJ_FILL_INNER(node));
 				node->hj_HashTable = hashtable;
 
+				/* Execute the Hash node, to build the hash table. */
+				hashNode->hashtable = hashtable;
+				(void) MultiExecProcNode((PlanState *) hashNode);
+
+				/*
+				 * If the inner relation is completely empty, and we're not
+				 * doing a left outer join, we can quit without scanning the
+				 * outer relation.
+				 */
+				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
+					return NULL;
+
+				/*
+				 * need to remember whether nbatch has increased since we
+				 * began scanning the outer relation
+				 */
+				hashtable->nbatch_outstart = hashtable->nbatch;
+
+				/*
+				 * Reset OuterNotEmpty for scan.  (It's OK if we fetched a
+				 * tuple above, because ExecHashJoinOuterGetTuple will
+				 * immediately set it again.)
+				 */
+				node->hj_OuterNotEmpty = false;
+
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				/* FALL THRU */
+
+			case HJ_NEED_NEW_OUTER:
+
+				/*
+				 * We don't have an outer tuple, try to get the next one
+				 */
+				outerTupleSlot =
+					ExecHashJoinOuterGetTuple(outerNode, node, &hashvalue);
+
+				if (TupIsNull(outerTupleSlot))
+				{
+					/*
+					 * end of batch, or maybe whole join. for hashloop
+					 * fallback, all we know is outer batch is exhausted.
+					 * inner could have more chunks
+					 */
+					if (HJ_FILL_INNER(node))
+					{
+						/* set up to scan for unmatched inner tuples */
+						ExecPrepHashTableForUnmatched(node);
+						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+						break;
+					}
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					break;
+				}
+
+				econtext->ecxt_outertuple = outerTupleSlot;
+
+				/*
+				 * Find the corresponding bucket for this tuple in the main
+				 * hash table or skew hash table.
+				 */
+				node->hj_CurHashValue = hashvalue;
+				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+										  &node->hj_CurBucketNo, &batchno);
+				node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable,
+																 hashvalue);
+				node->hj_CurTuple = NULL;
+
+				/*
+				 * for the hashloop fallback case, only initialize
+				 * hj_MatchedOuter to false during the first chunk. otherwise,
+				 * we will be resetting hj_MatchedOuter to false for an outer
+				 * tuple that has already matched an inner tuple. also,
+				 * hj_MatchedOuter should be set to false for batch 0. there
+				 * are no chunks for batch 0, and node->hj_InnerFirstChunk
+				 * isn't set to true until HJ_NEED_NEW_BATCH, so need to
+				 * handle batch 0 explicitly
+				 */
+
+				if (!node->hashloop_fallback || hashtable->curbatch == 0 || node->hj_InnerFirstChunk)
+					node->hj_MatchedOuter = false;
+
+				/*
+				 * The tuple might not belong to the current batch (where
+				 * "current batch" includes the skew buckets if any).
+				 */
+				if (batchno != hashtable->curbatch &&
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+				{
+					bool		shouldFree;
+					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
+																	  &shouldFree);
+
+					/*
+					 * Need to postpone this outer tuple to a later batch.
+					 * Save it in the corresponding outer-batch file.
+					 */
+					Assert(batchno > hashtable->curbatch);
+					ExecHashJoinSaveTuple(mintuple, hashvalue,
+										  &hashtable->outerBatchFile[batchno]);
+
+					if (shouldFree)
+						heap_free_minimal_tuple(mintuple);
+
+					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
+					continue;
+				}
+
+				if (node->hashloop_fallback)
+				{
+					/* first tuple of new batch */
+					if (node->hj_OuterMatchStatusesFile == NULL)
+					{
+						node->hj_OuterTupleCount = 0;
+						node->hj_OuterMatchStatusesFile = BufFileCreateTemp(false);
+					}
+
+					/* for fallback case, always increment tuple count */
+					node->hj_OuterTupleCount++;
+
+					/* Use the next byte on every 8th tuple */
+					if ((node->hj_OuterTupleCount - 1) % 8 == 0)
+					{
+						/*
+						 * first chunk of new batch, so write and initialize
+						 * enough bytes in the outer tuple match status file
+						 * to capture all tuples' match statuses
+						 */
+						if (node->hj_InnerFirstChunk)
+						{
+							node->hj_OuterCurrentByte = 0;
+							BufFileWrite(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+						}
+						/* otherwise, just read the next byte */
+						else
+							BufFileRead(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+					}
+				}
+
+				/* OK, let's scan the bucket for matches */
+				node->hj_JoinState = HJ_SCAN_BUCKET;
+
+				/* FALL THRU */
+
+			case HJ_SCAN_BUCKET:
+
+				/*
+				 * Scan the selected hash bucket for matches to current outer
+				 */
+				if (!ExecScanHashBucket(node, econtext))
+				{
+					/*
+					 * The current outer tuple has run out of matches, so
+					 * check whether to emit a dummy outer-join tuple.
+					 * Whether we emit one or not, the next state is
+					 * NEED_NEW_OUTER.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					if (!node->hashloop_fallback || node->hj_HashTable->curbatch == 0)
+					{
+						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+
+						if (slot != NULL)
+							return slot;
+					}
+					continue;
+				}
+
+				if (joinqual != NULL && !ExecQual(joinqual, econtext))
+				{
+					InstrCountFiltered1(node, 1);
+					break;
+				}
+
+				/*
+				 * We've got a match, but still need to test non-hashed quals.
+				 * ExecScanHashBucket already set up all the state needed to
+				 * call ExecQual.
+				 *
+				 * If we pass the qual, then save state for next call and have
+				 * ExecProject form the projection, store it in the tuple
+				 * table, and return the slot.
+				 *
+				 * Only the joinquals determine tuple match status, but all
+				 * quals must pass to actually return the tuple.
+				 */
+
+				node->hj_MatchedOuter = true;
+
+				/*
+				 * This is really only needed if HJ_FILL_INNER(node),
+				 * but we'll avoid the branch and just set it always.
+				 */
+				HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+
+				/* In an antijoin, we never return a matched tuple */
+				if (node->js.jointype == JOIN_ANTI)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					continue;
+				}
+
+				/*
+				 * If we only need to join to the first matching inner tuple,
+				 * then consider returning this one, but after that, continue
+				 * with next outer tuple.
+				 */
+				/* TODO: is semi-join correct for AHJ */
+				if (node->js.single_match)
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				/*
+				 * Set the match bit for this outer tuple in the match status
+				 * file
+				 */
+				if (node->hj_OuterMatchStatusesFile != NULL)
+				{
+					Assert(node->hashloop_fallback == true);
+					int			byte_to_set = (node->hj_OuterTupleCount - 1) / 8;
+					int			bit_to_set_in_byte = (node->hj_OuterTupleCount - 1) % 8;
+
+					BufFileSeek(node->hj_OuterMatchStatusesFile, 0, byte_to_set, SEEK_SET);
+
+					node->hj_OuterCurrentByte = node->hj_OuterCurrentByte | (1 << bit_to_set_in_byte);
+
+					BufFileWrite(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+				}
+
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				InstrCountFiltered2(node, 1);
+				break;
+
+			case HJ_FILL_INNER_TUPLES:
+
+				/*
+				 * We have finished a batch, but we are doing right/full join,
+				 * so any unmatched inner tuples in the hashtable have to be
+				 * emitted before we continue to the next batch.
+				 */
+				if (!ExecScanHashTableForUnmatched(node, econtext))
+				{
+					/* no more unmatched tuples */
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					continue;
+				}
+
+				/*
+				 * Generate a fake join tuple with nulls for the outer tuple,
+				 * and return it if it passes the non-join quals.
+				 */
+				econtext->ecxt_outertuple = node->hj_NullOuterTupleSlot;
+
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				InstrCountFiltered2(node, 1);
+				break;
+
+			case HJ_NEED_NEW_BATCH:
+
+				/*
+				 * Try to advance to next batch.  Done if there are no more.
+				 * for batches after batch 0 for which hashloop_fallback is
+				 * true, if inner is exhausted, need to consider emitting
+				 * unmatched tuples we should never get here when
+				 * hashloop_fallback is false but hj_InnerExhausted is true,
+				 * however, it felt more clear to check for hashloop_fallback
+				 * explicitly
+				 */
+				if (node->hashloop_fallback && HJ_FILL_OUTER(node) && node->hj_InnerExhausted)
+				{
+					/*
+					 * For hashloop fallback, outer tuples are not emitted
+					 * until directly before advancing the batch (after all
+					 * inner chunks have been processed).
+					 * node->hashloop_fallback should be true because it is
+					 * not reset to false until advancing the batches
+					 */
+					node->hj_InnerExhausted = false;
+					node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT;
+					break;
+				}
+
+				if (!ExecHashJoinAdvanceBatch(node))
+					return NULL;
+
+				/*
+				 * TODO: need to find a better way to distinguish if I should
+				 * load inner batch again than checking for outer batch file
+				 */
+				/* I need to also do this even if it is NULL when it is a ROJ */
+
+				/*
+				 * need to load inner again if it is an inner or left outer
+				 * join and there are outer tuples in the batch OR
+				 */
+
+				/*
+				 * if it is a ROJ and there are inner tuples in the batch --
+				 * should never have no tuples in either batch...
+				 */
+				if (BufFileRewindIfExists(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]) != NULL ||
+					(node->hj_HashTable->innerBatchFile[node->hj_HashTable->curbatch] != NULL && HJ_FILL_INNER(node)))
+					ExecHashJoinLoadInnerBatch(node);	/* TODO: should I ever
+														 * load inner when outer
+														 * file is not present? */
+
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				break;
+
+			case HJ_NEED_NEW_INNER_CHUNK:
+
+				if (!node->hashloop_fallback)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/*
+				 * it is the hashloop fallback case and there are no more
+				 * chunks inner is exhausted, so we must advance the batches
+				 */
+				if (node->hj_InnerPageOffset == 0L)
+				{
+					node->hj_InnerExhausted = true;
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/*
+				 * This is the hashloop fallback case and we have more chunks
+				 * in inner. curbatch > 0. Rewind outer batch file (if
+				 * present) so that we can start reading it. Rewind outer
+				 * match statuses file if present so that we can set match
+				 * bits as needed. Reset the tuple count and load the next
+				 * chunk of inner. Then proceed to get a new outer tuple from
+				 * our rewound outer batch file
+				 */
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+
+				/*
+				 * TODO: need to find a better way to distinguish if I should
+				 * load inner batch again than checking for outer batch file
+				 */
+				/* I need to also do this even if it is NULL when it is a ROJ */
+
+				/*
+				 * need to load inner again if it is an inner or left outer
+				 * join and there are outer tuples in the batch OR
+				 */
+
+				/*
+				 * if it is a ROJ and there are inner tuples in the batch --
+				 * should never have no tuples in either batch...
+				 */
+
+				/*
+				 * if outer is not null or if it is a ROJ and inner is not
+				 * null, must rewind outer match status and load inner
+				 */
+				if (BufFileRewindIfExists(node->hj_HashTable->outerBatchFile[node->hj_HashTable->curbatch]) != NULL ||
+					(node->hj_HashTable->innerBatchFile[node->hj_HashTable->curbatch] != NULL && HJ_FILL_INNER(node)))
+				{
+					BufFileRewindIfExists(node->hj_OuterMatchStatusesFile);
+					node->hj_OuterTupleCount = 0;
+					ExecHashJoinLoadInnerBatch(node);
+				}
+				break;
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
+
+				node->hj_OuterTupleCount = 0;
+				BufFileRewindIfExists(node->hj_OuterMatchStatusesFile);
+
+				/*
+				 * TODO: is it okay to use the hashtable to get the outer
+				 * batch file here?
+				 */
+				outerFileForAdaptiveRead = hashtable->outerBatchFile[hashtable->curbatch];
+				if (outerFileForAdaptiveRead == NULL)	/* TODO: could this
+														 * happen */
+				{
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+				BufFileRewindIfExists(outerFileForAdaptiveRead);
+
+				node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER;
+				/* fall through */
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+
+				outerFileForAdaptiveRead = hashtable->outerBatchFile[hashtable->curbatch];
+
+				while (true)
+				{
+					uint32		unmatchedOuterHashvalue;
+					TupleTableSlot *slot = ExecHashJoinGetSavedTuple(node,
+																	 outerFileForAdaptiveRead,
+																	 &unmatchedOuterHashvalue,
+																	 node->hj_OuterTupleSlot);
+
+					node->hj_OuterTupleCount++;
+
+					if (slot == NULL)
+					{
+						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						break;
+					}
+
+					unsigned char bit = (node->hj_OuterTupleCount - 1) % 8;
+
+					/* need to read the next byte */
+					if (bit == 0)
+						BufFileRead(node->hj_OuterMatchStatusesFile, &node->hj_OuterCurrentByte, 1);
+
+					/* if the match bit is set for this tuple, continue */
+					if ((node->hj_OuterCurrentByte >> bit) & 1)
+						continue;
+
+					/* if it is not a match then emit it NULL-extended */
+					econtext->ecxt_outertuple = slot;
+					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				}
+				/* came here from HJ_NEED_NEW_BATCH, so go back there */
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				break;
+
+			default:
+				elog(ERROR, "unrecognized hashjoin state: %d",
+					 (int) node->hj_JoinState);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelHashJoin
+ *
+ *		Parallel-aware version.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *			/* return: a tuple or NULL */
+ExecParallelHashJoin(PlanState *pstate)
+{
+	HashJoinState *node = castNode(HashJoinState, pstate);
+	PlanState  *outerNode;
+	HashState  *hashNode;
+	ExprState  *joinqual;
+	ExprState  *otherqual;
+	ExprContext *econtext;
+	HashJoinTable hashtable;
+	TupleTableSlot *outerTupleSlot;
+	uint32		hashvalue;
+	int			batchno;
+	ParallelHashJoinState *parallel_state;
+
+	/*
+	 * get information from HashJoin node
+	 */
+	joinqual = node->js.joinqual;
+	otherqual = node->js.ps.qual;
+	hashNode = (HashState *) innerPlanState(node);
+	outerNode = outerPlanState(node);
+	hashtable = node->hj_HashTable;
+	econtext = node->js.ps.ps_ExprContext;
+	parallel_state = hashNode->parallel_state;
+
+	bool		advance_from_probing = false;
+
+	/*
+	 * Reset per-tuple memory context to free any expression evaluation
+	 * storage allocated in the previous tuple cycle.
+	 */
+	ResetExprContext(econtext);
+
+	/*
+	 * run the hash join state machine
+	 */
+	for (;;)
+	{
+		SharedTuplestoreAccessor *outer_acc;
+
+		/*
+		 * It's possible to iterate this loop many times before returning a
+		 * tuple, in some pathological cases such as needing to move much of
+		 * the current batch to a later batch.  So let's check for interrupts
+		 * each time through.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		switch (node->hj_JoinState)
+		{
+			case HJ_BUILD_HASHTABLE:
+
+				/*
+				 * First time through: build hash table for inner relation.
+				 */
+				Assert(hashtable == NULL);
+				/* volatile int mybp = 0; while (mybp == 0); */
+
+				/*
+				 * The empty-outer optimization is not implemented for shared
+				 * hash tables, because no one participant can determine that
+				 * there are no outer tuples, and it's not yet clear that it's
+				 * worth the synchronization overhead of reaching consensus to
+				 * figure that out.  So we have to build the hash table.
+				 */
+				node->hj_FirstOuterTupleSlot = NULL;
+
+				/*
+				 * Create the hash table.  If using Parallel Hash, then
+				 * whoever gets here first will create the hash table and any
+				 * later arrivals will merely attach to it.
+				 */
+				node->hj_HashTable = hashtable = ExecHashTableCreate(hashNode,
+																	 node->hj_HashOperators,
+																	 node->hj_Collations,
+																	 HJ_FILL_INNER(node));
+
 				/*
 				 * Execute the Hash node, to build the hash table.  If using
 				 * Parallel Hash, then we'll try to help hashing unless we
@@ -311,66 +831,59 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_OuterNotEmpty = false;
 
-				if (parallel)
-				{
-					Barrier    *build_barrier;
-
-					build_barrier = &parallel_state->build_barrier;
-					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
-						   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
-					if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
-					{
-						/*
-						 * If multi-batch, we need to hash the outer relation
-						 * up front.
-						 */
-						if (hashtable->nbatch > 1)
-							ExecParallelHashJoinPartitionOuter(node);
-						BarrierArriveAndWait(build_barrier,
-											 WAIT_EVENT_HASH_BUILD_HASHING_OUTER);
-					}
-					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
-
-					/* Each backend should now select a batch to work on. */
-					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				Barrier    *build_barrier;
 
-					continue;
+				build_barrier = &parallel_state->build_barrier;
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
+					   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
+				if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
+				{
+					/*
+					 * If multi-batch, we need to hash the outer relation up
+					 * front.
+					 */
+					if (hashtable->nbatch > 1)
+						ExecParallelHashJoinPartitionOuter(node);
+					BarrierArriveAndWait(build_barrier,
+										 WAIT_EVENT_HASH_BUILD_HASHING_OUTER);
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
 
-				/* FALL THRU */
+				/* Each backend should now select a batch to work on. */
+				hashtable->curbatch = -1;
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
+
+				continue;
 
 			case HJ_NEED_NEW_OUTER:
 
 				/*
 				 * We don't have an outer tuple, try to get the next one
 				 */
-				if (parallel)
-					outerTupleSlot =
-						ExecParallelHashJoinOuterGetTuple(outerNode, node,
-														  &hashvalue);
-				else
-					outerTupleSlot =
-						ExecHashJoinOuterGetTuple(outerNode, node, &hashvalue);
+				outerTupleSlot =
+					ExecParallelHashJoinOuterGetTuple(outerNode, node,
+													  &hashvalue);
 
 				if (TupIsNull(outerTupleSlot))
 				{
-					/* end of batch, or maybe whole join */
+					/*
+					 * end of batch, or maybe whole join. for hashloop
+					 * fallback, all we know is outer batch is exhausted.
+					 * inner could have more chunks
+					 */
 					if (HJ_FILL_INNER(node))
 					{
 						/* set up to scan for unmatched inner tuples */
 						ExecPrepHashTableForUnmatched(node);
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
+						break;
 					}
-					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					continue;
+					advance_from_probing = true;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+					break;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -384,33 +897,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				node->hj_CurTuple = NULL;
 
 				/*
-				 * The tuple might not belong to the current batch (where
-				 * "current batch" includes the skew buckets if any).
+				 * for the hashloop fallback case, only initialize
+				 * hj_MatchedOuter to false during the first chunk. otherwise,
+				 * we will be resetting hj_MatchedOuter to false for an outer
+				 * tuple that has already matched an inner tuple. also,
+				 * hj_MatchedOuter should be set to false for batch 0. there
+				 * are no chunks for batch 0
 				 */
-				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
-				{
-					bool		shouldFree;
-					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
-																	  &shouldFree);
 
-					/*
-					 * Need to postpone this outer tuple to a later batch.
-					 * Save it in the corresponding outer-batch file.
-					 */
-					Assert(parallel_state == NULL);
-					Assert(batchno > hashtable->curbatch);
-					ExecHashJoinSaveTuple(mintuple, hashvalue,
-										  &hashtable->outerBatchFile[batchno]);
-
-					if (shouldFree)
-						heap_free_minimal_tuple(mintuple);
-
-					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
-					continue;
-				}
+				ParallelHashJoinBatch *phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
 
-				/* OK, let's scan the bucket for matches */
+				if (!phj_batch->parallel_hashloop_fallback || phj_batch->current_chunk_num == 1)
+					node->hj_MatchedOuter = false;
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
 				/* FALL THRU */
@@ -420,23 +918,25 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * Scan the selected hash bucket for matches to current outer
 				 */
-				if (parallel)
-				{
-					if (!ExecParallelScanHashBucket(node, econtext))
-					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
-					}
-				}
-				else
+				phj_batch = node->hj_HashTable->batches[node->hj_HashTable->curbatch].shared;
+
+				if (!ExecParallelScanHashBucket(node, econtext))
 				{
-					if (!ExecScanHashBucket(node, econtext))
+					/*
+					 * The current outer tuple has run out of matches, so
+					 * check whether to emit a dummy outer-join tuple.
+					 * Whether we emit one or not, the next state is
+					 * NEED_NEW_OUTER.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					if (!phj_batch->parallel_hashloop_fallback)
 					{
-						/* out of matches; check for possible outer-join fill */
-						node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-						continue;
+						TupleTableSlot *slot = emitUnmatchedOuterTuple(otherqual, econtext, node);
+
+						if (slot != NULL)
+							return slot;
 					}
+					continue;
 				}
 
 				/*
@@ -451,77 +951,55 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 * Only the joinquals determine tuple match status, but all
 				 * quals must pass to actually return the tuple.
 				 */
-				if (joinqual == NULL || ExecQual(joinqual, econtext))
+				if (joinqual != NULL && !ExecQual(joinqual, econtext))
 				{
-					node->hj_MatchedOuter = true;
-
-					if (parallel)
-					{
-						/*
-						 * Full/right outer joins are currently not supported
-						 * for parallel joins, so we don't need to set the
-						 * match bit.  Experiments show that it's worth
-						 * avoiding the shared memory traffic on large
-						 * systems.
-						 */
-						Assert(!HJ_FILL_INNER(node));
-					}
-					else
-					{
-						/*
-						 * This is really only needed if HJ_FILL_INNER(node),
-						 * but we'll avoid the branch and just set it always.
-						 */
-						HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
-					}
-
-					/* In an antijoin, we never return a matched tuple */
-					if (node->js.jointype == JOIN_ANTI)
-					{
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
-						continue;
-					}
+					InstrCountFiltered1(node, 1);
+					break;
+				}
 
-					/*
-					 * If we only need to join to the first matching inner
-					 * tuple, then consider returning this one, but after that
-					 * continue with next outer tuple.
-					 */
-					if (node->js.single_match)
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_MatchedOuter = true;
+				/*
+				 * Full/right outer joins are currently not supported
+				 * for parallel joins, so we don't need to set the
+				 * match bit.  Experiments show that it's worth
+				 * avoiding the shared memory traffic on large
+				 * systems.
+				 */
+				Assert(!HJ_FILL_INNER(node));
 
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
+				/*
+				 * TODO: how does this interact with PAHJ -- do I need to set
+				 * matchbit?
+				 */
+				/* In an antijoin, we never return a matched tuple */
+				if (node->js.jointype == JOIN_ANTI)
+				{
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					continue;
 				}
-				else
-					InstrCountFiltered1(node, 1);
-				break;
-
-			case HJ_FILL_OUTER_TUPLE:
 
 				/*
-				 * The current outer tuple has run out of matches, so check
-				 * whether to emit a dummy outer-join tuple.  Whether we emit
-				 * one or not, the next state is NEED_NEW_OUTER.
+				 * If we only need to join to the first matching inner tuple,
+				 * then consider returning this one, but after that continue
+				 * with next outer tuple.
 				 */
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				if (node->js.single_match)
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
-				if (!node->hj_MatchedOuter &&
-					HJ_FILL_OUTER(node))
+				/*
+				 * Set the match bit for this outer tuple in the match status
+				 * file
+				 */
+				if (phj_batch->parallel_hashloop_fallback)
 				{
-					/*
-					 * Generate a fake join tuple with nulls for the inner
-					 * tuple, and return it if it passes the non-join quals.
-					 */
-					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+					sts_set_outer_match_status(hashtable->batches[hashtable->curbatch].outer_tuples,
+											   econtext->ecxt_outertuple->tuplenum);
 
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
 				}
+				if (otherqual == NULL || ExecQual(otherqual, econtext))
+					return ExecProject(node->js.ps.ps_ProjInfo);
+				else
+					InstrCountFiltered2(node, 1);
 				break;
 
 			case HJ_FILL_INNER_TUPLES:
@@ -534,7 +1012,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					advance_from_probing = true;
+					node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
 					continue;
 				}
 
@@ -552,22 +1031,108 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 			case HJ_NEED_NEW_BATCH:
 
+				phj_batch = hashtable->batches[hashtable->curbatch].shared;
+
 				/*
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
-				if (parallel)
+				if (!ExecParallelHashJoinNewBatch(node))
+					return NULL;	/* end of parallel-aware join */
+
+				if (node->last_worker
+					&& HJ_FILL_OUTER(node) && phj_batch->parallel_hashloop_fallback)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-aware join */
+					node->last_worker = false;
+					node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT;
+					break;
 				}
-				else
+				if (node->hj_HashTable->curbatch == 0)
 				{
-					if (!ExecHashJoinNewBatch(node))
-						return NULL;	/* end of parallel-oblivious join */
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+					break;
 				}
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				advance_from_probing = false;
+				node->hj_JoinState = HJ_NEED_NEW_INNER_CHUNK;
+				/* FALL THRU */
+
+			case HJ_NEED_NEW_INNER_CHUNK:
+
+				if (hashtable->curbatch == -1 || hashtable->curbatch == 0)
+
+					/*
+					 * If we're not attached to a batch at all then we need to
+					 * go to HJ_NEED_NEW_BATCH. Also batch 0 doesn't have more
+					 * than 1 chunk.
+					 */
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				else if (!ExecParallelHashJoinNewChunk(node, advance_from_probing))
+					/* If there's no next chunk then go to the next batch */
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+				else
+					node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER_INIT:
+
+				outer_acc = hashtable->batches[hashtable->curbatch].outer_tuples;
+				sts_reinitialize(outer_acc);
+				sts_begin_parallel_scan(outer_acc);
+
+				node->hj_JoinState = HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER;
+				/* FALL THRU */
+
+			case HJ_ADAPTIVE_EMIT_UNMATCHED_OUTER:
+
+				Assert(node->combined_bitmap != NULL);
+
+				outer_acc = node->hj_HashTable->batches[node->hj_HashTable->curbatch].outer_tuples;
+
+				MinimalTuple tuple;
+
+				do
+				{
+					tupleMetadata metadata;
+
+					if ((tuple = sts_parallel_scan_next(outer_acc, &metadata)) == NULL)
+						break;
+
+					int			bytenum = metadata.tupleid / 8;
+					unsigned char bit = metadata.tupleid % 8;
+					unsigned char byte_to_check = 0;
+
+					/* seek to byte to check */
+					if (BufFileSeek(node->combined_bitmap, 0, bytenum, SEEK_SET))
+						ereport(ERROR,
+								(errcode_for_file_access(),
+								 errmsg("could not rewind shared outer temporary file: %m")));
+					/* read byte containing ntuple bit */
+					if (BufFileRead(node->combined_bitmap, &byte_to_check, 1) == 0)
+						ereport(ERROR,
+								(errcode_for_file_access(),
+								 errmsg("could not read byte in outer match status bitmap: %m.")));
+					/* if bit is set */
+					bool		match = ((byte_to_check) >> bit) & 1;
+
+					if (!match)
+						break;
+				} while (1);
+
+				if (tuple == NULL)
+				{
+					sts_end_parallel_scan(outer_acc);
+					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					break;
+				}
+
+				/* Emit the unmatched tuple */
+				ExecForceStoreMinimalTuple(tuple,
+										   econtext->ecxt_outertuple,
+										   false);
+				econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
+
+				return ExecProject(node->js.ps.ps_ProjInfo);
+
+
 			default:
 				elog(ERROR, "unrecognized hashjoin state: %d",
 					 (int) node->hj_JoinState);
@@ -575,38 +1140,6 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 	}
 }
 
-/* ----------------------------------------------------------------
- *		ExecHashJoin
- *
- *		Parallel-oblivious version.
- * ----------------------------------------------------------------
- */
-static TupleTableSlot *			/* return: a tuple or NULL */
-ExecHashJoin(PlanState *pstate)
-{
-	/*
-	 * On sufficiently smart compilers this should be inlined with the
-	 * parallel-aware branches removed.
-	 */
-	return ExecHashJoinImpl(pstate, false);
-}
-
-/* ----------------------------------------------------------------
- *		ExecParallelHashJoin
- *
- *		Parallel-aware version.
- * ----------------------------------------------------------------
- */
-static TupleTableSlot *			/* return: a tuple or NULL */
-ExecParallelHashJoin(PlanState *pstate)
-{
-	/*
-	 * On sufficiently smart compilers this should be inlined with the
-	 * parallel-oblivious branches removed.
-	 */
-	return ExecHashJoinImpl(pstate, true);
-}
-
 /* ----------------------------------------------------------------
  *		ExecInitHashJoin
  *
@@ -641,6 +1174,18 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->js.ps.ExecProcNode = ExecHashJoin;
 	hjstate->js.jointype = node->join.jointype;
 
+	hjstate->hashloop_fallback = false;
+	hjstate->hj_InnerPageOffset = 0L;
+	hjstate->hj_InnerFirstChunk = false;
+	hjstate->hj_OuterCurrentByte = 0;
+
+	hjstate->hj_OuterMatchStatusesFile = NULL;
+	hjstate->hj_OuterTupleCount = 0;
+	hjstate->hj_InnerExhausted = false;
+
+	hjstate->last_worker = false;
+	hjstate->combined_bitmap = NULL;
+
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -792,6 +1337,30 @@ ExecEndHashJoin(HashJoinState *node)
 	ExecEndNode(innerPlanState(node));
 }
 
+
+static TupleTableSlot *
+emitUnmatchedOuterTuple(ExprState *otherqual, ExprContext *econtext, HashJoinState *hjstate)
+{
+	if (hjstate->hj_MatchedOuter)
+		return NULL;
+
+	if (!HJ_FILL_OUTER(hjstate))
+		return NULL;
+
+	econtext->ecxt_innertuple = hjstate->hj_NullInnerTupleSlot;
+
+	/*
+	 * Generate a fake join tuple with nulls for the inner tuple, and return
+	 * it if it passes the non-join quals.
+	 */
+
+	if (otherqual == NULL || ExecQual(otherqual, econtext))
+		return ExecProject(hjstate->js.ps.ps_ProjInfo);
+
+	InstrCountFiltered2(hjstate, 1);
+	return NULL;
+}
+
 /*
  * ExecHashJoinOuterGetTuple
  *
@@ -919,13 +1488,20 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	{
 		MinimalTuple tuple;
 
+		tupleMetadata metadata;
+		int			tupleid;
+
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+									   &metadata);
 		if (tuple != NULL)
 		{
+			/* where is this hashvalue being used? */
+			*hashvalue = metadata.hashvalue;
+			tupleid = metadata.tupleid;
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+			hjstate->hj_OuterTupleSlot->tuplenum = tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
@@ -938,20 +1514,17 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 }
 
 /*
- * ExecHashJoinNewBatch
+ * ExecHashJoinAdvanceBatch
  *		switch to a new hashjoin batch
  *
  * Returns true if successful, false if there are no more batches.
  */
 static bool
-ExecHashJoinNewBatch(HashJoinState *hjstate)
+ExecHashJoinAdvanceBatch(HashJoinState *hjstate)
 {
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
@@ -1026,10 +1599,36 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		curbatch++;
 	}
 
+	hjstate->hj_InnerPageOffset = 0L;
+	hjstate->hj_InnerFirstChunk = true;
+	hjstate->hashloop_fallback = false; /* new batch, so start it off false */
+	if (hjstate->hj_OuterMatchStatusesFile != NULL)
+		BufFileClose(hjstate->hj_OuterMatchStatusesFile);
+	hjstate->hj_OuterMatchStatusesFile = NULL;
 	if (curbatch >= nbatch)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	return true;
+}
+
+/*
+ * Returns true if there are more chunks left, false otherwise
+ */
+static bool
+ExecHashJoinLoadInnerBatch(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *innerFile;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+
+	off_t		tup_start_offset;
+	off_t		chunk_start_offset;
+	off_t		tup_end_offset;
+	int64		current_saved_size;
+	int			current_fileno;
 
 	/*
 	 * Reload the hash table with the new inner batch (which could be empty)
@@ -1038,171 +1637,60 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 
 	innerFile = hashtable->innerBatchFile[curbatch];
 
+	/* Reset this even if the innerfile is not null */
+	hjstate->hj_InnerFirstChunk = hjstate->hj_InnerPageOffset == 0L;
+
 	if (innerFile != NULL)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		/* TODO: should fileno always be 0? */
+		if (BufFileSeek(innerFile, 0, hjstate->hj_InnerPageOffset, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not rewind hash-join temporary file: %m")));
 
+		chunk_start_offset = hjstate->hj_InnerPageOffset;
+		tup_end_offset = hjstate->hj_InnerPageOffset;
 		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
 												 innerFile,
 												 &hashvalue,
 												 hjstate->hj_HashTupleSlot)))
 		{
+			/* next tuple's start is last tuple's end */
+			tup_start_offset = tup_end_offset;
+			/* after we got the tuple, figure out what the offset is */
+			BufFileTell(innerFile, &current_fileno, &tup_end_offset);
+			current_saved_size = tup_end_offset - chunk_start_offset;
+			if (current_saved_size > work_mem)
+			{
+				hjstate->hj_InnerPageOffset = tup_start_offset;
+				hjstate->hashloop_fallback = true;
+				return true;
+			}
+			hjstate->hj_InnerPageOffset = tup_end_offset;
+
 			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
+			 * NOTE: some tuples may be sent to future batches. With current
+			 * hashloop patch, however, it is not possible for
+			 * hashtable->nbatch to be increased here
 			 */
 			ExecHashTableInsert(hashtable, slot, hashvalue);
 		}
 
+		/* this is the end of the file */
+		hjstate->hj_InnerPageOffset = 0L;
+
 		/*
-		 * after we build the hash table, the inner batch file is no longer
+		 * after we processed all chunks, the inner batch file is no longer
 		 * needed
 		 */
 		BufFileClose(innerFile);
 		hashtable->innerBatchFile[curbatch] = NULL;
 	}
 
-	/*
-	 * Rewind outer batch file (if present), so that we can start reading it.
-	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
-	{
-		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
-	}
-
-	return true;
-}
-
-/*
- * Choose a batch to work on, and attach to it.  Returns true if successful,
- * false if there are no more batches.
- */
-static bool
-ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
-{
-	HashJoinTable hashtable = hjstate->hj_HashTable;
-	int			start_batchno;
-	int			batchno;
-
-	/*
-	 * If we started up so late that the batch tracking array has been freed
-	 * already by ExecHashTableDetach(), then we are finished.  See also
-	 * ExecParallelHashEnsureBatchAccessors().
-	 */
-	if (hashtable->batches == NULL)
-		return false;
-
-	/*
-	 * If we were already attached to a batch, remember not to bother checking
-	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
-	 */
-	if (hashtable->curbatch >= 0)
-	{
-		hashtable->batches[hashtable->curbatch].done = true;
-		ExecHashTableDetachBatch(hashtable);
-	}
-
-	/*
-	 * Search for a batch that isn't done.  We use an atomic counter to start
-	 * our search at a different batch in every participant when there are
-	 * more batches than participants.
-	 */
-	batchno = start_batchno =
-		pg_atomic_fetch_add_u32(&hashtable->parallel_state->distributor, 1) %
-		hashtable->nbatch;
-	do
-	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
-		if (!hashtable->batches[batchno].done)
-		{
-			SharedTuplestoreAccessor *inner_tuples;
-			Barrier    *batch_barrier =
-			&hashtable->batches[batchno].shared->batch_barrier;
-
-			switch (BarrierAttach(batch_barrier))
-			{
-				case PHJ_BATCH_ELECTING:
-
-					/* One backend allocates the hash table. */
-					if (BarrierArriveAndWait(batch_barrier,
-											 WAIT_EVENT_HASH_BATCH_ELECTING))
-						ExecParallelHashTableAlloc(hashtable, batchno);
-					/* Fall through. */
-
-				case PHJ_BATCH_ALLOCATING:
-					/* Wait for allocation to complete. */
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_ALLOCATING);
-					/* Fall through. */
-
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOADING);
-					/* Fall through. */
-
-				case PHJ_BATCH_PROBING:
-
-					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
-					 */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
-					return true;
-
-				case PHJ_BATCH_DONE:
-
-					/*
-					 * Already done.  Detach and go around again (if any
-					 * remain).
-					 */
-					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
-					hashtable->curbatch = -1;
-					break;
-
-				default:
-					elog(ERROR, "unexpected batch phase %d",
-						 BarrierPhase(batch_barrier));
-			}
-		}
-		batchno = (batchno + 1) % hashtable->nbatch;
-	} while (batchno != start_batchno);
-
 	return false;
 }
 
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1396,6 +1884,8 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
+
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1413,8 +1903,11 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 
 			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
 									  &batchno);
-			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+			metadata.hashvalue = hashvalue;
+			SharedTuplestoreAccessor *accessor = hashtable->batches[batchno].outer_tuples;
+
+			metadata.tupleid = sts_increment_tuplenum(accessor);
+			sts_puttuple(accessor, &metadata, mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
@@ -1463,6 +1956,7 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	 * and space_allowed.
 	 */
 	pstate->nbatch = 0;
+	pstate->batch_increases = 0;
 	pstate->space_allowed = 0;
 	pstate->batches = InvalidDsaPointer;
 	pstate->old_batches = InvalidDsaPointer;
@@ -1502,7 +1996,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *cxt)
 	/*
 	 * It would be possible to reuse the shared hash table in single-batch
 	 * cases by resetting and then fast-forwarding build_barrier to
-	 * PHJ_BUILD_DONE and batch 0's batch_barrier to PHJ_BATCH_PROBING, but
+	 * PHJ_BUILD_DONE and batch 0's batch_barrier to PHJ_BATCH_CHUNKING, but
 	 * currently shared hash tables are already freed by now (by the last
 	 * participant to detach from the batch).  We could consider keeping it
 	 * around for single-batch joins.  We'd also need to adjust
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7169509a79..eeddf0009c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3767,6 +3767,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_LOADING:
 			event_name = "Hash/Batch/Loading";
 			break;
+		case WAIT_EVENT_HASH_BATCH_PROBING:
+			event_name = "Hash/Batch/Probing";
+			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATING:
 			event_name = "Hash/Build/Allocating";
 			break;
@@ -3779,6 +3782,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
 			event_name = "Hash/Build/HashingOuter";
 			break;
+		case WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES:
+			event_name = "Hash/Build/CreateOuterMatchStatusBitmapFiles";
+			break;
 		case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
 			event_name = "Hash/GrowBatches/Allocating";
 			break;
@@ -3803,6 +3809,21 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
 			event_name = "Hash/GrowBuckets/Reinserting";
 			break;
+		case WAIT_EVENT_HASH_CHUNK_ELECTING:
+			event_name = "Hash/Chunk/Electing";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_LOADING:
+			event_name = "Hash/Chunk/Loading";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_PROBING:
+			event_name = "Hash/Chunk/Probing";
+			break;
+		case WAIT_EVENT_HASH_CHUNK_DONE:
+			event_name = "Hash/Chunk/Done";
+			break;
+		case WAIT_EVENT_HASH_ADVANCE_CHUNK:
+			event_name = "Hash/Chunk/Final";
+			break;
 		case WAIT_EVENT_LOGICAL_SYNC_DATA:
 			event_name = "LogicalSyncData";
 			break;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 35e8f12e62..cb49329d3f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -269,6 +269,57 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
 	return file;
 }
 
+/*
+ * Open a shared file created by any backend if it exists, otherwise return NULL
+ */
+BufFile *
+BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name)
+{
+	BufFile    *file;
+	char		segment_name[MAXPGPATH];
+	Size		capacity = 16;
+	File	   *files;
+	int			nfiles = 0;
+
+	files = palloc(sizeof(File) * capacity);
+
+	/*
+	 * We don't know how many segments there are, so we'll probe the
+	 * filesystem to find out.
+	 */
+	for (;;)
+	{
+		/* See if we need to expand our file segment array. */
+		if (nfiles + 1 > capacity)
+		{
+			capacity *= 2;
+			files = repalloc(files, sizeof(File) * capacity);
+		}
+		/* Try to load a segment. */
+		SharedSegmentName(segment_name, name, nfiles);
+		files[nfiles] = SharedFileSetOpen(fileset, segment_name);
+		if (files[nfiles] <= 0)
+			break;
+		++nfiles;
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	/*
+	 * If we didn't find any files at all, then no BufFile exists with this
+	 * name.
+	 */
+	if (nfiles == 0)
+		return NULL;
+	file = makeBufFileCommon(nfiles);
+	file->files = files;
+	file->readOnly = true;		/* Can't write to files opened this way */
+	file->fileset = fileset;
+	file->name = pstrdup(name);
+
+	return file;
+}
+
 /*
  * Open a file that was previously created in another backend (or this one)
  * with BufFileCreateShared in the same SharedFileSet using the same name.
@@ -843,3 +894,17 @@ BufFileAppend(BufFile *target, BufFile *source)
 
 	return startBlock;
 }
+
+BufFile *
+BufFileRewindIfExists(BufFile *bufFile)
+{
+	if (bufFile != NULL)
+	{
+		if (BufFileSeek(bufFile, 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
+		return bufFile;
+	}
+	return NULL;
+}
diff --git a/src/backend/storage/ipc/barrier.c b/src/backend/storage/ipc/barrier.c
index 3e200e02cc..58455dda1c 100644
--- a/src/backend/storage/ipc/barrier.c
+++ b/src/backend/storage/ipc/barrier.c
@@ -195,6 +195,91 @@ BarrierArriveAndWait(Barrier *barrier, uint32 wait_event_info)
 	return elected;
 }
 
+/*
+ * Arrive at this barrier, wait for all other attached participants to arrive
+ * too and then return.  Sets the current phase to next_phase.  The caller must
+ * be attached.
+ *
+ * While waiting, pg_stat_activity shows a wait_event_type and wait_event
+ * controlled by the wait_event_info passed in, which should be a value from
+ * one of the WaitEventXXX enums defined in pgstat.h.
+ *
+ * Return true in one arbitrarily chosen participant.  Return false in all
+ * others.  The return code can be used to elect one participant to execute a
+ * phase of work that must be done serially while other participants wait.
+ */
+bool
+BarrierArriveExplicitAndWait(Barrier *barrier, int next_phase, uint32 wait_event_info)
+{
+	bool		release = false;
+	bool		elected;
+	int			start_phase;
+
+	SpinLockAcquire(&barrier->mutex);
+	start_phase = barrier->phase;
+	++barrier->arrived;
+	if (barrier->arrived == barrier->participants)
+	{
+		release = true;
+		barrier->arrived = 0;
+		barrier->phase = next_phase;
+		barrier->elected = next_phase;
+	}
+	SpinLockRelease(&barrier->mutex);
+
+	/*
+	 * If we were the last expected participant to arrive, we can release our
+	 * peers and return true to indicate that this backend has been elected to
+	 * perform any serial work.
+	 */
+	if (release)
+	{
+		ConditionVariableBroadcast(&barrier->condition_variable);
+
+		return true;
+	}
+
+	/*
+	 * Otherwise we have to wait for the last participant to arrive and
+	 * advance the phase.
+	 */
+	elected = false;
+	ConditionVariablePrepareToSleep(&barrier->condition_variable);
+	for (;;)
+	{
+		/*
+		 * We know that phase must either be start_phase, indicating that we
+		 * need to keep waiting, or next_phase, indicating that the last
+		 * participant that we were waiting for has either arrived or detached
+		 * so that the next phase has begun.  The phase cannot advance any
+		 * further than that without this backend's participation, because
+		 * this backend is attached.
+		 */
+		SpinLockAcquire(&barrier->mutex);
+		Assert(barrier->phase == start_phase || barrier->phase == next_phase);
+		release = barrier->phase == next_phase;
+		if (release && barrier->elected != next_phase)
+		{
+			/*
+			 * Usually the backend that arrives last and releases the other
+			 * backends is elected to return true (see above), so that it can
+			 * begin processing serial work while it has a CPU timeslice.
+			 * However, if the barrier advanced because someone detached, then
+			 * one of the backends that is awoken will need to be elected.
+			 */
+			barrier->elected = barrier->phase;
+			elected = true;
+		}
+		SpinLockRelease(&barrier->mutex);
+		if (release)
+			break;
+		ConditionVariableSleep(&barrier->condition_variable, wait_event_info);
+	}
+	ConditionVariableCancelSleep();
+
+	return elected;
+}
+
 /*
  * Arrive at this barrier, but detach rather than waiting.  Returns true if
  * the caller was the last to detach.
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index c3ab494a45..3cd2ec2e2e 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -60,6 +60,8 @@ typedef struct SharedTuplestoreParticipant
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;
+			  //TODO:does this belong elsewhere
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -92,10 +94,15 @@ struct SharedTuplestoreAccessor
 	BlockNumber write_page;		/* The next page to write to. */
 	char	   *write_pointer;	/* Current write pointer within chunk. */
 	char	   *write_end;		/* One past the end of the current chunk. */
+
+	/* Bitmap of matched outer tuples (currently only used for hashjoin). */
+	BufFile    *outer_match_status_file;
 };
 
 static void sts_filename(char *name, SharedTuplestoreAccessor *accessor,
 						 int participant);
+static void
+			sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant);
 
 /*
  * Return the amount of shared memory required to hold SharedTuplestore for a
@@ -137,6 +144,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -166,6 +174,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	accessor->sts = sts;
 	accessor->fileset = fileset;
 	accessor->context = CurrentMemoryContext;
+	accessor->outer_match_status_file = NULL;
 
 	return accessor;
 }
@@ -343,6 +352,7 @@ sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
 			sts_flush_chunk(accessor);
 		}
 
+		/* TODO: exercise this code with a test (over-sized tuple) */
 		/* It may still not be enough in the case of a gigantic tuple. */
 		if (accessor->write_pointer + size >= accessor->write_end)
 		{
@@ -621,6 +631,129 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+/*  TODO: fix signedness */
+int
+sts_increment_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+void
+sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor)
+{
+	uint32		tuplenum = pg_atomic_read_u32(&accessor->sts->ntuples);
+
+	/* don't make the outer match status file if there are no tuples */
+	if (tuplenum == 0)
+		return;
+
+	char		name[MAXPGPATH];
+
+	sts_bitmap_filename(name, accessor, accessor->participant);
+
+	accessor->outer_match_status_file = BufFileCreateShared(accessor->fileset, name);
+
+	/* TODO: check this math. tuplenumber will be too high. */
+	uint32		num_to_write = tuplenum / 8 + 1;
+
+	unsigned char byteToWrite = 0;
+
+	BufFileWrite(accessor->outer_match_status_file, &byteToWrite, num_to_write);
+
+	if (BufFileSeek(accessor->outer_match_status_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+void
+sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum)
+{
+	BufFile    *parallel_outer_matchstatuses = accessor->outer_match_status_file;
+	unsigned char current_outer_byte;
+
+	BufFileSeek(parallel_outer_matchstatuses, 0, tuplenum / 8, SEEK_SET);
+	BufFileRead(parallel_outer_matchstatuses, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (tuplenum % 8);
+
+	if (BufFileSeek(parallel_outer_matchstatuses, 0, -1, SEEK_CUR) != 0)
+		elog(ERROR, "there is a problem with outer match status file. pid %i.", MyProcPid);
+	BufFileWrite(parallel_outer_matchstatuses, &current_outer_byte, 1);
+}
+
+void
+sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor)
+{
+	BufFileClose(accessor->outer_match_status_file);
+}
+
+BufFile *
+sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor)
+{
+	/* TODO: this tries to close an outer match status file for */
+	/* each participant in the tuplestore. technically, only participants */
+	/* in the barrier could have outer match status files, however, */
+	/* all but one participant continue on and detach from the barrier */
+	/* so we won't have a reliable way to close only files for those attached */
+	/* to the barrier */
+	BufFile   **statuses = palloc(sizeof(BufFile *) * accessor->sts->nparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	int			statuses_length = 0;
+
+	for (int i = 0; i < accessor->sts->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+
+		sts_bitmap_filename(bitmap_filename, accessor, i);
+		BufFile    *file = BufFileOpenSharedIfExists(accessor->fileset, bitmap_filename);
+
+		if (file != NULL)
+			statuses[statuses_length++] = file;
+	}
+
+	BufFile    *combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)
+		//make it while not
+			EOF
+		{
+			unsigned char combined_byte = 0;
+
+			for (int i = 0; i < statuses_length; i++)
+			{
+				unsigned char read_byte;
+
+				BufFileRead(statuses[i], &read_byte, 1);
+				combined_byte |= read_byte;
+			}
+
+			BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+		}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	return combined_bitmap_file;
+}
+
+
+static void
+sts_bitmap_filename(char *name, SharedTuplestoreAccessor *accessor, int participant)
+{
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->sts->name, participant);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/executor/adaptiveHashjoin.h b/src/include/executor/adaptiveHashjoin.h
new file mode 100644
index 0000000000..030a04c5c0
--- /dev/null
+++ b/src/include/executor/adaptiveHashjoin.h
@@ -0,0 +1,9 @@
+#ifndef ADAPTIVE_HASHJOIN_H
+#define ADAPTIVE_HASHJOIN_H
+
+
+extern bool ExecParallelHashJoinNewChunk(HashJoinState *hjstate, bool advance_from_probing);
+extern bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+
+
+#endif							/* ADAPTIVE_HASHJOIN_H */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 79b634e8ed..3e4f4bd574 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -148,11 +148,27 @@ typedef struct HashMemoryChunkData *HashMemoryChunk;
  * followed by variable-sized objects, they are arranged in contiguous memory
  * but not accessed directly as an array.
  */
+/*  TODO: maybe remove lock from ParallelHashJoinBatch and use pstate->lock */
+/*  and the PHJBatchAccessor to coordinate access to the PHJ batch similar to */
+/*  other users of that lock */
 typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
 
+	/* Parallel Adaptive Hash Join members */
+
+	/*
+	 * after finishing build phase, parallel_hashloop_fallback cannot change,
+	 * and does not require a lock to read
+	 */
+	bool		parallel_hashloop_fallback;
+	int			total_num_chunks;
+	int			current_chunk_num;
+	size_t		estimated_chunk_size;
+	Barrier		chunk_barrier;
+	LWLock		lock;
+
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
 	size_t		estimated_size; /* size of buckets + chunks while writing */
@@ -243,6 +259,8 @@ typedef struct ParallelHashJoinState
 	int			nparticipants;
 	size_t		space_allowed;
 	size_t		total_tuples;	/* total number of inner tuples */
+	int			batch_increases;	/* TODO: make this an atomic so I don't
+									 * need the lock to increment it? */
 	LWLock		lock;			/* lock protecting the above */
 
 	Barrier		build_barrier;	/* synchronization for the build phases */
@@ -263,10 +281,16 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
+#define PHJ_BATCH_CHUNKING				2
+#define PHJ_BATCH_OUTER_MATCH_STATUS_PROCESSING 3
 #define PHJ_BATCH_DONE					4
 
+#define PHJ_CHUNK_ELECTING				0
+#define PHJ_CHUNK_LOADING				1
+#define PHJ_CHUNK_PROBING				2
+#define PHJ_CHUNK_DONE					3
+#define PHJ_CHUNK_FINAL					4
+
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
 #define PHJ_GROW_BATCHES_ALLOCATING		1
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 1336fde6b4..dfc221e6a1 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -40,9 +40,8 @@ extern void ExecHashTableInsert(HashJoinTable hashtable,
 extern void ExecParallelHashTableInsert(HashJoinTable hashtable,
 										TupleTableSlot *slot,
 										uint32 hashvalue);
-extern void ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
-													TupleTableSlot *slot,
-													uint32 hashvalue);
+extern void
+			ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable, TupleTableSlot *slot, uint32 hashvalue);
 extern bool ExecHashGetHashValue(HashJoinTable hashtable,
 								 ExprContext *econtext,
 								 List *hashkeys,
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5ab..9497b10972 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tuplenum;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,7 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
-
+	slot->tuplenum = 0;
 	return slot;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d5b38b879..93fe6dddb2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -14,6 +14,7 @@
 #ifndef EXECNODES_H
 #define EXECNODES_H
 
+#include <storage/buffile.h>
 #include "access/tupconvert.h"
 #include "executor/instrument.h"
 #include "fmgr.h"
@@ -1952,6 +1953,22 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+
+	/* hashloop fallback */
+	bool		hashloop_fallback;
+	/* hashloop fallback inner side */
+	bool		hj_InnerFirstChunk;
+	bool		hj_InnerExhausted;
+	off_t		hj_InnerPageOffset;
+
+	/* hashloop fallback outer side */
+	unsigned char hj_OuterCurrentByte;
+	BufFile    *hj_OuterMatchStatusesFile;	/* serial AHJ */
+	int64		hj_OuterTupleCount;
+
+	/* parallel hashloop fallback outer side */
+	bool		last_worker;
+	BufFile    *combined_bitmap;
 } HashJoinState;
 
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aecb6013f0..340086a7e7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -815,6 +815,7 @@ typedef enum
  * it is waiting for a notification from another process.
  * ----------
  */
+/*  TODO: add WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES? */
 typedef enum
 {
 	WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
@@ -827,10 +828,12 @@ typedef enum
 	WAIT_EVENT_HASH_BATCH_ALLOCATING,
 	WAIT_EVENT_HASH_BATCH_ELECTING,
 	WAIT_EVENT_HASH_BATCH_LOADING,
+	WAIT_EVENT_HASH_BATCH_PROBING,
 	WAIT_EVENT_HASH_BUILD_ALLOCATING,
 	WAIT_EVENT_HASH_BUILD_ELECTING,
 	WAIT_EVENT_HASH_BUILD_HASHING_INNER,
 	WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+	WAIT_EVENT_HASH_BUILD_CREATE_OUTER_MATCH_STATUS_BITMAP_FILES,
 	WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
 	WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
@@ -839,6 +842,11 @@ typedef enum
 	WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
 	WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+	WAIT_EVENT_HASH_CHUNK_ELECTING,
+	WAIT_EVENT_HASH_CHUNK_LOADING,
+	WAIT_EVENT_HASH_CHUNK_PROBING,
+	WAIT_EVENT_HASH_CHUNK_DONE,
+	WAIT_EVENT_HASH_ADVANCE_CHUNK,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
 	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
 	WAIT_EVENT_MQ_INTERNAL,
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index d71927cc2f..a3c867024c 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -36,6 +36,7 @@ typedef struct Barrier
 
 extern void BarrierInit(Barrier *barrier, int num_workers);
 extern bool BarrierArriveAndWait(Barrier *barrier, uint32 wait_event_info);
+extern bool BarrierArriveExplicitAndWait(Barrier *barrier, int next_phase, uint32 wait_event_info);
 extern bool BarrierArriveAndDetach(Barrier *barrier);
 extern int	BarrierAttach(Barrier *barrier);
 extern bool BarrierDetach(Barrier *barrier);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 60433f35b4..f790f7e121 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -48,7 +48,10 @@ extern long BufFileAppend(BufFile *target, BufFile *source);
 
 extern BufFile *BufFileCreateShared(SharedFileSet *fileset, const char *name);
 extern void BufFileExportShared(BufFile *file);
+extern BufFile *BufFileOpenSharedIfExists(SharedFileSet *fileset, const char *name);
 extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name);
 extern void BufFileDeleteShared(SharedFileSet *fileset, const char *name);
 
+extern BufFile *BufFileRewindIfExists(BufFile *bufFile);
+
 #endif							/* BUFFILE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..793f660eb4 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -212,6 +212,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_LOCK_MANAGER,
 	LWTRANCHE_PREDICATE_LOCK_MANAGER,
 	LWTRANCHE_PARALLEL_HASH_JOIN,
+	LWTRANCHE_PARALLEL_HASH_JOIN_BATCH,
 	LWTRANCHE_PARALLEL_QUERY_DSA,
 	LWTRANCHE_SESSION_DSA,
 	LWTRANCHE_SESSION_RECORD_TABLE,
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9754504cc5..6152ac163d 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,19 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+
+/*  TODO: conflicting types for tupleid with accessor->sts->ntuples (uint32) */
+/*  TODO: use a union for tupleid (uint32) (make this a uint64) and chunk number (int) */
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	int			tupleid;		/* tuple id on outer side and chunk number for
+								 * inner side */
+}			__attribute__((packed));
+
+/*  TODO: make sure I can get rid of packed now that using sizeof(struct) */
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -58,4 +71,13 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+
+extern int	sts_increment_tuplenum(SharedTuplestoreAccessor *accessor);
+
+extern void sts_make_outer_match_status_file(SharedTuplestoreAccessor *accessor);
+extern void sts_set_outer_match_status(SharedTuplestoreAccessor *accessor, uint32 tuplenum);
+extern void sts_close_outer_match_status_file(SharedTuplestoreAccessor *accessor);
+extern BufFile *sts_combine_outer_match_status_files(SharedTuplestoreAccessor *accessor);
+
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/adaptive_hj.out b/src/test/regress/expected/adaptive_hj.out
new file mode 100644
index 0000000000..fe24acd255
--- /dev/null
+++ b/src/test/regress/expected/adaptive_hj.out
@@ -0,0 +1,1233 @@
+-- TODO: remove some of these tests and make the test file faster
+create schema adaptive_hj;
+set search_path=adaptive_hj;
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t1(a int);
+create table t2(b int);
+-- serial setup
+set work_mem=64;
+set enable_mergejoin to off;
+-- TODO: make this function general
+create or replace function explain_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Serial_Test_1 reset
+-- TODO: refactor into procedure or change to drop table
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_1 setup
+truncate table t1;
+insert into t1 values(1),(2);
+insert into t1 select i from generate_series(1,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+truncate table t2;
+insert into t2 values(2),(3),(11);
+insert into t2 select i from generate_series(2,10)i;
+insert into t2 select 2 from generate_series(2,7)i;
+-- Serial_Test_1.1
+-- TODO: automate the checking for expected number of chunks (explain option?)
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 falls back with 2 chunks with 2 unmatched tuples emitted at EOB 
+-- batch 3 falls back with 5 chunks with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=67 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=18 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=18 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+  1 |   
+  1 |   
+(67 rows)
+
+select * from t1, t2 where a = b order by b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+(65 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+(66 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a  | b  
+----+----
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  2 |  2
+  3 |  3
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+    | 11
+  1 |   
+  1 |   
+(68 rows)
+
+-- Serial_Test_1.2 setup
+analyze t1; analyze t2;
+-- Serial_Test_1.2
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Right Join (actual rows=67 loops=1)
+         Hash Cond: (t2.b = t1.a)
+         ->  Seq Scan on t2 (actual rows=18 loops=1)
+         ->  Hash (actual rows=17 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t1 (actual rows=17 loops=1)
+(7 rows)
+
+-- Serial_Test_2 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_2 setup:
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+-- Serial_Test_2.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 does not fall back with 1 unmatched tuple
+-- batch 3 does not fall back with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=7 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=4 loops=1)
+         ->  Hash (actual rows=5 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=5 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 1 |  
+(7 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+(7 rows)
+
+-- TODO: check coverage for emitting ummatched inner tuples
+-- Serial_Test_2.1.a
+-- results checking for inner join
+select * from t1 left outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 1 |  
+(7 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+(6 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+(7 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+   | 4
+ 1 |  
+(8 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+(6 rows)
+
+-- Serial_Test_2.2
+analyze t1; analyze t2;
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Right Join (actual rows=7 loops=1)
+         Hash Cond: (t2.b = t1.a)
+         ->  Seq Scan on t2 (actual rows=5 loops=1)
+         ->  Hash (actual rows=4 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t1 (actual rows=4 loops=1)
+(7 rows)
+
+-- Serial_Test_3 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+-- Serial_Test_3 setup:
+truncate table t1;
+insert into t1 values(1),(1);
+insert into t1 select 2 from generate_series(1,7)i;
+insert into t1 select i from generate_series(3,10)i;
+truncate table t2;
+insert into t2 select 2 from generate_series(1,7)i;
+insert into t2 values(3),(3);
+insert into t2 select i from generate_series(5,9)i;
+-- Serial_Test_3.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with 1 unmatched tuple
+-- batch 2 does not fall back with 2 unmatched tuples
+-- batch 3 falls back with 4 chunks with 1 unmatched tuple
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=60 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=14 loops=1)
+               Buckets: 2048  Batches: 4  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=14 loops=1)
+(7 rows)
+
+select * from t1 left outer join t2 on a = b order by b, a;
+ a  | b 
+----+---
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  3 | 3
+  3 | 3
+  5 | 5
+  6 | 6
+  7 | 7
+  8 | 8
+  9 | 9
+  1 |  
+  1 |  
+  4 |  
+ 10 |  
+(60 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t1 right outer join t2 on a = b order by a, b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t1 full outer join t2 on a = b order by b, a;
+ a  | b 
+----+---
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  2 | 2
+  3 | 3
+  3 | 3
+  5 | 5
+  6 | 6
+  7 | 7
+  8 | 8
+  9 | 9
+  1 |  
+  1 |  
+  4 |  
+ 10 |  
+(60 rows)
+
+select * from t1, t2 where a = b order by b;
+ a | b 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+-- Serial_Test_3.2 
+-- swap join order
+select * from t2 left outer join t1 on a = b order by a, b;
+ b | a 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t2, t1 where a = b order by a;
+ b | a 
+---+---
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 2 | 2
+ 3 | 3
+ 3 | 3
+ 5 | 5
+ 6 | 6
+ 7 | 7
+ 8 | 8
+ 9 | 9
+(56 rows)
+
+select * from t2 right outer join t1 on a = b order by b, a;
+ b | a  
+---+----
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 3 |  3
+ 3 |  3
+ 5 |  5
+ 6 |  6
+ 7 |  7
+ 8 |  8
+ 9 |  9
+   |  1
+   |  1
+   |  4
+   | 10
+(60 rows)
+
+select * from t2 full outer join t1 on a = b order by a, b;
+ b | a  
+---+----
+   |  1
+   |  1
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 2 |  2
+ 3 |  3
+ 3 |  3
+   |  4
+ 5 |  5
+ 6 |  6
+ 7 |  7
+ 8 |  8
+ 9 |  9
+   | 10
+(60 rows)
+
+-- Serial_Test_3.3 setup
+analyze t1; analyze t2;
+-- Serial_Test_3.3
+-- doesn't spill
+select * from explain_multi_batch();
+                    explain_multi_batch                     
+------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=60 loops=1)
+         Hash Cond: (t1.a = t2.b)
+         ->  Seq Scan on t1 (actual rows=17 loops=1)
+         ->  Hash (actual rows=14 loops=1)
+               Buckets: 1024  Batches: 1  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=14 loops=1)
+(7 rows)
+
+-- Serial_Test_4 setup
+drop table t1;
+create table t1(b int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+drop table t2;
+create table t2(a int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+-- Serial_Test_4.1
+-- spills in 32 batches
+--batch 0 does not fall back with 1 unmatched outer tuple (15)
+--batch 1 falls back with 396 chunks.
+--batch 2 falls back with 402 chunks with 1 unmatched outer tuple (1)
+--batch 3 falls back with 389 chunks with 1 unmatched outer tuple (8)
+--batch 4 falls back with 409 chunks with no unmatched outer tuples
+--batch 5 falls back with 366 chunks with 1 unmatched outer tuple (4)
+--batch 6 falls back with 407 chunks with 1 unmatched outer tuple (11)
+--batch 7 falls back with 382 chunks with unmatched outer tuple (10)
+--batch 8 falls back with 413 chunks with no unmatched outer tuples
+--batch 9 falls back with 371 chunks with 1 unmatched outer tuple (3)
+--batch 10 falls back with 389 chunks with no unmatched outer tuples
+--batch 11 falls back with 408 chunks with no unmatched outer tuples
+--batch 12 falls back with 387 chunks with no unmatched outer tuples
+--batch 13 falls back with 402 chunks with 1 unmatched outer tuple (18) 
+--batch 14 falls back with 369 chunks with 1 unmatched outer tuple (9)
+--batch 15 falls back with 387 chunks with no unmatched outer tuples
+--batch 16 falls back with 365 chunks with no unmatched outer tuples
+--batch 17 falls back with 403 chunks with 2 unmatched outer tuples (14,19)
+--batch 18 falls back with 375 chunks with no unmatched outer tuples
+--batch 19 falls back with 384 chunks with no unmatched outer tuples
+--batch 20 falls back with 377 chunks with 1 unmatched outer tuple (12)
+--batch 22 falls back with 401 chunks with no unmatched outer tuples
+--batch 23 falls back with 396 chunks with no unmatched outer tuples
+--batch 24 falls back with 387 chunks with 1 unmatched outer tuple (5)
+--batch 25 falls back with 399 chunks with 1 unmatched outer tuple (7)
+--batch 26 falls back with 387 chunks.
+--batch 27 falls back with 442 chunks.
+--batch 28 falls back with 385 chunks with 1 unmatched outer tuple (17)
+--batch 29 falls back with 375 chunks.
+--batch 30 falls back with 404 chunks with 1 unmatched outer tuple (6)
+--batch 31 falls back with 396 chunks with 2 unmatched outer tuples (13,16)
+select * from explain_multi_batch();
+                                     explain_multi_batch                                      
+----------------------------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+   ->  Hash Left Join (actual rows=18210 loops=1)
+         Hash Cond: (t1.b = t2.a)
+         ->  Seq Scan on t1 (actual rows=291 loops=1)
+         ->  Hash (actual rows=25081 loops=1)
+               Buckets: 2048 (originally 1024)  Batches: 32 (originally 1)  Memory Usage: xxx
+               ->  Seq Scan on t2 (actual rows=25081 loops=1)
+(7 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+-- used to give wrong results because there is a whole batch of outer which is
+-- empty and so the inner doesn't emit unmatched tuples with ROJ
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 43081
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b; 
+ count 
+-------
+ 43099
+(1 row)
+
+-- Test_6 non-negligible amount of data test case
+-- TODO: doesn't finish with my code when it is set to be serial
+-- it does finish when it is parallel -- the serial version is either simply too
+-- slow or has a bug -- I tried it with less data and it did finish, so it must
+-- just be really slow
+-- inner join shouldn't even need to make the unmatched files
+-- it finishes eventually if I decrease data amount
+--drop table simple;
+--create table simple as
+ -- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+--alter table simple set (parallel_workers = 2);
+--analyze simple;
+--
+--drop table extremely_skewed;
+--create table extremely_skewed (id int, t text);
+--alter table extremely_skewed set (autovacuum_enabled = 'false');
+--alter table extremely_skewed set (parallel_workers = 2);
+--analyze extremely_skewed;
+--insert into extremely_skewed
+--  select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+--  from generate_series(1, 20000);
+--update pg_class
+--  set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+--  where relname = 'extremely_skewed';
+--set work_mem=64;
+--set enable_mergejoin to off;
+--explain (analyze, costs off, timing off)
+  --select * from simple r join extremely_skewed s using (id);
+--select * from explain_multi_batch();
+drop table t1;
+drop table t2;
+drop function explain_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema adaptive_hj;
diff --git a/src/test/regress/expected/parallel_adaptive_hj.out b/src/test/regress/expected/parallel_adaptive_hj.out
new file mode 100644
index 0000000000..e5e7f9aa4f
--- /dev/null
+++ b/src/test/regress/expected/parallel_adaptive_hj.out
@@ -0,0 +1,343 @@
+create schema parallel_adaptive_hj;
+set search_path=parallel_adaptive_hj;
+-- TODO: anti-semi-join and semi-join tests
+-- TODO: check if test2 and 3 are different at all
+-- TODO: add test for parallel-oblivious parallel hash join
+-- TODO: make this function general
+create or replace function explain_parallel_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+-- parallel setup
+set enable_nestloop to off;
+set enable_mergejoin to off;
+set  min_parallel_table_scan_size = 0;
+set  parallel_setup_cost = 0;
+set  enable_parallel_hash = on;
+set  enable_hashjoin = on;
+set  max_parallel_workers_per_gather = 1;
+set  work_mem = 64;
+-- Parallel_Test_1 setup
+drop table if exists t1;
+NOTICE:  table "t1" does not exist, skipping
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+analyze t1;
+drop table if exists t2;
+NOTICE:  table "t2" does not exist, skipping
+create table t2(b int);
+insert into t2 select i from generate_series(4,2500)i;
+insert into t2 select 2 from generate_series(1,10)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+-- Parallel_Test_1.1
+-- spills in 4 batches
+-- 1 resize of nbatches
+-- no batch falls back
+select * from explain_parallel_multi_batch();
+                                      explain_parallel_multi_batch                                       
+---------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 4 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+-- need an aggregate to exercise the code but still want to know if we are
+-- emitting the right unmatched outer tuples
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+-- Parallel_Test_1.1.a
+-- results checking for inner join
+-- doesn't fall back
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+   198
+(1 row)
+
+-- Parallel_Test_1.1.b
+-- results checking for right outer join
+-- doesn't exercise the fallback code but just checking results
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+  2687
+(1 row)
+
+-- Parallel_Test_1.1.c
+-- results checking for full outer join
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+  2689
+(1 row)
+
+-- Parallel_Test_1.2
+-- spill and doesn't have to resize nbatches
+analyze t2;
+select * from explain_parallel_multi_batch();
+                           explain_parallel_multi_batch                           
+----------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 2048  Batches: 4  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+-- Parallel_Test_1.3
+-- doesn't spill
+-- does resize nbuckets
+set work_mem = '4MB';
+select * from explain_parallel_multi_batch();
+                           explain_parallel_multi_batch                           
+----------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=100 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=29 loops=1)
+                     ->  Parallel Hash (actual rows=1254 loops=2)
+                           Buckets: 4096  Batches: 1  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2507 loops=1)
+(11 rows)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+   200
+(1 row)
+
+set work_mem = 64;
+-- Parallel_Test_3
+-- big example
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=9105 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=146 loops=2)
+                     ->  Parallel Hash (actual rows=12540 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 16 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=12540 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+-- TODO: check what each of these is exercising -- chunk num, etc and write that
+-- down
+-- also, note that this example did reveal with ROJ that it wasn't working, so
+-- maybe keep that but it is not parallel
+-- make sure the plans make sense for the code we are writing
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 18210
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 18192
+(1 row)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 43081
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+ 43099
+(1 row)
+
+-- Parallel_Test_4
+-- spill and resize nbatches 2x
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,1000)i;
+insert into t2 select 2 from generate_series(1,4000)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+insert into t1 values(500);
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=38006 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=15 loops=2)
+                     ->  Parallel Hash (actual rows=2498 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 16 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=2498 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 76011
+(1 row)
+
+select count(*) from t1, t2 where a = b;
+ count 
+-------
+ 76009
+(1 row)
+
+select count(*) from t1 right outer join t2 on a = b;
+ count 
+-------
+ 76997
+(1 row)
+
+select count(*) from t1 full outer join t2 on a = b;
+ count 
+-------
+ 76999
+(1 row)
+
+select count(a) from t1 left outer join t2 on a = b;
+ count 
+-------
+ 76011
+(1 row)
+
+-- Parallel_Test_5
+-- revealed race condition because two workers are working on a chunked batch
+-- only 2 unmatched tuples
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i%1111 from generate_series(200,10000)i;
+delete from t2 where b = 115;
+delete from t2 where b = 200;
+insert into t2 select 2 from generate_series(1,4000);
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 values(115);
+insert into t1 values(200);
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+select * from explain_parallel_multi_batch();
+                                       explain_parallel_multi_batch                                       
+----------------------------------------------------------------------------------------------------------
+ Finalize Aggregate (actual rows=1 loops=1)
+   ->  Gather (actual rows=2 loops=1)
+         Workers Planned: 1
+         Workers Launched: 1
+         ->  Partial Aggregate (actual rows=1 loops=2)
+               ->  Parallel Hash Left Join (actual rows=363166 loops=2)
+                     Hash Cond: (t1.a = t2.b)
+                     ->  Parallel Seq Scan on t1 (actual rows=146 loops=2)
+                     ->  Parallel Hash (actual rows=6892 loops=2)
+                           Buckets: 1024 (originally 1024)  Batches: 32 (originally 1)  Memory Usage: xxx
+                           ->  Parallel Seq Scan on t2 (actual rows=6892 loops=2)
+(11 rows)
+
+select count(*) from t1 left outer join t2 on a = b;
+ count  
+--------
+ 726331
+(1 row)
+
+-- without count(*), can't reproduce desired plan so can't rely on results
+select count(*) from t1 left outer join t2 on a = b;
+ count  
+--------
+ 726331
+(1 row)
+
+drop table if exists t1;
+drop table if exists t2;
+drop function explain_parallel_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema parallel_adaptive_hj;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..518dd6d021 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 adaptive_hj parallel_adaptive_hj
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/post_schedule b/src/test/regress/post_schedule
new file mode 100644
index 0000000000..7824ecf7bf
--- /dev/null
+++ b/src/test/regress/post_schedule
@@ -0,0 +1,8 @@
+test: object_address
+test: tablesample
+test: groupingsets
+test: drop_operator
+test: password
+test: identity
+test: generated
+test: join_hash
diff --git a/src/test/regress/pre_schedule b/src/test/regress/pre_schedule
new file mode 100644
index 0000000000..4105b0fa03
--- /dev/null
+++ b/src/test/regress/pre_schedule
@@ -0,0 +1,120 @@
+# src/test/regress/serial_schedule
+# This should probably be in an order similar to parallel_schedule.
+test: tablespace
+test: boolean
+test: char
+test: name
+test: varchar
+test: text
+test: int2
+test: int4
+test: int8
+test: oid
+test: float4
+test: float8
+test: bit
+test: numeric
+test: txid
+test: uuid
+test: enum
+test: money
+test: rangetypes
+test: pg_lsn
+test: regproc
+test: strings
+test: numerology
+test: point
+test: lseg
+test: line
+test: box
+test: path
+test: polygon
+test: circle
+test: date
+test: time
+test: timetz
+test: timestamp
+test: timestamptz
+test: interval
+test: inet
+test: macaddr
+test: macaddr8
+test: tstypes
+test: geometry
+test: horology
+test: regex
+test: oidjoins
+test: type_sanity
+test: opr_sanity
+test: misc_sanity
+test: comments
+test: expressions
+test: create_function_1
+test: create_type
+test: create_table
+test: create_function_2
+test: copy
+test: copyselect
+test: copydml
+test: insert
+test: insert_conflict
+test: create_misc
+test: create_operator
+test: create_procedure
+test: create_index
+test: create_index_spgist
+test: create_view
+test: index_including
+test: index_including_gist
+test: create_aggregate
+test: create_function_3
+test: create_cast
+test: constraints
+test: triggers
+test: select
+test: inherit
+test: typed_table
+test: vacuum
+test: drop_if_exists
+test: updatable_views
+test: roleattributes
+test: create_am
+test: hash_func
+test: errors
+test: sanity_check
+test: select_into
+test: select_distinct
+test: select_distinct_on
+test: select_implicit
+test: select_having
+test: subselect
+test: union
+test: case
+test: join
+test: adaptive_hj
+test: parallel_adaptive_hj
+test: aggregates
+test: transactions
+ignore: random
+test: random
+test: portals
+test: arrays
+test: btree_index
+test: hash_index
+test: update
+test: delete
+test: namespace
+test: prepared_xacts
+test: brin
+test: gin
+test: gist
+test: spgist
+test: privileges
+test: init_privs
+test: security_label
+test: collate
+test: matview
+test: lock
+test: replica_identity
+test: rowsecurity
+
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..15867f3196 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -91,6 +91,8 @@ test: subselect
 test: union
 test: case
 test: join
+test: adaptive_hj
+test: parallel_adaptive_hj
 test: aggregates
 test: transactions
 ignore: random
diff --git a/src/test/regress/sql/adaptive_hj.sql b/src/test/regress/sql/adaptive_hj.sql
new file mode 100644
index 0000000000..a5af798ea8
--- /dev/null
+++ b/src/test/regress/sql/adaptive_hj.sql
@@ -0,0 +1,240 @@
+-- TODO: remove some of these tests and make the test file faster
+create schema adaptive_hj;
+set search_path=adaptive_hj;
+drop table if exists t1;
+drop table if exists t2;
+create table t1(a int);
+create table t2(b int);
+
+-- serial setup
+set work_mem=64;
+set enable_mergejoin to off;
+-- TODO: make this function general
+create or replace function explain_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Serial_Test_1 reset
+-- TODO: refactor into procedure or change to drop table
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+-- Serial_Test_1 setup
+truncate table t1;
+insert into t1 values(1),(2);
+insert into t1 select i from generate_series(1,10)i;
+insert into t1 select 2 from generate_series(1,5)i;
+truncate table t2;
+insert into t2 values(2),(3),(11);
+insert into t2 select i from generate_series(2,10)i;
+insert into t2 select 2 from generate_series(2,7)i;
+
+-- Serial_Test_1.1
+-- TODO: automate the checking for expected number of chunks (explain option?)
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 falls back with 2 chunks with 2 unmatched tuples emitted at EOB 
+-- batch 3 falls back with 5 chunks with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+
+-- Serial_Test_1.2 setup
+analyze t1; analyze t2;
+
+-- Serial_Test_1.2
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+
+-- Serial_Test_2 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+-- Serial_Test_2 setup:
+truncate table t1;
+insert into t1 values (1),(2),(2),(3);
+truncate table t2;
+insert into t2 values(2),(2),(3),(3),(4);
+
+-- Serial_Test_2.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with no unmatched tuples
+-- batch 2 does not fall back with 1 unmatched tuple
+-- batch 3 does not fall back with no unmatched tuples
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1 right outer join t2 on a = b order by a, b;
+
+-- TODO: check coverage for emitting ummatched inner tuples
+-- Serial_Test_2.1.a
+-- results checking for inner join
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+
+-- Serial_Test_2.2
+analyze t1; analyze t2;
+-- doesn't spill (happens to do a hash right join)
+select * from explain_multi_batch();
+
+-- Serial_Test_3 reset
+update pg_class set reltuples = 0, relpages = 0 where relname = 't2';
+update pg_class set reltuples = 0, relpages = 0 where relname = 't1';
+delete from pg_statistic where starelid = 't2'::regclass;
+delete from pg_statistic where starelid = 't1'::regclass;
+
+
+-- Serial_Test_3 setup:
+truncate table t1;
+insert into t1 values(1),(1);
+insert into t1 select 2 from generate_series(1,7)i;
+insert into t1 select i from generate_series(3,10)i;
+truncate table t2;
+insert into t2 select 2 from generate_series(1,7)i;
+insert into t2 values(3),(3);
+insert into t2 select i from generate_series(5,9)i;
+
+-- Serial_Test_3.1
+-- spills in 4 batches
+-- batch 1 falls back with 2 chunks with 1 unmatched tuple
+-- batch 2 does not fall back with 2 unmatched tuples
+-- batch 3 falls back with 4 chunks with 1 unmatched tuple
+-- batch 4 does not fall back with no unmatched tuples
+select * from explain_multi_batch();
+select * from t1 left outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+select * from t1 right outer join t2 on a = b order by a, b;
+select * from t1 full outer join t2 on a = b order by b, a;
+select * from t1, t2 where a = b order by b;
+
+-- Serial_Test_3.2 
+-- swap join order
+select * from t2 left outer join t1 on a = b order by a, b;
+select * from t2, t1 where a = b order by a;
+select * from t2 right outer join t1 on a = b order by b, a;
+select * from t2 full outer join t1 on a = b order by a, b;
+
+-- Serial_Test_3.3 setup
+analyze t1; analyze t2;
+
+-- Serial_Test_3.3
+-- doesn't spill
+select * from explain_multi_batch();
+
+-- Serial_Test_4 setup
+drop table t1;
+create table t1(b int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+drop table t2;
+create table t2(a int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+-- Serial_Test_4.1
+-- spills in 32 batches
+--batch 0 does not fall back with 1 unmatched outer tuple (15)
+--batch 1 falls back with 396 chunks.
+--batch 2 falls back with 402 chunks with 1 unmatched outer tuple (1)
+--batch 3 falls back with 389 chunks with 1 unmatched outer tuple (8)
+--batch 4 falls back with 409 chunks with no unmatched outer tuples
+--batch 5 falls back with 366 chunks with 1 unmatched outer tuple (4)
+--batch 6 falls back with 407 chunks with 1 unmatched outer tuple (11)
+--batch 7 falls back with 382 chunks with unmatched outer tuple (10)
+--batch 8 falls back with 413 chunks with no unmatched outer tuples
+--batch 9 falls back with 371 chunks with 1 unmatched outer tuple (3)
+--batch 10 falls back with 389 chunks with no unmatched outer tuples
+--batch 11 falls back with 408 chunks with no unmatched outer tuples
+--batch 12 falls back with 387 chunks with no unmatched outer tuples
+--batch 13 falls back with 402 chunks with 1 unmatched outer tuple (18) 
+--batch 14 falls back with 369 chunks with 1 unmatched outer tuple (9)
+--batch 15 falls back with 387 chunks with no unmatched outer tuples
+--batch 16 falls back with 365 chunks with no unmatched outer tuples
+--batch 17 falls back with 403 chunks with 2 unmatched outer tuples (14,19)
+--batch 18 falls back with 375 chunks with no unmatched outer tuples
+--batch 19 falls back with 384 chunks with no unmatched outer tuples
+--batch 20 falls back with 377 chunks with 1 unmatched outer tuple (12)
+--batch 22 falls back with 401 chunks with no unmatched outer tuples
+--batch 23 falls back with 396 chunks with no unmatched outer tuples
+--batch 24 falls back with 387 chunks with 1 unmatched outer tuple (5)
+--batch 25 falls back with 399 chunks with 1 unmatched outer tuple (7)
+--batch 26 falls back with 387 chunks.
+--batch 27 falls back with 442 chunks.
+--batch 28 falls back with 385 chunks with 1 unmatched outer tuple (17)
+--batch 29 falls back with 375 chunks.
+--batch 30 falls back with 404 chunks with 1 unmatched outer tuple (6)
+--batch 31 falls back with 396 chunks with 2 unmatched outer tuples (13,16)
+select * from explain_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+select count(a) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+-- used to give wrong results because there is a whole batch of outer which is
+-- empty and so the inner doesn't emit unmatched tuples with ROJ
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b; 
+
+-- Test_6 non-negligible amount of data test case
+-- TODO: doesn't finish with my code when it is set to be serial
+-- it does finish when it is parallel -- the serial version is either simply too
+-- slow or has a bug -- I tried it with less data and it did finish, so it must
+-- just be really slow
+-- inner join shouldn't even need to make the unmatched files
+-- it finishes eventually if I decrease data amount
+
+--drop table simple;
+--create table simple as
+ -- select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+--alter table simple set (parallel_workers = 2);
+--analyze simple;
+--
+--drop table extremely_skewed;
+--create table extremely_skewed (id int, t text);
+--alter table extremely_skewed set (autovacuum_enabled = 'false');
+--alter table extremely_skewed set (parallel_workers = 2);
+--analyze extremely_skewed;
+--insert into extremely_skewed
+--  select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+--  from generate_series(1, 20000);
+--update pg_class
+--  set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+--  where relname = 'extremely_skewed';
+
+--set work_mem=64;
+--set enable_mergejoin to off;
+--explain (analyze, costs off, timing off)
+  --select * from simple r join extremely_skewed s using (id);
+--select * from explain_multi_batch();
+
+drop table t1;
+drop table t2;
+drop function explain_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema adaptive_hj;
diff --git a/src/test/regress/sql/parallel_adaptive_hj.sql b/src/test/regress/sql/parallel_adaptive_hj.sql
new file mode 100644
index 0000000000..3071c5f82e
--- /dev/null
+++ b/src/test/regress/sql/parallel_adaptive_hj.sql
@@ -0,0 +1,182 @@
+create schema parallel_adaptive_hj;
+set search_path=parallel_adaptive_hj;
+
+-- TODO: anti-semi-join and semi-join tests
+
+-- TODO: check if test2 and 3 are different at all
+
+-- TODO: add test for parallel-oblivious parallel hash join
+
+-- TODO: make this function general
+create or replace function explain_parallel_multi_batch() returns setof text language plpgsql as
+$$
+declare ln text;
+begin
+    for ln in
+        explain (analyze, summary off, timing off, costs off)
+		select count(*) from t1 left outer join t2 on a = b
+    loop
+        ln := regexp_replace(ln, 'Memory Usage: \S*',  'Memory Usage: xxx');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- parallel setup
+set enable_nestloop to off;
+set enable_mergejoin to off;
+set  min_parallel_table_scan_size = 0;
+set  parallel_setup_cost = 0;
+set  enable_parallel_hash = on;
+set  enable_hashjoin = on;
+set  max_parallel_workers_per_gather = 1;
+set  work_mem = 64;
+
+-- Parallel_Test_1 setup
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+analyze t1;
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,2500)i;
+insert into t2 select 2 from generate_series(1,10)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+-- Parallel_Test_1.1
+-- spills in 4 batches
+-- 1 resize of nbatches
+-- no batch falls back
+select * from explain_parallel_multi_batch();
+-- need an aggregate to exercise the code but still want to know if we are
+-- emitting the right unmatched outer tuples
+select count(a) from t1 left outer join t2 on a = b;
+select count(*) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_1.1.a
+-- results checking for inner join
+-- doesn't fall back
+select count(*) from t1, t2 where a = b;
+-- Parallel_Test_1.1.b
+-- results checking for right outer join
+-- doesn't exercise the fallback code but just checking results
+select count(*) from t1 right outer join t2 on a = b;
+-- Parallel_Test_1.1.c
+-- results checking for full outer join
+select count(*) from t1 full outer join t2 on a = b;
+
+-- Parallel_Test_1.2
+-- spill and doesn't have to resize nbatches
+analyze t2;
+select * from explain_parallel_multi_batch();
+select count(a) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_1.3
+-- doesn't spill
+-- does resize nbuckets
+set work_mem = '4MB';
+select * from explain_parallel_multi_batch();
+select count(a) from t1 left outer join t2 on a = b;
+set work_mem = 64;
+
+
+-- Parallel_Test_3
+-- big example
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(20,25000)i;
+insert into t2 select 2 from generate_series(1,100)i;
+analyze t2;
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+
+-- TODO: check what each of these is exercising -- chunk num, etc and write that
+-- down
+-- also, note that this example did reveal with ROJ that it wasn't working, so
+-- maybe keep that but it is not parallel
+-- make sure the plans make sense for the code we are writing
+select count(*) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b;
+
+-- Parallel_Test_4
+-- spill and resize nbatches 2x
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i from generate_series(4,1000)i;
+insert into t2 select 2 from generate_series(1,4000)i;
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,11)i;
+insert into t1 select 2 from generate_series(1,18)i;
+insert into t1 values(500);
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+select count(*) from t1, t2 where a = b;
+select count(*) from t1 right outer join t2 on a = b;
+select count(*) from t1 full outer join t2 on a = b;
+select count(a) from t1 left outer join t2 on a = b;
+
+-- Parallel_Test_5
+-- revealed race condition because two workers are working on a chunked batch
+-- only 2 unmatched tuples
+
+drop table if exists t2;
+create table t2(b int);
+insert into t2 select i%1111 from generate_series(200,10000)i;
+delete from t2 where b = 115;
+delete from t2 where b = 200;
+insert into t2 select 2 from generate_series(1,4000);
+analyze t2;
+alter table t2 set (autovacuum_enabled = 'false');
+update pg_class
+  set reltuples = 10, relpages = pg_relation_size('t2') / 8192
+  where relname = 't2';
+
+drop table if exists t1;
+create table t1(a int);
+insert into t1 select i from generate_series(1,111)i;
+insert into t1 values(115);
+insert into t1 values(200);
+insert into t1 select 2 from generate_series(1,180)i;
+analyze t1;
+
+select * from explain_parallel_multi_batch();
+select count(*) from t1 left outer join t2 on a = b;
+
+-- without count(*), can't reproduce desired plan so can't rely on results
+select count(*) from t1 left outer join t2 on a = b;
+
+drop table if exists t1;
+drop table if exists t2;
+drop function explain_parallel_multi_batch();
+reset enable_mergejoin;
+reset work_mem;
+reset search_path;
+drop schema parallel_adaptive_hj;
-- 
2.20.1 (Apple Git-117)

#47

Melanie Plageman

melanieplageman@gmail.com

over 5 years ago

In reply to: Melanie Plageman (#46)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

I've attached a patch which should address some of the previous feedback
about code complexity. Two of my co-workers and I wrote what is
essentially a new prototype of the idea. It uses the main state machine
to route emitting unmatched tuples instead of introducing a separate
state. The logic for falling back is also more developed.

In addition to many assorted TODOs in the code, there are a few major
projects left:
- Batch 0 falling back
- Stripe barrier deadlock
- Performance improvements and testing

I will address the stripe barrier deadlock here. David is going to send
a separate email about batch 0 falling back.

There is a deadlock hazard in parallel hashjoin (pointed out by Thomas
Munro in the past). Workers attached to the stripe_barrier emit tuples
and then wait on that barrier.
I believe that that can be addressed starting with this
relatively unoptimized solution:
- after probing a stripe in a batch, a worker sets the status of that
batch to "tentatively done" and saves the stripe_barrier phase
- if that worker is not the only worker attached to that batch, it
detaches from both stripe and batch barriers and moves on to other
batches
- if that worker is the only worker attached to the batch, it will
proceed to load the next stripe of that batch, and, once it has
finished loading, it will set the status of the batch back to "not
done" for itself
- when the other worker encounters that batch again, if the
stripe_barrier phase has not moved forward, it will mark that batch as
done for itself. if the stripe_barrier phase has moved forward, it can
join in in probing this batch for the current stripe.

Attachments:

v6-0001-Implement-Adaptive-Hashjoin.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Implement-Adaptive-Hashjoin.patchDownload

From 330652a844f637238b37bce2c97de412430812c8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 26 Feb 2020 09:18:29 -0800
Subject: [PATCH v6] Implement Adaptive Hashjoin

If the inner side tuples of a hashjoin will not fit in memory, the
hashjoin can be executed in multiple batches. If the statistics on the
inner side relation are accurate, planner chooses a multi-batch
strategy and sets the number of batches.
The query executor measures the real size of the hashtable and increases
the number of batches if the hashtable grows too large.

The number of batches is always a power of two, so an increase in the
number of batches doubles it.

Serial hashjoin measures batch size lazily -- waiting until it is
loading a batch to determine if it will fit in memory.

Parallel hashjoin, on the other hand, completes all changes to the
number of batches during the build phase. If it doubles the number of
batches, it dumps all the tuples out, reassigns them to batches,
measures each batch, and checks that it will fit in the space allowed.

In both cases, the executor currently makes a best effort. If a
particular batch won't fit in memory, and, upon changing the number of
batches none of the tuples move to a new batch, the executor disables
growth in the number of batches globally. After growth is disabled, all
batches that would have previously triggered an increase in the number
of batches instead exceed the space allowed.

There is no mechanism to perform a hashjoin within memory constraints if
a run of tuples hash to the same batch. Also, hashjoin will continue to
double the number of batches if *some* tuples move each time -- even if
the batch will never fit in memory -- resulting in an explosion in the
number of batches (affecting performance negatively for multiple
reasons).

Adaptive hashjoin is a mechanism to process a run of inner side tuples
with join keys which hash to the same batch in a manner that is
efficient and respects the space allowed.

When an offending batch causes the number of batches to be doubled and
some percentage of the tuples would not move to a new batch, that batch
can be marked to "fall back". This mechanism replaces serial hashjoin's
"grow_enabled" flag and replaces part of the functionality of parallel
hashjoin's "growth = PHJ_GROWTH_DISABLED" flag. However, instead of
disabling growth in the number of batches for all batches, it only
prevents this batch from causing another increase in the number of
batches.

When the inner side of this batch is loaded into memory, stripes of
arbitrary tuples totaling work_mem in size are loaded into the
hashtable. After probing this stripe, the outer side batch is rewound
and the next stripe is loaded. Each stripe of inner is probed until all
tuples have been processed.

Tuples that match are emitted (depending on the join semantics of the
particular join type) during probing of a stripe. In order to make
left outer join work, unmatched tuples cannot be emitted NULL-extended
until all stripes have been probed. To address this, a bitmap is created
with a bit for each tuple of the outer side. If a tuple on the outer
side matches a tuple from the inner, the corresponding bit is set. At
the end of probing all stripes, the executor scans the bitmap and emits
unmatched outer tuples.

TODOs:
- Batch 0 falling back
- Implement stripe_barrier deadlock fix
- Fix semi-join
- Stripe instrumentation for parallel adaptive hashjoin
- Do benchmarking and experiment with different fallback threshholds
  (currently hardcoded to 80% but more parameterizable than before)
- Assorted TODOs in the code

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
Co-authored-by: David Kimura <dkimura@pivotal.io>
---
 src/backend/commands/explain.c            |  43 +-
 src/backend/executor/nodeHash.c           | 306 +++++--
 src/backend/executor/nodeHashjoin.c       | 652 ++++++++++++---
 src/backend/postmaster/pgstat.c           |  13 +-
 src/backend/utils/sort/Makefile           |   1 +
 src/backend/utils/sort/sharedbits.c       | 285 +++++++
 src/backend/utils/sort/sharedtuplestore.c | 112 ++-
 src/include/commands/explain.h            |   1 +
 src/include/executor/hashjoin.h           |  47 +-
 src/include/executor/instrument.h         |   7 +
 src/include/executor/nodeHash.h           |   1 +
 src/include/executor/tuptable.h           |   2 +
 src/include/nodes/execnodes.h             |   5 +
 src/include/pgstat.h                      |   5 +-
 src/include/utils/sharedbits.h            |  39 +
 src/include/utils/sharedtuplestore.h      |  19 +
 src/test/regress/expected/join_hash.out   | 945 +++++++++++++++++++++-
 src/test/regress/sql/join_hash.sql        | 127 +++
 18 files changed, 2444 insertions(+), 166 deletions(-)
 create mode 100644 src/backend/utils/sort/sharedbits.c
 create mode 100644 src/include/utils/sharedbits.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7ae6131676..fc26341244 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -184,6 +184,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			es->wal = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "settings") == 0)
 			es->settings = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "usage") == 0)
+			es->usage = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "timing") == 0)
 		{
 			timing_set = true;
@@ -312,6 +314,7 @@ NewExplainState(void)
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
+	es->usage = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
@@ -3026,22 +3029,50 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
 				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)",
 							 hinstrument.nbuckets,
 							 hinstrument.nbuckets_original,
 							 hinstrument.nbatch,
-							 hinstrument.nbatch_original,
-							 spacePeakKb);
+							 hinstrument.nbatch_original);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Batch: %d  Stripes: %d\n", fbs->batchno, fbs->numstripes);
+			}
 		}
 		else
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
-							 hinstrument.nbuckets, hinstrument.nbatch,
-							 spacePeakKb);
+							 "Buckets: %d  Batches: %d",
+							 hinstrument.nbuckets, hinstrument.nbatch);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batch: %d  Stripes: %d\n",
+								 fbs->batchno,
+								 fbs->numstripes);
+			}
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5da13ada72..6ecbc76ab5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -80,7 +80,6 @@ static bool ExecParallelHashTuplePrealloc(HashJoinTable hashtable,
 static void ExecParallelHashMergeCounters(HashJoinTable hashtable);
 static void ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable);
 
-
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -321,6 +320,27 @@ MultiExecParallelHash(HashState *node)
 				 * skew).
 				 */
 				pstate->growth = PHJ_GROWTH_DISABLED;
+
+				/*
+				 * In the current design, batch 0 cannot fall back. That
+				 * behavior is an artifact of the existing design where batch
+				 * 0 fills the initial hash table and as an optimization it
+				 * doesn't need a batch file. But, there is no real reason
+				 * that batch 0 shouldn't be allowed to spill.
+				 *
+				 * Consider a hash table where majority of tuples with
+				 * hashvalue 0. These tuples will never relocate no matter how
+				 * many batches exist. If you cannot exceed work_mem, then you
+				 * will be stuck infinitely trying to double the number of
+				 * batches in order to accommodate the tuples that can only
+				 * ever be in batch 0. So, we allow it to be set to fall back
+				 * during the build phase to avoid excessive batch increases
+				 * but we don't check it when loading the actual tuples, so we
+				 * may exceed space_allowed. We set it back to false here so
+				 * that it isn't true during any of the checks that may happen
+				 * during probing.
+				 */
+				hashtable->batches[0].shared->hashloop_fallback = false;
 			}
 	}
 
@@ -495,12 +515,14 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->curbatch = 0;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
-	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
 	hashtable->partialTuples = 0;
 	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
+	hashtable->hashloop_fallback = NULL;
+	hashtable->fallback_batches_stats = NULL;
+	hashtable->curstripe = -1;
 	hashtable->spaceUsed = 0;
 	hashtable->spacePeak = 0;
 	hashtable->spaceAllowed = space_allowed;
@@ -572,6 +594,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
@@ -866,6 +890,8 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 				BufFileClose(hashtable->innerBatchFile[i]);
 			if (hashtable->outerBatchFile[i])
 				BufFileClose(hashtable->outerBatchFile[i]);
+			if (hashtable->hashloop_fallback[i])
+				BufFileClose(hashtable->hashloop_fallback[i]);
 		}
 	}
 
@@ -876,6 +902,9 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/* Threshhold for tuple relocation during batch split for parallel and serial */
+#define MAX_RELOCATION 0.8
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
@@ -886,14 +915,18 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 {
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
+	int			childbatch;
 	int			nbatch;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
 	HashMemoryChunk oldchunks;
+	int			curbatch_outgoing_tuples;
+	int			childbatch_outgoing_tuples;
+	int			target_batch;
+	FallbackBatchStats *fallback_batch_stats;
 
-	/* do nothing if we've decided to shut off growth */
-	if (!hashtable->growEnabled)
+	if (hashtable->hashloop_fallback && hashtable->hashloop_fallback[curbatch])
 		return;
 
 	/* safety check to avoid overflow */
@@ -917,6 +950,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* time to establish the temp tablespaces, too */
 		PrepareTempTablespaces();
 	}
@@ -927,10 +962,14 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			repalloc(hashtable->innerBatchFile, nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			repalloc(hashtable->outerBatchFile, nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			repalloc(hashtable->hashloop_fallback, nbatch * sizeof(BufFile *));
 		MemSet(hashtable->innerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
 		MemSet(hashtable->outerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
+		MemSet(hashtable->hashloop_fallback + oldnbatch, 0,
+			   (nbatch - oldnbatch) * sizeof(BufFile *));
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -942,6 +981,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * no longer of the current batch.
 	 */
 	ninmemory = nfreed = 0;
+	curbatch_outgoing_tuples = childbatch_outgoing_tuples = 0;
+	childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;
 
 	/* If know we need to resize nbuckets, we can do it while rebatching. */
 	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
@@ -999,6 +1040,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* and add it back to the appropriate bucket */
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
+				curbatch_outgoing_tuples++;
 			}
 			else
 			{
@@ -1010,6 +1052,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 
 				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
+
+				/*
+				 * TODO: what to do about tuples that don't go to the child
+				 * batch or stay in the current batch? (this is why we are
+				 * counting tuples to child and curbatch with two diff
+				 * variables in case the tuples go to a batch that isn't the
+				 * child)
+				 */
+				if (batchno == childbatch)
+					childbatch_outgoing_tuples++;
 			}
 
 			/* next tuple in this chunk */
@@ -1030,21 +1082,33 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 #endif
 
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
-	 * further expansion of nbatch.  This situation implies that we have
-	 * enough tuples of identical hashvalues to overflow spaceAllowed.
-	 * Increasing nbatch will not fix it since there's no way to subdivide the
-	 * group any more finely. We have to just gut it out and hope the server
-	 * has enough RAM.
+	 * For now we do not support fallback in batch 0 as it is a special case
+	 * and assumed to fit in hashtable.
+	 */
+	if (curbatch == 0)
+		return;
+
+	/*
+	 * The same batch should not be marked to fall back more than once
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
-	{
-		hashtable->growEnabled = false;
 #ifdef HJDEBUG
-		printf("Hashjoin %p: disabling further increase of nbatch\n",
-			   hashtable);
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("childbatch %i targeted to fallback.", childbatch);
+	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	}
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && childbatch > 0)
+		target_batch = childbatch;
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && curbatch > 0)
+		target_batch = curbatch;
+	else
+		return;
+	hashtable->hashloop_fallback[target_batch] = BufFileCreateTemp(false);
+
+	fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = target_batch;
+	fallback_batch_stats->numstripes = 0;
+	hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
 }
 
 /*
@@ -1213,7 +1277,6 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 									 WAIT_EVENT_HASH_GROW_BATCHES_DECIDING))
 			{
 				bool		space_exhausted = false;
-				bool		extreme_skew_detected = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
@@ -1224,27 +1287,50 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				{
 					ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
 
+					/*
+					 * All batches were just created anew during
+					 * repartitioning
+					 */
+					Assert(!batch->hashloop_fallback);
+
+					/*
+					 * At the time of repartitioning, each batch updates its
+					 * estimated_size to reflect the size of the batch file on
+					 * disk. It is also updated when increasing preallocated
+					 * space in ExecParallelHashTuplePrealloc().  However,
+					 * batch 0 does not store anything on disk so it has no
+					 * estimated_size.
+					 *
+					 * We still want to allow batch 0 to trigger batch growth.
+					 * In order to do that, for batch 0 check whether the
+					 * actual size exceeds space_allowed. It is a little
+					 * backwards at this point as we would have already
+					 * exceeded inserted the allowed space.
+					 */
 					if (batch->space_exhausted ||
-						batch->estimated_size > pstate->space_allowed)
+						batch->estimated_size > pstate->space_allowed ||
+						batch->size > pstate->space_allowed)
 					{
 						int			parent;
+						float		frac_moved;
 
 						space_exhausted = true;
 
-						/*
-						 * Did this batch receive ALL of the tuples from its
-						 * parent batch?  That would indicate that further
-						 * repartitioning isn't going to help (the hash values
-						 * are probably all the same).
-						 */
 						parent = i % pstate->old_nbatch;
-						if (batch->ntuples == hashtable->batches[parent].shared->old_ntuples)
-							extreme_skew_detected = true;
+						frac_moved = batch->ntuples / (float) hashtable->batches[parent].shared->old_ntuples;
+
+						if (frac_moved >= MAX_RELOCATION)
+						{
+							batch->hashloop_fallback = true;
+							space_exhausted = false;
+						}
 					}
+					if (space_exhausted)
+						break;
 				}
 
-				/* Don't keep growing if it's not helping or we'd overflow. */
-				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+				/* Don't keep growing if we'd overflow. */
+				if (hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -1311,11 +1397,28 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 			{
 				size_t		tuple_size =
 				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+				tupleMetadata metadata;
 
 				/* It belongs in a later batch. */
+				ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+
+				LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+				if (batch->estimated_stripe_size + tuple_size > hashtable->parallel_state->space_allowed)
+				{
+					batch->maximum_stripe_number++;
+					batch->estimated_stripe_size = 0;
+				}
+
+				batch->estimated_stripe_size += tuple_size;
+
+				metadata.hashvalue = hashTuple->hashvalue;
+				metadata.stripe = batch->maximum_stripe_number;
+				LWLockRelease(&batch->lock);
+
 				hashtable->batches[batchno].estimated_size += tuple_size;
-				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+
+				sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 			}
 
 			/* Count this tuple. */
@@ -1363,27 +1466,41 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 	for (i = 1; i < old_nbatch; ++i)
 	{
 		MinimalTuple tuple;
-		uint32		hashvalue;
+		tupleMetadata metadata;
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &metadata.hashvalue)))
 		{
 			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
+			ParallelHashJoinBatch *batch;
 
 			/* Decide which partition it goes to in the new generation. */
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
 
 			hashtable->batches[batchno].estimated_size += tuple_size;
 			++hashtable->batches[batchno].ntuples;
 			++hashtable->batches[i].old_ntuples;
 
+			batch = hashtable->batches[batchno].shared;
+
 			/* Store the tuple its new batch. */
-			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+			LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+			if (batch->estimated_stripe_size + tuple_size > pstate->space_allowed)
+			{
+				batch->maximum_stripe_number++;
+				batch->estimated_stripe_size = 0;
+			}
+			batch->estimated_stripe_size += tuple_size;
+			metadata.stripe = batch->maximum_stripe_number;
+			LWLockRelease(&batch->lock);
+			/* Store the tuple its new batch. */
+			sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 
 			CHECK_FOR_INTERRUPTS();
 		}
@@ -1693,6 +1810,12 @@ retry:
 
 	if (batchno == 0)
 	{
+		/*
+		 * TODO: if spilling is enabled for batch 0 so that it can fall back,
+		 * we will need to stop loading batch 0 into the hashtable somewhere--
+		 * maybe here-- and switch to saving tuples to a file. Currently, this
+		 * will simply exceed the space allowed
+		 */
 		HashJoinTuple hashTuple;
 
 		/* Try to load it into memory. */
@@ -1715,10 +1838,17 @@ retry:
 	else
 	{
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+		ParallelHashJoinBatch *batch;
+		tupleMetadata metadata;
 
 		Assert(batchno > 0);
 
 		/* Try to preallocate space in the batch if necessary. */
+
+		/*
+		 * TODO: is it okay to only count the tuple when it doesn't fit in the
+		 * preallocated memory?
+		 */
 		if (hashtable->batches[batchno].preallocated < tuple_size)
 		{
 			if (!ExecParallelHashTuplePrealloc(hashtable, batchno, tuple_size))
@@ -1727,8 +1857,14 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
-					 tuple);
+		batch = hashtable->batches[batchno].shared;
+
+		metadata.hashvalue = hashvalue;
+		LWLockAcquire(&batch->lock, LW_SHARED);
+		metadata.stripe = batch->maximum_stripe_number;
+		LWLockRelease(&batch->lock);
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
 
@@ -2697,6 +2833,7 @@ ExecHashAccumInstrumentation(HashInstrumentation *instrument,
 									  hashtable->nbatch_original);
 	instrument->space_peak = Max(instrument->space_peak,
 								 hashtable->spacePeak);
+	instrument->fallback_batches_stats = hashtable->fallback_batches_stats;
 }
 
 /*
@@ -2850,6 +2987,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 	/* Check if it's time to grow batches or buckets. */
 	if (pstate->growth != PHJ_GROWTH_DISABLED)
 	{
+		ParallelHashJoinBatchAccessor batch = hashtable->batches[0];
+
 		Assert(curbatch == 0);
 		Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
 
@@ -2858,8 +2997,13 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		 * very large tuples or very low work_mem setting, we'll always allow
 		 * each backend to allocate at least one chunk.
 		 */
-		if (hashtable->batches[0].at_least_one_chunk &&
-			hashtable->batches[0].shared->size +
+
+		/*
+		 * TODO: get rid of this check for batch 0 and make it so that
+		 * batch 0 always has to keep trying to increase the number of batches
+		 */
+		if (!batch.shared->hashloop_fallback && batch.at_least_one_chunk &&
+			batch.shared->size +
 			chunk_size > pstate->space_allowed)
 		{
 			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -2891,6 +3035,11 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 
 	/* We are cleared to allocate a new chunk. */
 	chunk_shared = dsa_allocate(hashtable->area, chunk_size);
+
+	/*
+	 * TODO: if batch 0 will have stripes, need to account for this memory
+	 * there
+	 */
 	hashtable->batches[curbatch].shared->size += chunk_size;
 	hashtable->batches[curbatch].at_least_one_chunk = true;
 
@@ -2960,20 +3109,35 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
+
+		shared->hashloop_fallback = false;
+		/* TODO: is it okay to use the same tranche for this lock? */
+		LWLockInitialize(&shared->lock, LWTRANCHE_PARALLEL_HASH_JOIN);
+		shared->maximum_stripe_number = 0;
+		shared->estimated_stripe_size = 0;
 
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
 		 */
 		BarrierInit(&shared->batch_barrier, 0);
+		BarrierInit(&shared->stripe_barrier, 0);
+
+		/* Batch 0 doesn't need to be loaded. */
 		if (i == 0)
 		{
-			/* Batch 0 doesn't need to be loaded. */
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_STRIPING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
+
+			BarrierAttach(&shared->stripe_barrier);
+			while (BarrierPhase(&shared->stripe_barrier) < PHJ_STRIPE_PROBING)
+				BarrierArriveAndWait(&shared->stripe_barrier, 0);
+			BarrierDetach(&shared->stripe_barrier);
 		}
 
 		/* Initialize accessor state.  All members were zero-initialized. */
@@ -2985,7 +3149,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2995,10 +3159,13 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->sbfileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3047,8 +3214,8 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	 * It's possible for a backend to start up very late so that the whole
 	 * join is finished and the shm state for tracking batches has already
 	 * been freed by ExecHashTableDetach().  In that case we'll just leave
-	 * hashtable->batches as NULL so that ExecParallelHashJoinNewBatch() gives
-	 * up early.
+	 * hashtable->batches as NULL so that ExecParallelHashJoinAdvanceBatch()
+	 * gives up early.
 	 */
 	if (!DsaPointerIsValid(pstate->batches))
 		return;
@@ -3070,6 +3237,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
@@ -3083,6 +3251,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->sbfileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3149,6 +3318,7 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 				dsa_free(hashtable->area, batch->buckets);
 				batch->buckets = InvalidDsaPointer;
 			}
+			sb_end_read(hashtable->batches[curbatch].sba);
 		}
 
 		/*
@@ -3165,6 +3335,18 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 	}
 }
 
+bool
+ExecHashTableDetachStripe(HashJoinTable hashtable)
+{
+	int			curbatch = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+
+	BarrierDetach(stripe_barrier);
+	hashtable->curstripe = -1;
+	return false;
+}
+
 /*
  * Detach from all shared resources.  If we are last to detach, clean up.
  */
@@ -3350,13 +3532,35 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 	{
 		/*
 		 * We have determined that this batch would exceed the space budget if
-		 * loaded into memory.  Command all participants to help repartition.
+		 * loaded into memory.
 		 */
-		batch->shared->space_exhausted = true;
-		pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
-		LWLockRelease(&pstate->lock);
-
-		return false;
+		/* TODO: the nested lock is a deadlock waiting to happen. */
+		LWLockAcquire(&batch->shared->lock, LW_EXCLUSIVE);
+		if (!batch->shared->hashloop_fallback)
+		{
+			/*
+			 * This batch is not marked to fall back so command all
+			 * participants to help repartition.
+			 */
+			batch->shared->space_exhausted = true;
+			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
+			LWLockRelease(&batch->shared->lock);
+			LWLockRelease(&pstate->lock);
+			return false;
+		}
+		else if (batch->shared->estimated_stripe_size + want +
+				 HASH_CHUNK_HEADER_SIZE > pstate->space_allowed)
+		{
+			/*
+			 * This batch is marked to fall back and the current (last) stripe
+			 * does not have enough space to handle the request so we must
+			 * increment the number of stripes in the batch and reset the size
+			 * of its new last stripe.
+			 */
+			batch->shared->maximum_stripe_number++;
+			batch->shared->estimated_stripe_size = 0;
+		}
+		LWLockRelease(&batch->shared->lock);
 	}
 
 	batch->at_least_one_chunk = true;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index cc8edacdd0..516067f176 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -126,7 +126,7 @@
 #define HJ_SCAN_BUCKET			3
 #define HJ_FILL_OUTER_TUPLE		4
 #define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_NEED_NEW_STRIPE      6
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +143,91 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
+static int	ExecHashJoinLoadStripe(HashJoinState *hjstate);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+static bool ExecParallelHashJoinLoadStripe(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool checkbit(HashJoinState *hjstate);
+static void set_match_bit(HashJoinState *hjstate);
 
+static pg_attribute_always_inline bool
+			IsHashloopFallback(HashJoinTable hashtable);
+
+#define UINT_BITS (sizeof(unsigned int) * CHAR_BIT)
+
+static void
+set_match_bit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	BufFile    *statusFile = hashtable->hashloop_fallback[hashtable->curbatch];
+	int			tupindex = hjstate->hj_CurNumOuterTuples - 1;
+	size_t		unit_size = sizeof(hjstate->hj_CurOuterMatchStatus);
+	off_t		offset = tupindex / UINT_BITS * unit_size;
+
+	int			fileno;
+	off_t		cursor;
+
+	BufFileTell(statusFile, &fileno, &cursor);
+
+	/* Extend the statusFile if this is stripe zero. */
+	if (hashtable->curstripe == 0)
+	{
+		for (; cursor < offset + unit_size; cursor += unit_size)
+		{
+			hjstate->hj_CurOuterMatchStatus = 0;
+			BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+		}
+	}
+
+	if (cursor != offset)
+		BufFileSeek(statusFile, 0, offset, SEEK_SET);
+
+	BufFileRead(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+	BufFileSeek(statusFile, 0, -unit_size, SEEK_CUR);
+
+	hjstate->hj_CurOuterMatchStatus |= 1U << tupindex % UINT_BITS;
+	BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+}
+
+/* return true if bit is set and false if not */
+static bool
+checkbit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *outer_match_statuses;
+
+	int			bitno = hjstate->hj_EmitOuterTupleId % UINT_BITS;
+
+	hjstate->hj_EmitOuterTupleId++;
+	outer_match_statuses = hjstate->hj_HashTable->hashloop_fallback[curbatch];
+
+	/*
+	 * if current chunk of bitmap is exhausted, read next chunk of bitmap from
+	 * outer_match_status_file
+	 */
+	if (bitno == 0)
+		BufFileRead(outer_match_statuses, &hjstate->hj_CurOuterMatchStatus,
+					sizeof(hjstate->hj_CurOuterMatchStatus));
+
+	/*
+	 * check if current tuple's match bit is set in outer match status file
+	 */
+	return hjstate->hj_CurOuterMatchStatus & (1U << bitno);
+}
+
+static bool
+IsHashloopFallback(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state)
+		return hashtable->batches[hashtable->curbatch].shared->hashloop_fallback;
+
+	if (!hashtable->hashloop_fallback)
+		return false;
+
+	return hashtable->hashloop_fallback[hashtable->curbatch];
+}
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -290,6 +371,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				hashNode->hashtable = hashtable;
 				(void) MultiExecProcNode((PlanState *) hashNode);
 
+				/*
+				 * After building the hashtable, stripe 0 of batch 0 will have
+				 * been loaded.
+				 */
+				hashtable->curstripe = 0;
+
 				/*
 				 * If the inner relation is completely empty, and we're not
 				 * doing a left outer join, we can quit without scanning the
@@ -333,12 +420,11 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 					/* Each backend should now select a batch to work on. */
 					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
 
-					continue;
+					if (!ExecParallelHashJoinNewBatch(node))
+						return NULL;
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
 				/* FALL THRU */
 
@@ -365,12 +451,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
 					}
 					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
+
+				/*
+				 * Don't reset hj_MatchedOuter after the first stripe as it
+				 * would cancel out whatever we found before
+				 */
+				if (node->hj_HashTable->curstripe == 0)
+					node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -386,9 +478,15 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * The tuple might not belong to the current batch (where
 				 * "current batch" includes the skew buckets if any).
+				 *
+				 * This should only be done once per tuple per batch. If a
+				 * batch "falls back", its inner side will be split into
+				 * stripes. Any displaced outer tuples should only be
+				 * relocated while probing the first stripe of the inner side.
 				 */
 				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO &&
+					node->hj_HashTable->curstripe == 0)
 				{
 					bool		shouldFree;
 					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
@@ -410,6 +508,13 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				/*
+				 * While probing the phantom stripe, don't increment
+				 * hj_CurNumOuterTuples or extend the bitmap
+				 */
+				if (!parallel && hashtable->curstripe != -2)
+					node->hj_CurNumOuterTuples++;
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -455,6 +560,14 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				{
 					node->hj_MatchedOuter = true;
 
+					if (HJ_FILL_OUTER(node) && IsHashloopFallback(hashtable))
+					{
+						if (parallel)
+							sb_setbit(hashtable->batches[hashtable->curbatch].sba, econtext->ecxt_outertuple->tts_tuplenum);
+						else
+							set_match_bit(node);
+					}
+
 					if (parallel)
 					{
 						/*
@@ -508,6 +621,22 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+				if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(node))
+				{
+					if (hashtable->curstripe != -2)
+						continue;
+
+					if (parallel)
+					{
+						ParallelHashJoinBatchAccessor *accessor =
+						&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+						node->hj_MatchedOuter = sb_checkbit(accessor->sba, econtext->ecxt_outertuple->tts_tuplenum);
+					}
+					else
+						node->hj_MatchedOuter = checkbit(node);
+				}
+
 				if (!node->hj_MatchedOuter &&
 					HJ_FILL_OUTER(node))
 				{
@@ -534,7 +663,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
@@ -550,19 +679,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					InstrCountFiltered2(node, 1);
 				break;
 
-			case HJ_NEED_NEW_BATCH:
+			case HJ_NEED_NEW_STRIPE:
 
 				/*
-				 * Try to advance to next batch.  Done if there are no more.
+				 * Try to advance to next stripe. Then try to advance to the
+				 * next batch if there are no more stripes in this batch. Done
+				 * if there are no more batches.
 				 */
 				if (parallel)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
+					if (!ExecParallelHashJoinLoadStripe(node) &&
+						!ExecParallelHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-aware join */
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
+					if (!ExecHashJoinLoadStripe(node) &&
+						!ExecHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-oblivious join */
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
@@ -751,6 +884,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_JoinState = HJ_BUILD_HASHTABLE;
 	hjstate->hj_MatchedOuter = false;
 	hjstate->hj_OuterNotEmpty = false;
+	hjstate->hj_CurNumOuterTuples = 0;
+	hjstate->hj_CurOuterMatchStatus = 0;
 
 	return hjstate;
 }
@@ -917,15 +1052,24 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	}
 	else if (curbatch < hashtable->nbatch)
 	{
+		tupleMetadata metadata;
 		MinimalTuple tuple;
 
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+									   &metadata);
+		*hashvalue = metadata.hashvalue;
+
 		if (tuple != NULL)
 		{
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+
+			/*
+			 * TODO: should we use tupleid instead of position in the serial
+			 * case too?
+			 */
+			hjstate->hj_OuterTupleSlot->tts_tuplenum = metadata.tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
@@ -949,24 +1093,37 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
+	BufFile    *innerFile = NULL;
+	BufFile    *outerFile = NULL;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
 
-	if (curbatch > 0)
+	/*
+	 * We no longer need the previous outer batch file; close it right away to
+	 * free disk space.
+	 */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
-		/*
-		 * We no longer need the previous outer batch file; close it right
-		 * away to free disk space.
-		 */
-		if (hashtable->outerBatchFile[curbatch])
-			BufFileClose(hashtable->outerBatchFile[curbatch]);
+		BufFileClose(hashtable->outerBatchFile[curbatch]);
 		hashtable->outerBatchFile[curbatch] = NULL;
 	}
-	else						/* we just finished the first batch */
+	if (IsHashloopFallback(hashtable))
+	{
+		BufFileClose(hashtable->hashloop_fallback[curbatch]);
+		hashtable->hashloop_fallback[curbatch] = NULL;
+	}
+
+	/*
+	 * We are surely done with the inner batch file now
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+	{
+		BufFileClose(hashtable->innerBatchFile[curbatch]);
+		hashtable->innerBatchFile[curbatch] = NULL;
+	}
+
+	if (curbatch == 0)			/* we just finished the first batch */
 	{
 		/*
 		 * Reset some of the skew optimization state variables, since we no
@@ -1030,45 +1187,68 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	hashtable->curstripe = -1;
+	hjstate->hj_CurNumOuterTuples = 0;
 
-	/*
-	 * Reload the hash table with the new inner batch (which could be empty)
-	 */
-	ExecHashTableReset(hashtable);
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+		innerFile = hashtable->innerBatchFile[curbatch];
+
+	if (innerFile && BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	/* Need to rewind outer when this is the first stripe of a new batch */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
+		outerFile = hashtable->outerBatchFile[curbatch];
+
+	if (outerFile && BufFileSeek(outerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	ExecHashJoinLoadStripe(hjstate);
+	return true;
+}
 
-	innerFile = hashtable->innerBatchFile[curbatch];
+static inline void
+InstrIncrBatchStripes(List *fallback_batches_stats, int curbatch)
+{
+	ListCell   *lc;
 
-	if (innerFile != NULL)
+	foreach(lc, fallback_batches_stats)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
+		FallbackBatchStats *fallback_batch_stats = lfirst(lc);
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
-												 &hashvalue,
-												 hjstate->hj_HashTupleSlot)))
+		if (fallback_batch_stats->batchno == curbatch)
 		{
-			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
-			 */
-			ExecHashTableInsert(hashtable, slot, hashvalue);
+			fallback_batch_stats->numstripes++;
+			break;
 		}
-
-		/*
-		 * after we build the hash table, the inner batch file is no longer
-		 * needed
-		 */
-		BufFileClose(innerFile);
-		hashtable->innerBatchFile[curbatch] = NULL;
 	}
+}
+
+/*
+ * Returns false when the inner batch file is exhausted
+ */
+static int
+ExecHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+	bool		loaded_inner = false;
+
+	if (hashtable->curstripe == -2)
+		return false;
 
 	/*
 	 * Rewind outer batch file (if present), so that we can start reading it.
+	 * TODO: This is only necessary if this is not the first stripe of the
+	 * batch
 	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
 		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
 			ereport(ERROR,
@@ -1076,9 +1256,78 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
 
-	return true;
+	hashtable->curstripe++;
+
+	if (!hashtable->innerBatchFile || !hashtable->innerBatchFile[curbatch])
+		return false;
+
+	/*
+	 * Reload the hash table with the new inner stripe
+	 */
+	ExecHashTableReset(hashtable);
+
+	while ((slot = ExecHashJoinGetSavedTuple(hjstate,
+											 hashtable->innerBatchFile[curbatch],
+											 &hashvalue,
+											 hjstate->hj_HashTupleSlot)))
+	{
+		/*
+		 * NOTE: some tuples may be sent to future batches.  Also, it is
+		 * possible for hashtable->nbatch to be increased here!
+		 */
+		uint32		hashTupleSize;
+		/*
+		 * TODO: wouldn't it be cool if this returned the size of the tuple
+		 * inserted
+		 */
+		ExecHashTableInsert(hashtable, slot, hashvalue);
+		loaded_inner = true;
+
+		if (!IsHashloopFallback(hashtable))
+			continue;
+
+		hashTupleSize = slot->tts_ops->get_minimal_tuple(slot)->t_len + HJTUPLE_OVERHEAD;
+
+		if (hashtable->spaceUsed + hashTupleSize +
+			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
+			> hashtable->spaceAllowed)
+			break;
+	}
+
+	/*
+	 * if we didn't load anything and it is a FOJ/LOJ fallback batch, we will
+	 * transition to emit unmatched outer tuples next. we want to know how
+	 * many tuples were in the batch in that case, so don't zero it out then
+	 */
+
+	/*
+	 * if we loaded anything into the hashtable or it is the phantom stripe,
+	 * must proceed to probing
+	 */
+	if (loaded_inner)
+	{
+		hjstate->hj_CurNumOuterTuples = 0;
+		InstrIncrBatchStripes(hashtable->fallback_batches_stats, curbatch);
+		return true;
+	}
+
+	if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(hjstate))
+	{
+		/*
+		 * if we didn't load anything and it is a fallback batch, we will
+		 * prepare to emit outer tuples during the phantom stripe probing
+		 */
+		hashtable->curstripe = -2;
+		hjstate->hj_EmitOuterTupleId = 0;
+		hjstate->hj_CurOuterMatchStatus = 0;
+		BufFileSeek(hashtable->hashloop_fallback[curbatch], 0, 0, SEEK_SET);
+		BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET);
+		return true;
+	}
+	return false;
 }
 
+
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
  * false if there are no more batches.
@@ -1101,10 +1350,18 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
 	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_STRIPING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_STRIPING
+	 * case). The PHJ_BATCH_STRIPING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
 	 */
+
 	if (hashtable->curbatch >= 0)
 	{
+		if (IsHashloopFallback(hashtable))
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
 		hashtable->batches[hashtable->curbatch].done = true;
 		ExecHashTableDetachBatch(hashtable);
 	}
@@ -1119,13 +1376,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		hashtable->nbatch;
 	do
 	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
 		if (!hashtable->batches[batchno].done)
 		{
-			SharedTuplestoreAccessor *inner_tuples;
 			Barrier    *batch_barrier =
 			&hashtable->batches[batchno].shared->batch_barrier;
 
@@ -1136,7 +1388,15 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
 											 WAIT_EVENT_HASH_BATCH_ELECTING))
+					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
+
+						/*
+						 * one worker needs to 0 out the read_pages of all the
+						 * participants in the new batch
+						 */
+						sts_reinitialize(hashtable->batches[batchno].inner_tuples);
+					}
 					/* Fall through. */
 
 				case PHJ_BATCH_ALLOCATING:
@@ -1145,40 +1405,15 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 										 WAIT_EVENT_HASH_BATCH_ALLOCATING);
 					/* Fall through. */
 
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOADING);
-					/* Fall through. */
-
-				case PHJ_BATCH_PROBING:
+				case PHJ_BATCH_STRIPING:
 
-					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
-					 */
 					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
+					sts_begin_parallel_scan(hashtable->batches[batchno].inner_tuples);
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_initialize_accessor(hashtable->batches[hashtable->curbatch].sba,
+											   sts_get_tuplenum(hashtable->batches[hashtable->curbatch].outer_tuples));
+					hashtable->curstripe = -1;
+					ExecParallelHashJoinLoadStripe(hjstate);
 					return true;
 
 				case PHJ_BATCH_DONE:
@@ -1203,6 +1438,220 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	return false;
 }
 
+
+
+/*
+ * Returns true if ready to probe and false if the inner is exhausted
+ * (there are no more stripes)
+ */
+bool
+ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			batchno = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+	ParallelHashJoinBatchAccessor *accessor;
+	dsa_pointer_atomic *buckets;
+
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	if (hashtable->curstripe >= 0)
+	{
+		BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_PROBING);
+	}
+	else if (hashtable->curstripe == -1)
+	{
+		int			phase = BarrierAttach(stripe_barrier);
+
+		/*
+		 * If a worker enters this phase machine on a stripe number greater
+		 * than the batch's maximum stripe number, then: 1) The batch is done,
+		 * or 2) The batch is on the phantom stripe that's used for hashloop
+		 * fallback Either way the worker can't contribute so just detach and
+		 * move on.
+		 */
+		if (PHJ_STRIPE_NUMBER(phase) > batch->maximum_stripe_number)
+			return ExecHashTableDetachStripe(hashtable);
+
+		hashtable->curstripe = PHJ_STRIPE_NUMBER(phase);
+	}
+	else if (hashtable->curstripe == -2)
+	{
+		sts_end_parallel_scan(outer_tuples);
+		sb_end_read(hashtable->batches[batchno].sba);
+		return ExecHashTableDetachStripe(hashtable);
+	}
+
+	/*
+	 * The outer side is exhausted and either 1) the current stripe of the
+	 * inner side is exhausted and it is time to advance the stripe 2) the
+	 * last stripe of the inner side is exhausted and it is time to advance
+	 * the batch
+	 */
+	for (;;)
+	{
+		int			phase = BarrierPhase(stripe_barrier);
+
+		switch (PHJ_STRIPE_PHASE(phase))
+		{
+			case PHJ_STRIPE_ELECTING:
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_ELECTING))
+				{
+					sts_reinitialize(outer_tuples);
+
+					/*
+					 * set the rewound flag back to false to prepare for the
+					 * next stripe
+					 */
+					sts_reset_rewound(inner_tuples);
+				}
+
+				/* Fall through. */
+
+			case PHJ_STRIPE_RESETTING:
+				/* TODO: not needed for phantom stripe */
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_RESETTING);
+
+			case PHJ_STRIPE_LOADING:
+				{
+					MinimalTuple tuple;
+					tupleMetadata metadata;
+
+					/*
+					 * Start (or join in) loading the next stripe of inner
+					 * tuples.
+					 */
+
+					/*
+					 * I'm afraid there potential issue if a worker joins in
+					 * this phase and doesn't do the actions and resetting of
+					 * variables in sts_resume_parallel_scan. that is, if it
+					 * doesn't reset start_page and read_next_page in between
+					 * stripes. For now, call it. However, I think it might be
+					 * able to be removed.
+					 */
+
+					/*
+					 * TODO: sts_resume_parallel_scan() is overkill for stripe
+					 * 0 of each batch
+					 */
+					sts_resume_parallel_scan(inner_tuples);
+
+					while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+					{
+						/* The tuple is from a previous stripe. Skip it */
+						if (metadata.stripe < PHJ_STRIPE_NUMBER(phase))
+							continue;
+
+						/*
+						 * tuple from future. time to back out read_page. end
+						 * of stripe
+						 */
+						if (metadata.stripe > PHJ_STRIPE_NUMBER(phase))
+						{
+							sts_parallel_scan_rewind(inner_tuples);
+							continue;
+						}
+
+						ExecForceStoreMinimalTuple(tuple, hjstate->hj_HashTupleSlot, false);
+						ExecParallelHashTableInsertCurrentBatch(
+																hashtable,
+																hjstate->hj_HashTupleSlot,
+																metadata.hashvalue);
+					}
+					BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOADING);
+					/* Fall through. */
+				}
+
+			case PHJ_STRIPE_PROBING:
+
+				/*
+				 * do this again here in case a worker began the scan and then
+				 * entered after loading before probing
+				 */
+				sts_end_parallel_scan(inner_tuples);
+				sts_begin_parallel_scan(outer_tuples);
+				return true;
+
+			case PHJ_STRIPE_DONE:
+
+				if (PHJ_STRIPE_NUMBER(phase) >= batch->maximum_stripe_number)
+				{
+					/*
+					 * Handle the phantom stripe case.
+					 */
+					if (batch->hashloop_fallback && HJ_FILL_OUTER(hjstate))
+						goto fallback_stripe;
+
+					/* Return if this is the last stripe */
+					return ExecHashTableDetachStripe(hashtable);
+				}
+
+				/* this, effectively, increments the stripe number */
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOADING))
+				{
+					/*
+					 * reset inner's hashtable and recycle the existing bucket array.
+					 */
+					buckets = (dsa_pointer_atomic *)
+						dsa_get_address(hashtable->area, batch->buckets);
+
+					for (size_t i = 0; i < hashtable->nbuckets; ++i)
+						dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+				}
+
+				hashtable->curstripe++;
+				continue;
+
+			default:
+				elog(ERROR, "unexpected stripe phase %d. pid %i. batch %i.", BarrierPhase(stripe_barrier), MyProcPid, batchno);
+		}
+	}
+
+fallback_stripe:
+	accessor = &hashtable->batches[hashtable->curbatch];
+	sb_end_write(accessor->sba);
+
+	/* Ensure that only a single worker is attached to the barrier */
+	if (!BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOADING))
+		return ExecHashTableDetachStripe(hashtable);
+
+
+	/* No one except the last worker will run this code */
+	hashtable->curstripe = -2;
+
+	/*
+	 * reset inner's hashtable and recycle the existing bucket array.
+	 */
+	buckets = (dsa_pointer_atomic *)
+		dsa_get_address(hashtable->area, batch->buckets);
+
+	for (size_t i = 0; i < hashtable->nbuckets; ++i)
+		dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+
+	/*
+	 * If all workers (including this one) have finished probing the batch,
+	 * one worker is elected to Loop through the outer match status files from
+	 * all workers that were attached to this batch Combine them into one
+	 * bitmap Use the bitmap, loop through the outer batch file again, and
+	 * emit unmatched tuples All workers will detach from the batch barrier
+	 * and the last worker will clean up the hashtable. All workers except the
+	 * last worker will end their scans of the outer and inner side. The last
+	 * worker will end its scan of the inner side
+	 */
+
+	sb_combine(accessor->sba);
+	sts_reinitialize(outer_tuples);
+
+	sts_begin_parallel_scan(outer_tuples);
+
+	return true;
+}
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1372,6 +1821,9 @@ ExecReScanHashJoin(HashJoinState *node)
 	node->hj_MatchedOuter = false;
 	node->hj_FirstOuterTupleSlot = NULL;
 
+	node->hj_CurNumOuterTuples = 0;
+	node->hj_CurOuterMatchStatus = 0;
+
 	/*
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
@@ -1402,7 +1854,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	TupleTableSlot *slot;
-	uint32		hashvalue;
 	int			i;
 
 	Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1410,6 +1861,8 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
+
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1418,17 +1871,23 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 								 hjstate->hj_OuterHashKeys,
 								 true,	/* outer tuple */
 								 HJ_FILL_OUTER(hjstate),
-								 &hashvalue))
+								 &metadata.hashvalue))
 		{
 			int			batchno;
 			int			bucketno;
 			bool		shouldFree;
+			SharedTuplestoreAccessor *accessor;
+
 			MinimalTuple mintup = ExecFetchSlotMinimalTuple(slot, &shouldFree);
 
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
-			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+			accessor = hashtable->batches[batchno].outer_tuples;
+
+			/* cannot count on deterministic order of tupleids */
+			metadata.tupleid = sts_increment_ntuples(accessor);
+
+			sts_puttuple(hashtable->batches[batchno].outer_tuples, &metadata.hashvalue, mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
@@ -1494,6 +1953,7 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 
 	/* Set up the space we'll use for shared temporary files. */
 	SharedFileSetInit(&pstate->fileset, pcxt->seg);
+	SharedFileSetInit(&pstate->sbfileset, pcxt->seg);
 
 	/* Initialize the shared state in the hash node. */
 	hashNode = (HashState *) innerPlanState(state);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..02ca9654ec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3780,8 +3780,17 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_ELECTING:
 			event_name = "Hash/Batch/Electing";
 			break;
-		case WAIT_EVENT_HASH_BATCH_LOADING:
-			event_name = "Hash/Batch/Loading";
+		case WAIT_EVENT_HASH_STRIPE_ELECTING:
+			event_name = "Hash/Stripe/Electing";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_RESETTING:
+			event_name = "Hash/Stripe/RESETTING";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_LOADING:
+			event_name = "Hash/Stripe/Loading";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_PROBING:
+			event_name = "Hash/Stripe/Probing";
 			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATING:
 			event_name = "Hash/Build/Allocating";
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261..f11fe85aeb 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	logtape.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedbits.c b/src/backend/utils/sort/sharedbits.c
new file mode 100644
index 0000000000..37df04844e
--- /dev/null
+++ b/src/backend/utils/sort/sharedbits.c
@@ -0,0 +1,285 @@
+#include "postgres.h"
+#include "storage/buffile.h"
+#include "utils/sharedbits.h"
+
+/*
+ * TODO: put a comment about not currently supporting parallel scan of the SharedBits
+ * To support parallel scan, need to introduce many more mechanisms
+ */
+
+/* Per-participant shared state */
+struct SharedBitsParticipant
+{
+	bool		present;
+	bool		writing;
+};
+
+/* Shared control object */
+struct SharedBits
+{
+	int			nparticipants;	/* Number of participants that can write. */
+	int64		nbits;
+	char		name[NAMEDATALEN];	/* A name for this bitstore. */
+
+	SharedBitsParticipant participants[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* backend-local state */
+struct SharedBitsAccessor
+{
+	int			participant;
+	SharedBits *bits;
+	SharedFileSet *fileset;
+	BufFile    *write_file;
+	BufFile    *combined;
+};
+
+SharedBitsAccessor *
+sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset)
+{
+	SharedBitsAccessor *accessor = palloc0(sizeof(SharedBitsAccessor));
+
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+SharedBitsAccessor *
+sb_initialize(SharedBits *sbits,
+			  int participants,
+			  int my_participant_number,
+			  SharedFileSet *fileset,
+			  char *name)
+{
+	SharedBitsAccessor *accessor;
+
+	sbits->nparticipants = participants;
+	strcpy(sbits->name, name);
+	sbits->nbits = 0;			/* TODO: maybe delete this */
+
+	accessor = palloc0(sizeof(SharedBitsAccessor));
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+/*  TODO: is "initialize_accessor" a clear enough API for this? (making the file)? */
+void
+sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits)
+{
+	char		name[MAXPGPATH];
+	uint32		num_to_write;
+
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, accessor->participant);
+
+	accessor->write_file =
+		BufFileCreateShared(accessor->fileset, name);
+
+	accessor->bits->participants[accessor->participant].present = true;
+	/* TODO: check this math. tuplenumber will be too high? */
+	num_to_write = nbits / 8 + 1;
+
+	/*
+	 * TODO: add tests that could exercise a problem with junk being written
+	 * to bitmap
+	 */
+
+	/*
+	 * TODO: is there a better way to write the bytes to the file without
+	 * calling BufFileWrite() like this? palloc()ing an undetermined number of
+	 * bytes feels like it is against the spirit of this patch to begin with,
+	 * but the many function calls seem expensive
+	 */
+	for (int i = 0; i < num_to_write; i++)
+	{
+		unsigned char byteToWrite = 0;
+
+		BufFileWrite(accessor->write_file, &byteToWrite, 1);
+	}
+
+	if (BufFileSeek(accessor->write_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+size_t
+sb_estimate(int participants)
+{
+	return offsetof(SharedBits, participants) + participants * sizeof(SharedBitsParticipant);
+}
+
+
+void
+sb_setbit(SharedBitsAccessor *accessor, uint64 bit)
+{
+	SharedBitsParticipant *const participant =
+	&accessor->bits->participants[accessor->participant];
+
+	/* TODO: use an unsigned int instead of a byte */
+	unsigned char current_outer_byte;
+
+	Assert(accessor->write_file);
+
+	if (!participant->writing)
+	{
+		participant->writing = true;
+	}
+
+	BufFileSeek(accessor->write_file, 0, bit / 8, SEEK_SET);
+	BufFileRead(accessor->write_file, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (bit % 8);
+
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_CUR);
+	BufFileWrite(accessor->write_file, &current_outer_byte, 1);
+}
+
+bool
+sb_checkbit(SharedBitsAccessor *accessor, uint32 n)
+{
+	bool		match;
+	uint32		bytenum = n / 8;
+	unsigned char bit = n % 8;
+	unsigned char byte_to_check = 0;
+
+	Assert(accessor->combined);
+
+	/* seek to byte to check */
+	if (BufFileSeek(accessor->combined,
+					0,
+					bytenum,
+					SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not rewind shared outer temporary file: %m")));
+	/* read byte containing ntuple bit */
+	if (BufFileRead(accessor->combined, &byte_to_check, 1) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	/* if bit is set */
+	match = ((byte_to_check) >> bit) & 1;
+
+	return match;
+}
+
+BufFile *
+sb_combine(SharedBitsAccessor *accessor)
+{
+	/*
+	 * TODO: this tries to close an outer match status file for each
+	 * participant in the tuplestore. technically, only participants in the
+	 * barrier could have outer match status files, however, all but one
+	 * participant continue on and detach from the barrier so we won't have a
+	 * reliable way to close only files for those attached to the barrier
+	 */
+	BufFile   **statuses;
+	BufFile    *combined_bitmap_file;
+	int			statuses_length;
+
+	int			nbparticipants = 0;
+
+	for (int l = 0; l < accessor->bits->nparticipants; l++)
+	{
+		SharedBitsParticipant participant = accessor->bits->participants[l];
+
+		if (participant.present)
+		{
+			Assert(!participant.writing);
+			nbparticipants++;
+		}
+	}
+	statuses = palloc(sizeof(BufFile *) * nbparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	statuses_length = 0;
+
+	for (int i = 0; i < accessor->bits->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+		BufFile    *file;
+
+		/* TODO: make a function that will do this */
+		snprintf(bitmap_filename, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, i);
+
+		if (!accessor->bits->participants[i].present)
+			continue;
+		file = BufFileOpenShared(accessor->fileset, bitmap_filename);
+
+		Assert(file);
+
+		statuses[statuses_length++] = file;
+	}
+
+	combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)	/* make it while not EOF */
+	{
+		/*
+		 * TODO: make this use an unsigned int instead of a byte so it isn't
+		 * so slow
+		 */
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	accessor->combined = combined_bitmap_file;
+	return combined_bitmap_file;
+}
+
+void
+sb_end_write(SharedBitsAccessor *sba)
+{
+	SharedBitsParticipant
+			   *const participant = &sba->bits->participants[sba->participant];
+
+	participant->writing = false;
+
+	/*
+	 * TODO: this should not be needed if flow is correct. need to fix that
+	 * and get rid of this check
+	 */
+	if (sba->write_file)
+		BufFileClose(sba->write_file);
+	sba->write_file = NULL;
+}
+
+void
+sb_end_read(SharedBitsAccessor *accessor)
+{
+	if (accessor->combined == NULL)
+		return;
+
+	BufFileClose(accessor->combined);
+	accessor->combined = NULL;
+}
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index c3ab494a45..0e3b3de2b6 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -52,6 +52,7 @@ typedef struct SharedTuplestoreParticipant
 {
 	LWLock		lock;
 	BlockNumber read_page;		/* Page number for next read. */
+	bool		rewound;
 	BlockNumber npages;			/* Number of pages written. */
 	bool		writing;		/* Used only for assertions. */
 } SharedTuplestoreParticipant;
@@ -60,6 +61,7 @@ typedef struct SharedTuplestoreParticipant
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;	/* Number of tuples in this tuplestore. */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -85,6 +87,8 @@ struct SharedTuplestoreAccessor
 	char	   *read_buffer;	/* A buffer for loading tuples. */
 	size_t		read_buffer_size;
 	BlockNumber read_next_page; /* Lowest block we'll consider reading. */
+	BlockNumber start_page;		/* page to reset p->read_page to if back out
+								 * required */
 
 	/* State for writing. */
 	SharedTuplestoreChunk *write_chunk; /* Buffer for writing. */
@@ -137,6 +141,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -158,6 +163,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 		LWLockInitialize(&sts->participants[i].lock,
 						 LWTRANCHE_SHARED_TUPLESTORE);
 		sts->participants[i].read_page = 0;
+		sts->participants[i].rewound = false;
 		sts->participants[i].writing = false;
 	}
 
@@ -277,6 +283,45 @@ sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor)
 	accessor->read_participant = accessor->participant;
 	accessor->read_file = NULL;
 	accessor->read_next_page = 0;
+	accessor->start_page = 0;
+}
+
+void
+sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor)
+{
+	int			i PG_USED_FOR_ASSERTS_ONLY;
+	SharedTuplestoreParticipant *p;
+
+	/* End any existing scan that was in progress. */
+	sts_end_parallel_scan(accessor);
+
+	/*
+	 * Any backend that might have written into this shared tuplestore must
+	 * have called sts_end_write(), so that all buffers are flushed and the
+	 * files have stopped growing.
+	 */
+	for (i = 0; i < accessor->sts->nparticipants; ++i)
+		Assert(!accessor->sts->participants[i].writing);
+
+	/*
+	 * We will start out reading the file that THIS backend wrote.  There may
+	 * be some caching locality advantage to that.
+	 */
+
+	/*
+	 * TODO: does this still apply in the multi-stripe case? It seems like if
+	 * a participant file is exhausted for the current stripe it might be
+	 * better to remember that
+	 */
+	accessor->read_participant = accessor->participant;
+	accessor->read_file = NULL;
+	p = &accessor->sts->participants[accessor->read_participant];
+
+	/* TODO: find a better solution than this for resuming the parallel scan */
+	LWLockAcquire(&p->lock, LW_SHARED);
+	accessor->start_page = p->read_page;
+	LWLockRelease(&p->lock);
+	accessor->read_next_page = 0;
 }
 
 /*
@@ -295,6 +340,7 @@ sts_end_parallel_scan(SharedTuplestoreAccessor *accessor)
 		BufFileClose(accessor->read_file);
 		accessor->read_file = NULL;
 	}
+	accessor->start_page = 0;
 }
 
 /*
@@ -531,7 +577,13 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	for (;;)
 	{
 		/* Can we read more tuples from the current chunk? */
-		if (accessor->read_ntuples < accessor->read_ntuples_available)
+		/*
+		 * Added a check for accessor->read_file being present here, as it
+		 * became relevant for adaptive hashjoin. Not sure if this has other
+		 * consequences for correctness
+		 */
+
+		if (accessor->read_ntuples < accessor->read_ntuples_available && accessor->read_file)
 			return sts_read_tuple(accessor, meta_data);
 
 		/* Find the location of a new chunk to read. */
@@ -541,7 +593,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 		/* We can skip directly past overflow pages we know about. */
 		if (p->read_page < accessor->read_next_page)
 			p->read_page = accessor->read_next_page;
-		eof = p->read_page >= p->npages;
+		eof = p->read_page >= p->npages || p->rewound;
 		if (!eof)
 		{
 			/* Claim the next chunk. */
@@ -549,9 +601,22 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			/* Advance the read head for the next reader. */
 			p->read_page += STS_CHUNK_PAGES;
 			accessor->read_next_page = p->read_page;
+
+			/*
+			 * initialize start_page to the read_page this participant will
+			 * start reading from
+			 */
+			accessor->start_page = read_page;
 		}
 		LWLockRelease(&p->lock);
 
+		if (!eof)
+		{
+			char		name[MAXPGPATH];
+
+			sts_filename(name, accessor, accessor->read_participant);
+		}
+
 		if (!eof)
 		{
 			SharedTuplestoreChunk chunk_header;
@@ -613,6 +678,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			if (accessor->read_participant == accessor->participant)
 				break;
 			accessor->read_next_page = 0;
+			accessor->start_page = 0;
 
 			/* Go around again, so we can get a chunk from this file. */
 		}
@@ -621,6 +687,48 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+void
+sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor)
+{
+	SharedTuplestoreParticipant *p =
+	&accessor->sts->participants[accessor->read_participant];
+
+	/*
+	 * Only set the read_page back to the start of the sts_chunk this worker
+	 * was reading if some other worker has not already done so. It could be
+	 * the case that this worker saw a tuple from a future stripe and another
+	 * worker did also in its sts_chunk and it already set read_page to its
+	 * start_page If so, we want to set read_page to the lowest value to
+	 * ensure that we read all tuples from the stripe (don't miss tuples)
+	 */
+	LWLockAcquire(&p->lock, LW_EXCLUSIVE);
+	p->read_page = Min(p->read_page, accessor->start_page);
+	p->rewound = true;
+	LWLockRelease(&p->lock);
+
+	accessor->read_ntuples_available = 0;
+	accessor->read_next_page = 0;
+}
+
+void
+sts_reset_rewound(SharedTuplestoreAccessor *accessor)
+{
+	for (int i = 0; i < accessor->sts->nparticipants; ++i)
+		accessor->sts->participants[i].rewound = false;
+}
+
+uint32
+sts_increment_ntuples(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ba661d32a6..0ba9d856c8 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -46,6 +46,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		settings;		/* print modified settings */
+	bool		usage;			/* print memory usage */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
 	int			indent;			/* current indentation level */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 79b634e8ed..9ffcd84806 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -152,6 +153,7 @@ typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
+	Barrier		stripe_barrier; /* synchronization for stripes */
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
@@ -160,6 +162,17 @@ typedef struct ParallelHashJoinBatch
 	size_t		old_ntuples;	/* number of tuples before repartitioning */
 	bool		space_exhausted;
 
+	/* Adaptive HashJoin */
+
+	/*
+	 * after finishing build phase, hashloop_fallback cannot change, and does
+	 * not require a lock to read
+	 */
+	bool		hashloop_fallback;
+	int			maximum_stripe_number;
+	size_t		estimated_stripe_size;	/* size of last stripe in batch */
+	LWLock		lock;
+
 	/*
 	 * Variable-sized SharedTuplestore objects follow this struct in memory.
 	 * See the accessor macros below.
@@ -177,10 +190,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -207,6 +227,7 @@ typedef struct ParallelHashJoinBatchAccessor
 	bool		done;			/* flag to remember that a batch is done */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -251,6 +272,7 @@ typedef struct ParallelHashJoinState
 	pg_atomic_uint32 distributor;	/* counter for load balancing */
 
 	SharedFileSet fileset;		/* space for shared temporary files */
+	SharedFileSet sbfileset;
 } ParallelHashJoinState;
 
 /* The phases for building batches, used by build_barrier. */
@@ -263,9 +285,17 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
-#define PHJ_BATCH_DONE					4
+#define PHJ_BATCH_STRIPING				2
+#define PHJ_BATCH_DONE					3
+
+/* The phases for probing each stripe of each batch used with stripe barriers */
+#define PHJ_STRIPE_ELECTING				0
+#define PHJ_STRIPE_RESETTING			1
+#define PHJ_STRIPE_LOADING				2
+#define PHJ_STRIPE_PROBING				3
+#define PHJ_STRIPE_DONE				    4
+#define PHJ_STRIPE_NUMBER(n)            ((n) / 5)
+#define PHJ_STRIPE_PHASE(n)             ((n) % 5)
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
@@ -313,8 +343,6 @@ typedef struct HashJoinTableData
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
 
-	bool		growEnabled;	/* flag to shut off nbatch increases */
-
 	double		totalTuples;	/* # tuples obtained from inner plan */
 	double		partialTuples;	/* # tuples obtained from inner plan by me */
 	double		skewTuples;		/* # tuples inserted into skew tuples */
@@ -329,6 +357,13 @@ typedef struct HashJoinTableData
 	BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
 	BufFile   **outerBatchFile; /* buffered virtual temp file per batch */
 
+	/*
+	 * Adaptive hashjoin variables
+	 */
+	BufFile   **hashloop_fallback;	/* outer match status files if fall back */
+	List	   *fallback_batches_stats; /* per hashjoin batch statistics */
+	int			curstripe;		/* current stripe #; 0 on 1st pass, -2 on phantom stripe */
+
 	/*
 	 * Info about the datatype-specific hash functions for the datatypes being
 	 * hashed. These are arrays of the same length as the number of hash join
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 50d672b270..bcac88f7f3 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "nodes/pg_list.h"
 
 
 typedef struct BufferUsage
@@ -39,6 +40,12 @@ typedef struct WalUsage
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
+typedef struct FallbackBatchStats
+{
+	int			batchno;
+	int			numstripes;
+} FallbackBatchStats;
+
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
 typedef enum InstrumentOption
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 64d2ce693c..f85308738b 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -31,6 +31,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+extern bool ExecHashTableDetachStripe(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5ab..0c0d87d1d3 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tts_tuplenum;	/* a tuple id for use when ctid cannot be used */
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,6 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
+	slot->tts_tuplenum = 0;		/* TODO: should this be done elsewhere? */
 
 	return slot;
 }
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4fee043bb2..41a4133c3a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1957,6 +1957,10 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+	/* Adaptive Hashjoin variables */
+	int			hj_CurNumOuterTuples;	/* number of outer tuples in a batch */
+	unsigned int hj_CurOuterMatchStatus;
+	int			hj_EmitOuterTupleId;
 } HashJoinState;
 
 
@@ -2359,6 +2363,7 @@ typedef struct HashInstrumentation
 	int			nbatch;			/* number of batches at end of execution */
 	int			nbatch_original;	/* planned number of batches */
 	Size		space_peak;		/* peak memory usage in bytes */
+	List	   *fallback_batches_stats; /* per hashjoin batch stats */
 } HashInstrumentation;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..9ebdeeeb8a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -857,7 +857,10 @@ typedef enum
 	WAIT_EVENT_EXECUTE_GATHER,
 	WAIT_EVENT_HASH_BATCH_ALLOCATING,
 	WAIT_EVENT_HASH_BATCH_ELECTING,
-	WAIT_EVENT_HASH_BATCH_LOADING,
+	WAIT_EVENT_HASH_STRIPE_ELECTING,
+	WAIT_EVENT_HASH_STRIPE_RESETTING,
+	WAIT_EVENT_HASH_STRIPE_LOADING,
+	WAIT_EVENT_HASH_STRIPE_PROBING,
 	WAIT_EVENT_HASH_BUILD_ALLOCATING,
 	WAIT_EVENT_HASH_BUILD_ELECTING,
 	WAIT_EVENT_HASH_BUILD_HASHING_INNER,
diff --git a/src/include/utils/sharedbits.h b/src/include/utils/sharedbits.h
new file mode 100644
index 0000000000..de43279de8
--- /dev/null
+++ b/src/include/utils/sharedbits.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * sharedbits.h
+ *	  Simple mechanism for sharing bits between backends.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/sharedbits.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHAREDBITS_H
+#define SHAREDBITS_H
+
+#include "storage/sharedfileset.h"
+
+struct SharedBits;
+typedef struct SharedBits SharedBits;
+
+struct SharedBitsParticipant;
+typedef struct SharedBitsParticipant SharedBitsParticipant;
+
+struct SharedBitsAccessor;
+typedef struct SharedBitsAccessor SharedBitsAccessor;
+
+extern SharedBitsAccessor *sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset);
+extern SharedBitsAccessor *sb_initialize(SharedBits *sbits, int participants, int my_participant_number, SharedFileSet *fileset, char *name);
+extern void sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits);
+extern size_t sb_estimate(int participants);
+
+extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bit);
+extern bool sb_checkbit(SharedBitsAccessor *accessor, uint32 n);
+extern BufFile *sb_combine(SharedBitsAccessor *accessor);
+
+extern void sb_end_write(SharedBitsAccessor *sba);
+extern void sb_end_read(SharedBitsAccessor *accessor);
+
+#endif							/* SHAREDBITS_H */
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9754504cc5..99aead8a4a 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,17 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			stripe;		/* stripe number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -49,6 +60,8 @@ extern void sts_reinitialize(SharedTuplestoreAccessor *accessor);
 
 extern void sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor);
 
+extern void sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor);
+
 extern void sts_end_parallel_scan(SharedTuplestoreAccessor *accessor);
 
 extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
@@ -58,4 +71,10 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+extern void sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor);
+
+extern void sts_reset_rewound(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_increment_ntuples(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 3a91c144a2..98a90a85e4 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -443,7 +443,7 @@ $$
 $$);
  original | final 
 ----------+-------
-        1 |     2
+        1 |     4
 (1 row)
 
 rollback to settings;
@@ -478,7 +478,7 @@ $$
 $$);
  original | final 
 ----------+-------
-        1 |     2
+        1 |     4
 (1 row)
 
 rollback to settings;
@@ -1013,3 +1013,944 @@ WHERE
 (1 row)
 
 ROLLBACK;
+-- Serial Adaptive Hash Join
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+ANALYZE probeside, hashside_wide;
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Left Join (actual rows=215 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | btrim | id | hash |                 btrim                  
+------+-------+----+------+----------------------------------------
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    3 |       |  3 |    3 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+      |       |  1 |    1 | unmatched inner tuple in first stripe
+      |       |  1 |    1 | unmatched inner tuple in last stripe
+      |       |  1 |    1 | unmatched inner tuple in middle stripe
+(214 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Right Join (actual rows=214 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash |                 btrim                  
+------+-----------------------+----+------+----------------------------------------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+      |                       |  1 |    1 | unmatched inner tuple in first stripe
+      |                       |  1 |    1 | unmatched inner tuple in last stripe
+      |                       |  1 |    1 | unmatched inner tuple in middle stripe
+(218 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Full Join (actual rows=218 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+/*
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+*/
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Anti Join (actual rows=4 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash |         btrim         
+------+-----------------------
+    1 | unmatched outer tuple
+    2 | 
+    5 | 
+    6 | unmatched outer tuple
+(4 rows)
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | batchno |      btrim      | id | ctid  | hash | btrim 
+------+---------+-----------------+----+-------+------+-------
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 | unmatched outer |    |       |      | 
+(118 rows)
+
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 68c1a8c7b6..1f70300d02 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -538,3 +538,130 @@ WHERE
     AND hjtest_1.a <> hjtest_2.b;
 
 ROLLBACK;
+
+-- Serial Adaptive Hash Join
+
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+
+ANALYZE probeside, hashside_wide;
+
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+
+/*
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+*/
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
-- 
2.20.1

#48

David Kimura

david.g.kimura@gmail.com

over 5 years ago

In reply to: Melanie Plageman (#47)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Wed, Apr 29, 2020 at 4:39 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

In addition to many assorted TODOs in the code, there are a few major
projects left:
- Batch 0 falling back
- Stripe barrier deadlock
- Performance improvements and testing

Batch 0 never spills. That behavior is an artifact of the existing design that
as an optimization special cases batch 0 to fill the initial hash table. This
means it can skip loading and doesn't need to create a batch file.

However in the pathalogical case where all tuples hash to batch 0 there is no
way to redistribute those tuples to other batches. So, existing hash join
implementation allows work_mem to be exceeded for batch 0.

In adaptive hash join approach, there is another way to deal with a batch that
exceeds work_mem. If increasing the number of batches does not work then the
batch can be split into stripes that will not exceed work_mem. Doing this
requires spilling the excess tuples to batch files. Following patch adds logic
to create a batch 0 file for serial hash join so that even in pathalogical case
we do not need to exceed work_mem.

Thanks,
David

Attachments:

v6-0002-Implement-fallback-of-batch-0-for-serial-adaptive.patchapplication/octet-stream; name=v6-0002-Implement-fallback-of-batch-0-for-serial-adaptive.patchDownload

From eb8a463f9c952cb17a88d1666ab4dc2ccefa1b44 Mon Sep 17 00:00:00 2001
From: David Kimura <dkimura@pivotal.io>
Date: Wed, 29 Apr 2020 16:54:36 +0000
Subject: [PATCH v6 2/2] Implement fallback of batch 0 for serial adaptive hash
 join

There is some fuzzyness around concerns of different functions, specifically
ExecHashTableInsert() and ExecHashIncreaseNumBatches().  Existing model allows
insert to succeed and then later adjusts the number of batches or fallback. But
this doesn't address exceeding work_mem until after the fact. Instead this
change makes a decision of whether to insert into hashtable of batch file when
relocating tuples in between batches inside ExecHashIncreaseNumBatches().
---
 src/backend/executor/nodeHash.c     | 17 ++++++-----------
 src/backend/executor/nodeHashjoin.c | 24 ++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6ecbc76ab5..ca8d8f475a 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -925,6 +925,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			childbatch_outgoing_tuples;
 	int			target_batch;
 	FallbackBatchStats *fallback_batch_stats;
+	size_t		currentBatchSize = 0;
 
 	if (hashtable->hashloop_fallback && hashtable->hashloop_fallback[curbatch])
 		return;
@@ -1029,7 +1030,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
 
-			if (batchno == curbatch)
+			if (batchno == curbatch && (curbatch != 0 || currentBatchSize + hashTupleSize < hashtable->spaceAllowed))
 			{
 				/* keep tuple in memory - copy it into the new chunk */
 				HashJoinTuple copyTuple;
@@ -1041,11 +1042,12 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
 				curbatch_outgoing_tuples++;
+				currentBatchSize += hashTupleSize;
 			}
 			else
 			{
 				/* dump it out */
-				Assert(batchno > curbatch);
+				Assert(batchno > curbatch || currentBatchSize + hashTupleSize >= hashtable->spaceAllowed);
 				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
 									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
@@ -1081,13 +1083,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		   hashtable, nfreed, ninmemory, hashtable->spaceUsed);
 #endif
 
-	/*
-	 * For now we do not support fallback in batch 0 as it is a special case
-	 * and assumed to fit in hashtable.
-	 */
-	if (curbatch == 0)
-		return;
-
 	/*
 	 * The same batch should not be marked to fall back more than once
 	 */
@@ -1097,9 +1092,9 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
 		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && childbatch > 0)
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
 		target_batch = childbatch;
-	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && curbatch > 0)
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
 		target_batch = curbatch;
 	else
 		return;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 516067f176..735677ba81 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -507,6 +507,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
 					continue;
 				}
+				if (batchno == 0 && node->hj_HashTable->curstripe == 0 && IsHashloopFallback(hashtable))
+				{
+					bool		shouldFree;
+					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
+																	  &shouldFree);
+
+					/*
+					 * Need to save this outer tuple to a batch since batch 0
+					 * is fallback and we must later rewind.
+					 */
+					Assert(parallel_state == NULL);
+					ExecHashJoinSaveTuple(mintuple, hashvalue,
+										  &hashtable->outerBatchFile[batchno]);
+
+					if (shouldFree)
+						heap_free_minimal_tuple(mintuple);
+				}
 
 				/*
 				 * While probing the phantom stripe, don't increment
@@ -1255,6 +1272,13 @@ ExecHashJoinLoadStripe(HashJoinState *hjstate)
 					(errcode_for_file_access(),
 					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch] && hashtable->curbatch == 0 && hashtable->curstripe == 0)
+	{
+		if (BufFileSeek(hashtable->innerBatchFile[curbatch], 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
+	}
 
 	hashtable->curstripe++;
 
-- 
2.17.1

#49

Melanie Plageman

melanieplageman@gmail.com

over 5 years ago

In reply to: Melanie Plageman (#15)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Apr 28, 2020 at 11:50 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 29/04/2020 05:03, Melanie Plageman wrote:

I've attached a patch which should address some of the previous feedback
about code complexity. Two of my co-workers and I wrote what is
essentially a new prototype of the idea. It uses the main state machine
to route emitting unmatched tuples instead of introducing a separate
state. The logic for falling back is also more developed.

I haven't looked at the patch in detail, but thanks for the commit
message; it describes very well what this is all about. It would be nice
to copy that explanation to the top comment in nodeHashJoin.c in some
form. I think we're missing a high level explanation of how the batching
works even before this new patch, and that commit message does a good
job at it.

Thanks for taking a look, Heikki!

I made a few edits to the message and threw it into a draft patch (on
top of master, of course). I didn't want to junk up peoples' inboxes, so
I didn't start a separate thread, but, it will be pretty hard to
collaboratively edit the comment/ever register it for a commitfest if it
is wedged into this thread. What do you think?

--
Melanie Plageman

Attachments:

v1-0001-Describe-hybrid-hash-join-implementation.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Describe-hybrid-hash-join-implementation.patchDownload

From 1deb1d777693ffcb73c96130ac51b282cd968577 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH v1] Describe hybrid hash join implementation

This is just a draft to spark conversation on what a good comment might
be like in this file on how the hybrid hash join algorithm is
implemented in Postgres. I'm pretty sure this is the accepted term for
this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 36 +++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index cc8edacdd0..86bfdaef7f 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,42 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ *   HYBRID HASH JOIN
+ *
+ *  If the inner side tuples of a hash join do not fit in memory, the hash join
+ *  can be executed in multiple batches.
+ *
+ *  If the statistics on the inner side relation are accurate, planner chooses a
+ *  multi-batch strategy and estimates the number of batches.
+ *
+ *  The query executor measures the real size of the hashtable and increases the
+ *  number of batches if the hashtable grows too large.
+ *
+ *  The number of batches is always a power of two, so an increase in the number
+ *  of batches doubles it.
+ *
+ *  Serial hash join measures batch size lazily -- waiting until it is loading a
+ *  batch to determine if it will fit in memory. While inserting tuples into the
+ *  hashtable, serial hash join will, if that tuple were to exceed work_mem,
+ *  dump out the hashtable and reassign them either to other batch files or the
+ *  current batch resident in the hashtable.
+ *
+ *  Parallel hash join, on the other hand, completes all changes to the number
+ *  of batches during the build phase. If it increases the number of batches, it
+ *  dumps out all the tuples from all batches and reassigns them to entirely new
+ *  batch files. Then it checks every batch to ensure it will fit in the space
+ *  budget for the query.
+ *
+ *  In both parallel and serial hash join, the executor currently makes a best
+ *  effort. If a particular batch will not fit in memory, it tries doubling the
+ *  number of batches. If after a batch increase, there is a batch which
+ *  retained all or none of its tuples, the executor disables growth in the
+ *  number of batches globally. After growth is disabled, all batches that would
+ *  have previously triggered an increase in the number of batches instead
+ *  exceed the space allowed.
+ *
+ *  TODO: should we discuss that tuples can only spill forward?
+ *
  * PARALLELISM
  *
  * Hash joins can participate in parallel query execution in several ways.  A
-- 
2.20.1

Import Notes

Reply to msg id not found: 37fe7ca4-79d9-6d11-a305-32b3fccdef7e@iki.fi

#50

Alvaro Herrera

alvherre@2ndquadrant.com

over 5 years ago

In reply to: Melanie Plageman (#49)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On 2020-Apr-30, Melanie Plageman wrote:

On Tue, Apr 28, 2020 at 11:50 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I haven't looked at the patch in detail, but thanks [...]

Thanks for taking a look, Heikki!

Hmm. We don't have Heikki's message in the archives. In fact, the last
message from Heikki we seem to have in any list is
cca4e4dc-32ac-b9ab-039d-98dcb5650791@iki.fi dated February 19 in
pgsql-bugs. I wonder if there's some problem between Heikki and the
lists.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#51

Thomas Munro

thomas.munro@gmail.com

over 5 years ago

In reply to: Melanie Plageman (#49)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Fri, May 1, 2020 at 2:30 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I made a few edits to the message and threw it into a draft patch (on
top of master, of course). I didn't want to junk up peoples' inboxes, so
I didn't start a separate thread, but, it will be pretty hard to
collaboratively edit the comment/ever register it for a commitfest if it
is wedged into this thread. What do you think?

+1, this is a good description and I'm sure you're right about the
name of the algorithm. It's a "hybrid" between a simple no partition
hash join, and partitioning like the Grace machine, since batch 0 is
processed directly without touching the disk.

You mention that PHJ finalises the number of batches during build
phase while SHJ can extend it later. There's also a difference in the
probe phase: although inner batch 0 is loaded into the hash table
directly and not written to disk during the build phase (= classic
hybrid, just like the serial algorithm), outer batch 0 *is* written
out to disk at the start of the probe phase (unlike classic hybrid at
least as we have it for serial hash join). That's because I couldn't
figure out how to begin emitting tuples before partitioning was
finished, without breaking the deadlock-avoidance programming rule
that you can't let the program counter escape from the node when
someone might wait for you. So maybe it's erm, a hybrid between
hybrid and Grace...

#52

David Kimura

david.g.kimura@gmail.com

over 5 years ago

In reply to: David Kimura (#48)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Wed, Apr 29, 2020 at 4:44 PM David Kimura <david.g.kimura@gmail.com> wrote:

Following patch adds logic to create a batch 0 file for serial hash join so
that even in pathalogical case we do not need to exceed work_mem.

Updated the patch to spill batch 0 tuples after it is marked as fallback.

A couple questions from looking more at serial code:

1) Does the current pattern to repartition batches *after* the previous
hashtable insert exceeds work_mem still make sense?

In that case we'd allow ourselves to exceed work_mem by one tuple. If that
doesn't seem correct anymore then I think we can move the space exceeded
check in ExecHashTableInsert() *before* actual hashtable insert.

2) After batch 0 is marked fallback, does the logic to insert into its batch
file fit more in MultiExecPrivateHash() or ExecHashTableInsert()?

The latter already has logic to decide whether to insert into hashtable or
batchfile

Thanks,
David

Attachments:

v6-0002-Implement-fallback-of-batch-0-for-serial-adaptive.patchapplication/x-patch; name=v6-0002-Implement-fallback-of-batch-0-for-serial-adaptive.patchDownload

From f0a3bbed9c80ad304f6cea9ace33534be4f4c3cd Mon Sep 17 00:00:00 2001
From: David Kimura <dkimura@pivotal.io>
Date: Wed, 29 Apr 2020 16:54:36 +0000
Subject: [PATCH v6 2/2] Implement fallback of batch 0 for serial adaptive hash
 join

There is some fuzzyness around concerns of different functions, specifically
ExecHashTableInsert() and ExecHashIncreaseNumBatches().  Existing model allows
insert to succeed and then later adjusts the number of batches or fallback. But
this doesn't address exceeding work_mem until after the fact. Instead this
change makes a decision of whether to insert into hashtable of batch file when
relocating tuples in between batches inside ExecHashIncreaseNumBatches().
---
 src/backend/executor/nodeHash.c     | 43 +++++++++++++++++++++--------
 src/backend/executor/nodeHashjoin.c | 17 ++++++++++++
 2 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6ecbc76ab5..9340db9fb7 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -183,12 +183,36 @@ MultiExecPrivateHash(HashState *node)
 			else
 			{
 				/* Not subject to skew optimization, so insert normally */
-				ExecHashTableInsert(hashtable, slot, hashvalue);
+				int			bucketno;
+				int			batchno;
+				bool		shouldFree;
+				MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+							  &bucketno, &batchno);
+				if (hashtable->hashloop_fallback && hashtable->hashloop_fallback[0])
+					ExecHashJoinSaveTuple(tuple,
+										  hashvalue,
+										  &hashtable->innerBatchFile[batchno]);
+				else
+					ExecHashTableInsert(hashtable, slot, hashvalue);
+
+				if (shouldFree)
+					heap_free_minimal_tuple(tuple);
+
 			}
 			hashtable->totalTuples += 1;
 		}
 	}
 
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[0])
+	{
+		if (BufFileSeek(hashtable->innerBatchFile[0], 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
+	}
+
 	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
 	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
 		ExecHashIncreaseNumBuckets(hashtable);
@@ -925,6 +949,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			childbatch_outgoing_tuples;
 	int			target_batch;
 	FallbackBatchStats *fallback_batch_stats;
+	size_t		currentBatchSize = 0;
 
 	if (hashtable->hashloop_fallback && hashtable->hashloop_fallback[curbatch])
 		return;
@@ -1029,7 +1054,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
 
-			if (batchno == curbatch)
+			if (batchno == curbatch && (curbatch != 0 || currentBatchSize + hashTupleSize < hashtable->spaceAllowed))
 			{
 				/* keep tuple in memory - copy it into the new chunk */
 				HashJoinTuple copyTuple;
@@ -1041,11 +1066,12 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
 				curbatch_outgoing_tuples++;
+				currentBatchSize += hashTupleSize;
 			}
 			else
 			{
 				/* dump it out */
-				Assert(batchno > curbatch);
+				Assert(batchno > curbatch || currentBatchSize + hashTupleSize >= hashtable->spaceAllowed);
 				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
 									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
@@ -1081,13 +1107,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		   hashtable, nfreed, ninmemory, hashtable->spaceUsed);
 #endif
 
-	/*
-	 * For now we do not support fallback in batch 0 as it is a special case
-	 * and assumed to fit in hashtable.
-	 */
-	if (curbatch == 0)
-		return;
-
 	/*
 	 * The same batch should not be marked to fall back more than once
 	 */
@@ -1097,9 +1116,9 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
 		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && childbatch > 0)
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
 		target_batch = childbatch;
-	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && curbatch > 0)
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
 		target_batch = curbatch;
 	else
 		return;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 516067f176..8f3f4d4b44 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -507,6 +507,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
 					continue;
 				}
+				if (batchno == 0 && node->hj_HashTable->curstripe == 0 && IsHashloopFallback(hashtable))
+				{
+					bool		shouldFree;
+					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
+																	  &shouldFree);
+
+					/*
+					 * Need to save this outer tuple to a batch since batch 0
+					 * is fallback and we must later rewind.
+					 */
+					Assert(parallel_state == NULL);
+					ExecHashJoinSaveTuple(mintuple, hashvalue,
+										  &hashtable->outerBatchFile[batchno]);
+
+					if (shouldFree)
+						heap_free_minimal_tuple(mintuple);
+				}
 
 				/*
 				 * While probing the phantom stripe, don't increment
-- 
2.17.0

#53

Melanie Plageman

melanieplageman@gmail.com

over 5 years ago

In reply to: Melanie Plageman (#47)

2 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Apr 28, 2020 at 7:03 PM Melanie Plageman <melanieplageman@gmail.com>
wrote:

There is a deadlock hazard in parallel hashjoin (pointed out by Thomas
Munro in the past). Workers attached to the stripe_barrier emit tuples
and then wait on that barrier.
I believe that that can be addressed starting with this
relatively unoptimized solution:
- after probing a stripe in a batch, a worker sets the status of that
batch to "tentatively done" and saves the stripe_barrier phase
- if that worker is not the only worker attached to that batch, it
detaches from both stripe and batch barriers and moves on to other
batches
- if that worker is the only worker attached to the batch, it will
proceed to load the next stripe of that batch, and, once it has
finished loading, it will set the status of the batch back to "not
done" for itself
- when the other worker encounters that batch again, if the
stripe_barrier phase has not moved forward, it will mark that batch as
done for itself. if the stripe_barrier phase has moved forward, it can
join in in probing this batch for the current stripe.

Just to follow-up on the stripe barrier deadlock, I've implemented a
solution and attached it.

There are three solutions I've thought about so far:

1) leaders don't participate in fallback batches
2) serial after stripe 0
no worker can join a batch after any worker has left and only one
worker can work on stripes after stripe 0
3) provisionally complete batches
After the end of stripe 0, all workers except the last worker
detach from the stripe barrier, mark the batch as provisionally
done, save the stripe barrier phase, and move on to another batch.
Later, when one of these workers returns to the batch, if it is
not already done, the worker checks to see if the phase of the
stripe barrier has advanced. If the phase has advanced, it means
that no one is waiting for that worker. The worker can join that
batch. If the phase hasn't advanced, the worker won't risk
deadlock and will simply mark the batch as done. The last worker
executes the normal path -- participating in each stripe.

I've attached a patch to implement solution 3
v7-0002-Provisionally-detach-unless-last-worker.patch

This isn't a very optimized version of this solution. It detaches from
the stripe barrier and closes the outer match status bitmap upon
provisional completion by a worker. However, I ran into some problems
keeping outer match status bitmaps open for multiple batches at a time.

I've also attached the original adaptive hashjoin patch with a couple
small tweaks (not quite meriting a patch version bump, but that seemed
like the easiest way).

--
Melanie Plageman

Attachments:

v7-0002-Provisionally-detach-unless-last-worker.patchtext/x-patch; charset=US-ASCII; name=v7-0002-Provisionally-detach-unless-last-worker.patchDownload

From 1ae6e34d38e236cf350d340dd23c168dbba612f8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 30 Apr 2020 10:08:38 -0700
Subject: [PATCH v7 2/2] Provisionally detach unless last worker

To solve the deadlock hazard of waiting on the stripe_barrier after
emitting tuples, provisionally detach from the stripe_barrier if you are
not the last worker. Save the state that the stripe_barrier was in.
Later, check this batch again and, if the stripe_barrier has not moved
forward since you last worked on it, call it done and detach for good.

Note that this patch could be much more efficient if workers did not
detach from the stripe barrier and close their outer match status
bitmaps after failing to be the last worker. When they rejoin, they will
have to create new bitmaps and re-attach to the stripe barrier.

Originally, this patch had workers keep their bitmaps open, however,
there were some synchronization problems with workers having outer match
status bitmaps for multiple batches open at the same time.
---
 src/backend/executor/nodeHash.c     |  6 ++-
 src/backend/executor/nodeHashjoin.c | 82 +++++++++++++++++++++++++----
 src/backend/storage/ipc/barrier.c   |  2 +-
 src/include/executor/hashjoin.h     | 11 +++-
 4 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index ebfd8f8410..25bfcbace5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -3139,6 +3139,9 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 				BarrierArriveAndWait(&shared->stripe_barrier, 0);
 			BarrierDetach(&shared->stripe_barrier);
 		}
+		accessor->last_participating_stripe_phase = -3;
+		/* why isn't done initialized here ? */
+		accessor->done = -1;
 
 		/* Initialize accessor state.  All members were zero-initialized. */
 		accessor->shared = shared;
@@ -3241,7 +3244,8 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
-		accessor->done = false;
+		accessor->done = -1;
+		accessor->last_participating_stripe_phase = -3;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index b87d32ad8e..87a854572d 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1276,6 +1276,7 @@ ExecHashJoinLoadStripe(HashJoinState *hjstate)
 		 * possible for hashtable->nbatch to be increased here!
 		 */
 		uint32		hashTupleSize;
+
 		/*
 		 * TODO: wouldn't it be cool if this returned the size of the tuple
 		 * inserted
@@ -1360,9 +1361,17 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 
 	if (hashtable->curbatch >= 0)
 	{
+		ParallelHashJoinBatchAccessor *batch_accessor = &hashtable->batches[hashtable->curbatch];
 		if (IsHashloopFallback(hashtable))
+		{
 			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
-		hashtable->batches[hashtable->curbatch].done = true;
+			if (batch_accessor->last_participating_stripe_phase > -3)
+				batch_accessor->done = 0;
+			else
+				batch_accessor->done = 1;
+		}
+		else
+			batch_accessor->done = 1;
 		ExecHashTableDetachBatch(hashtable);
 	}
 
@@ -1376,7 +1385,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		hashtable->nbatch;
 	do
 	{
-		if (!hashtable->batches[batchno].done)
+		if (hashtable->batches[batchno].done != 1)
 		{
 			Barrier    *batch_barrier =
 			&hashtable->batches[batchno].shared->batch_barrier;
@@ -1413,8 +1422,21 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 						sb_initialize_accessor(hashtable->batches[hashtable->curbatch].sba,
 											   sts_get_tuplenum(hashtable->batches[hashtable->curbatch].outer_tuples));
 					hashtable->curstripe = -1;
-					ExecParallelHashJoinLoadStripe(hjstate);
-					return true;
+					if (ExecParallelHashJoinLoadStripe(hjstate))
+						return true;
+					/*
+					 * ExecParallelHashJoinLoadStripe() will return false from
+					 * here when no more work can be done by this worker on
+					 * this batch. Until further optimized, this worker will
+					 * have detached from the stripe_barrier and should close
+					 * its outer match statuses bitmap and then detach from the
+					 * batch. In order to reuse the code below, fall through,
+					 * even though the phase will not have been advanced
+					 */
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_end_write(hashtable->batches[batchno].sba);
+
+					/* Fall through. */
 
 				case PHJ_BATCH_DONE:
 
@@ -1423,7 +1445,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * remain).
 					 */
 					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
+					hashtable->batches[batchno].done = 1;
 					hashtable->curbatch = -1;
 					break;
 
@@ -1461,11 +1483,49 @@ ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
 
 	if (hashtable->curstripe >= 0)
 	{
-		BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_PROBING);
+		/*
+		 * After finishing with participating in a stripe, if a worker is the
+		 * only one working on a batch, it will continue working on it.
+		 * However, if a worker is not the only worker working on a batch, it
+		 * would risk deadlock if it waits on the barrier. Instead, it saves
+		 * the current stripe phase and move on. Later, when it comes back to
+		 * this batch, if the stripe phase hasn't advanced from when it last
+		 * participated, it will mark the batch done and never return. If the
+		 * stripe barrier has advanced, then, it will participate again in the
+		 * batch.
+		 */
+		if (!BarrierArriveAndDetach(stripe_barrier))
+		{
+			hashtable->batches[batchno].last_participating_stripe_phase = BarrierPhase(stripe_barrier);
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+			hashtable->curstripe = -1;
+			return false;
+		}
+
+		/*
+		 * This isn't a race condition if no other workers can stay attached to
+		 * this barrier in the intervening time. Basically, if you attach to a
+		 * stripe barrier in the PHJ_STRIPE_DONE phase,
+		 * detach immediately and move on.
+		 */
+		BarrierAttach(stripe_barrier);
 	}
 	else if (hashtable->curstripe == -1)
 	{
-		int			phase = BarrierAttach(stripe_barrier);
+		ParallelHashJoinBatchAccessor *batch_accessor = &hashtable->batches[batchno];
+		int			phase;
+
+		phase = BarrierAttach(stripe_barrier);
+
+		/*
+		 * If the phase hasn't advanced since the last time this worker
+		 * checked, detach and return to pick another batch. Only check this
+		 * if the worker has worked on this batch before. Workers are not permitted
+		 * to join after the batch has progressed past its first stripe.
+		 */
+		if (batch_accessor->done == 0 &&
+			batch_accessor->last_participating_stripe_phase == phase)
+			return ExecHashTableDetachStripe(hashtable);
 
 		/*
 		 * If a worker enters this phase machine on a stripe number greater
@@ -1474,10 +1534,10 @@ ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
 		 * fallback Either way the worker can't contribute so just detach and
 		 * move on.
 		 */
-		if (PHJ_STRIPE_NUMBER(phase) > batch->maximum_stripe_number)
-			return ExecHashTableDetachStripe(hashtable);
 
-		hashtable->curstripe = PHJ_STRIPE_NUMBER(phase);
+		if (PHJ_STRIPE_NUMBER(phase) > batch->maximum_stripe_number ||
+			PHJ_STRIPE_PHASE(phase) == PHJ_STRIPE_DONE)
+			return ExecHashTableDetachStripe(hashtable);
 	}
 	else if (hashtable->curstripe == -2)
 	{
@@ -1490,6 +1550,8 @@ ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
 		return ExecHashTableDetachStripe(hashtable);
 	}
 
+	hashtable->curstripe = PHJ_STRIPE_NUMBER(BarrierPhase(stripe_barrier));
+
 	/*
 	 * The outer side is exhausted and either 1) the current stripe of the
 	 * inner side is exhausted and it is time to advance the stripe 2) the
diff --git a/src/backend/storage/ipc/barrier.c b/src/backend/storage/ipc/barrier.c
index 3e200e02cc..2bfd7e6052 100644
--- a/src/backend/storage/ipc/barrier.c
+++ b/src/backend/storage/ipc/barrier.c
@@ -308,4 +308,4 @@ BarrierDetachImpl(Barrier *barrier, bool arrive)
 		ConditionVariableBroadcast(&barrier->condition_variable);
 
 	return last;
-}
+}
\ No newline at end of file
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 9ffcd84806..8d232a1304 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -224,10 +224,12 @@ typedef struct ParallelHashJoinBatchAccessor
 	size_t		old_ntuples;	/* how many tuples before repartitioning? */
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
-	bool		done;			/* flag to remember that a batch is done */
+	int			done;			/* flag to remember that a batch is done */
+	/* -1 for not done, 0 for tentatively done, 1 for done */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
 	SharedBitsAccessor *sba;
+	int			last_participating_stripe_phase;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -362,7 +364,12 @@ typedef struct HashJoinTableData
 	 */
 	BufFile   **hashloop_fallback;	/* outer match status files if fall back */
 	List	   *fallback_batches_stats; /* per hashjoin batch statistics */
-	int			curstripe;		/* current stripe #; 0 on 1st pass, -2 on phantom stripe */
+
+	/*
+	 * current stripe #; 0 during 1st pass, -1 when detached -2 on phantom
+	 * stripe
+	 */
+	int			curstripe;
 
 	/*
 	 * Info about the datatype-specific hash functions for the datatypes being
-- 
2.20.1

v7-0001-Implement-Adaptive-Hashjoin.patchtext/x-patch; charset=US-ASCII; name=v7-0001-Implement-Adaptive-Hashjoin.patchDownload

From fdcd2c5eddae7285d4fdeb166b0b88170557a2c0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 26 Feb 2020 09:18:29 -0800
Subject: [PATCH v7 1/2] Implement Adaptive Hashjoin

If the inner side tuples of a hashjoin will not fit in memory, the
hashjoin can be executed in multiple batches. If the statistics on the
inner side relation are accurate, planner chooses a multi-batch
strategy and sets the number of batches.
The query executor measures the real size of the hashtable and increases
the number of batches if the hashtable grows too large.

The number of batches is always a power of two, so an increase in the
number of batches doubles it.

Serial hashjoin measures batch size lazily -- waiting until it is
loading a batch to determine if it will fit in memory.

Parallel hashjoin, on the other hand, completes all changes to the
number of batches during the build phase. If it doubles the number of
batches, it dumps all the tuples out, reassigns them to batches,
measures each batch, and checks that it will fit in the space allowed.

In both cases, the executor currently makes a best effort. If a
particular batch won't fit in memory, and, upon changing the number of
batches none of the tuples move to a new batch, the executor disables
growth in the number of batches globally. After growth is disabled, all
batches that would have previously triggered an increase in the number
of batches instead exceed the space allowed.

There is no mechanism to perform a hashjoin within memory constraints if
a run of tuples hash to the same batch. Also, hashjoin will continue to
double the number of batches if *some* tuples move each time -- even if
the batch will never fit in memory -- resulting in an explosion in the
number of batches (affecting performance negatively for multiple
reasons).

Adaptive hashjoin is a mechanism to process a run of inner side tuples
with join keys which hash to the same batch in a manner that is
efficient and respects the space allowed.

When an offending batch causes the number of batches to be doubled and
some percentage of the tuples would not move to a new batch, that batch
can be marked to "fall back". This mechanism replaces serial hashjoin's
"grow_enabled" flag and replaces part of the functionality of parallel
hashjoin's "growth = PHJ_GROWTH_DISABLED" flag. However, instead of
disabling growth in the number of batches for all batches, it only
prevents this batch from causing another increase in the number of
batches.

When the inner side of this batch is loaded into memory, stripes of
arbitrary tuples totaling work_mem in size are loaded into the
hashtable. After probing this stripe, the outer side batch is rewound
and the next stripe is loaded. Each stripe of inner is probed until all
tuples have been processed.

Tuples that match are emitted (depending on the join semantics of the
particular join type) during probing of a stripe. In order to make
left outer join work, unmatched tuples cannot be emitted NULL-extended
until all stripes have been probed. To address this, a bitmap is created
with a bit for each tuple of the outer side. If a tuple on the outer
side matches a tuple from the inner, the corresponding bit is set. At
the end of probing all stripes, the executor scans the bitmap and emits
unmatched outer tuples.

TODOs:
- Batch 0 falling back
- Implement stripe_barrier deadlock fix
- Fix semi-join
- Stripe instrumentation for parallel adaptive hashjoin
- Do benchmarking and experiment with different fallback threshholds
  (currently hardcoded to 80% but more parameterizable than before)
- Assorted TODOs in the code

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
Co-authored-by: David Kimura <dkimura@pivotal.io>
---
 src/backend/commands/explain.c            |  43 +-
 src/backend/executor/nodeHash.c           | 305 +++++--
 src/backend/executor/nodeHashjoin.c       | 656 ++++++++++++---
 src/backend/postmaster/pgstat.c           |  13 +-
 src/backend/utils/sort/Makefile           |   1 +
 src/backend/utils/sort/sharedbits.c       | 285 +++++++
 src/backend/utils/sort/sharedtuplestore.c | 112 ++-
 src/include/commands/explain.h            |   1 +
 src/include/executor/hashjoin.h           |  47 +-
 src/include/executor/instrument.h         |   7 +
 src/include/executor/nodeHash.h           |   1 +
 src/include/executor/tuptable.h           |   2 +
 src/include/nodes/execnodes.h             |   5 +
 src/include/pgstat.h                      |   5 +-
 src/include/utils/sharedbits.h            |  39 +
 src/include/utils/sharedtuplestore.h      |  19 +
 src/test/regress/expected/join_hash.out   | 945 +++++++++++++++++++++-
 src/test/regress/sql/join_hash.sql        | 127 +++
 18 files changed, 2447 insertions(+), 166 deletions(-)
 create mode 100644 src/backend/utils/sort/sharedbits.c
 create mode 100644 src/include/utils/sharedbits.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1275bec673..fb5272e8c5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -184,6 +184,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			es->wal = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "settings") == 0)
 			es->settings = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "usage") == 0)
+			es->usage = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "timing") == 0)
 		{
 			timing_set = true;
@@ -312,6 +314,7 @@ NewExplainState(void)
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
+	es->usage = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
@@ -3026,22 +3029,50 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
 				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)",
 							 hinstrument.nbuckets,
 							 hinstrument.nbuckets_original,
 							 hinstrument.nbatch,
-							 hinstrument.nbatch_original,
-							 spacePeakKb);
+							 hinstrument.nbatch_original);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Batch: %d  Stripes: %d\n", fbs->batchno, fbs->numstripes);
+			}
 		}
 		else
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
-							 hinstrument.nbuckets, hinstrument.nbatch,
-							 spacePeakKb);
+							 "Buckets: %d  Batches: %d",
+							 hinstrument.nbuckets, hinstrument.nbatch);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batch: %d  Stripes: %d\n",
+								 fbs->batchno,
+								 fbs->numstripes);
+			}
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5da13ada72..ebfd8f8410 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -80,7 +80,6 @@ static bool ExecParallelHashTuplePrealloc(HashJoinTable hashtable,
 static void ExecParallelHashMergeCounters(HashJoinTable hashtable);
 static void ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable);
 
-
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -321,6 +320,27 @@ MultiExecParallelHash(HashState *node)
 				 * skew).
 				 */
 				pstate->growth = PHJ_GROWTH_DISABLED;
+
+				/*
+				 * In the current design, batch 0 cannot fall back. That
+				 * behavior is an artifact of the existing design where batch
+				 * 0 fills the initial hash table and as an optimization it
+				 * doesn't need a batch file. But, there is no real reason
+				 * that batch 0 shouldn't be allowed to spill.
+				 *
+				 * Consider a hash table where majority of tuples with
+				 * hashvalue 0. These tuples will never relocate no matter how
+				 * many batches exist. If you cannot exceed work_mem, then you
+				 * will be stuck infinitely trying to double the number of
+				 * batches in order to accommodate the tuples that can only
+				 * ever be in batch 0. So, we allow it to be set to fall back
+				 * during the build phase to avoid excessive batch increases
+				 * but we don't check it when loading the actual tuples, so we
+				 * may exceed space_allowed. We set it back to false here so
+				 * that it isn't true during any of the checks that may happen
+				 * during probing.
+				 */
+				hashtable->batches[0].shared->hashloop_fallback = false;
 			}
 	}
 
@@ -495,12 +515,14 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->curbatch = 0;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
-	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
 	hashtable->partialTuples = 0;
 	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
+	hashtable->hashloop_fallback = NULL;
+	hashtable->fallback_batches_stats = NULL;
+	hashtable->curstripe = -1;
 	hashtable->spaceUsed = 0;
 	hashtable->spacePeak = 0;
 	hashtable->spaceAllowed = space_allowed;
@@ -572,6 +594,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
@@ -866,6 +890,8 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 				BufFileClose(hashtable->innerBatchFile[i]);
 			if (hashtable->outerBatchFile[i])
 				BufFileClose(hashtable->outerBatchFile[i]);
+			if (hashtable->hashloop_fallback[i])
+				BufFileClose(hashtable->hashloop_fallback[i]);
 		}
 	}
 
@@ -876,6 +902,9 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/* Threshhold for tuple relocation during batch split for parallel and serial */
+#define MAX_RELOCATION 0.8
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
@@ -886,14 +915,18 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 {
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
+	int			childbatch;
 	int			nbatch;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
 	HashMemoryChunk oldchunks;
+	int			curbatch_outgoing_tuples;
+	int			childbatch_outgoing_tuples;
+	int			target_batch;
+	FallbackBatchStats *fallback_batch_stats;
 
-	/* do nothing if we've decided to shut off growth */
-	if (!hashtable->growEnabled)
+	if (hashtable->hashloop_fallback && hashtable->hashloop_fallback[curbatch])
 		return;
 
 	/* safety check to avoid overflow */
@@ -917,6 +950,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* time to establish the temp tablespaces, too */
 		PrepareTempTablespaces();
 	}
@@ -927,10 +962,14 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			repalloc(hashtable->innerBatchFile, nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			repalloc(hashtable->outerBatchFile, nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			repalloc(hashtable->hashloop_fallback, nbatch * sizeof(BufFile *));
 		MemSet(hashtable->innerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
 		MemSet(hashtable->outerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
+		MemSet(hashtable->hashloop_fallback + oldnbatch, 0,
+			   (nbatch - oldnbatch) * sizeof(BufFile *));
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -942,6 +981,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * no longer of the current batch.
 	 */
 	ninmemory = nfreed = 0;
+	curbatch_outgoing_tuples = childbatch_outgoing_tuples = 0;
+	childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;
 
 	/* If know we need to resize nbuckets, we can do it while rebatching. */
 	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
@@ -999,6 +1040,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* and add it back to the appropriate bucket */
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
+				curbatch_outgoing_tuples++;
 			}
 			else
 			{
@@ -1010,6 +1052,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 
 				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
+
+				/*
+				 * TODO: what to do about tuples that don't go to the child
+				 * batch or stay in the current batch? (this is why we are
+				 * counting tuples to child and curbatch with two diff
+				 * variables in case the tuples go to a batch that isn't the
+				 * child)
+				 */
+				if (batchno == childbatch)
+					childbatch_outgoing_tuples++;
 			}
 
 			/* next tuple in this chunk */
@@ -1030,21 +1082,33 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 #endif
 
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
-	 * further expansion of nbatch.  This situation implies that we have
-	 * enough tuples of identical hashvalues to overflow spaceAllowed.
-	 * Increasing nbatch will not fix it since there's no way to subdivide the
-	 * group any more finely. We have to just gut it out and hope the server
-	 * has enough RAM.
+	 * For now we do not support fallback in batch 0 as it is a special case
+	 * and assumed to fit in hashtable.
+	 */
+	if (curbatch == 0)
+		return;
+
+	/*
+	 * The same batch should not be marked to fall back more than once
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
-	{
-		hashtable->growEnabled = false;
 #ifdef HJDEBUG
-		printf("Hashjoin %p: disabling further increase of nbatch\n",
-			   hashtable);
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("childbatch %i targeted to fallback.", childbatch);
+	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	}
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && childbatch > 0)
+		target_batch = childbatch;
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && curbatch > 0)
+		target_batch = curbatch;
+	else
+		return;
+	hashtable->hashloop_fallback[target_batch] = BufFileCreateTemp(false);
+
+	fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = target_batch;
+	fallback_batch_stats->numstripes = 0;
+	hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
 }
 
 /*
@@ -1213,7 +1277,6 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 									 WAIT_EVENT_HASH_GROW_BATCHES_DECIDING))
 			{
 				bool		space_exhausted = false;
-				bool		extreme_skew_detected = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
@@ -1224,27 +1287,50 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				{
 					ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
 
+					/*
+					 * All batches were just created anew during
+					 * repartitioning
+					 */
+					Assert(!batch->hashloop_fallback);
+
+					/*
+					 * At the time of repartitioning, each batch updates its
+					 * estimated_size to reflect the size of the batch file on
+					 * disk. It is also updated when increasing preallocated
+					 * space in ExecParallelHashTuplePrealloc().  However,
+					 * batch 0 does not store anything on disk so it has no
+					 * estimated_size.
+					 *
+					 * We still want to allow batch 0 to trigger batch growth.
+					 * In order to do that, for batch 0 check whether the
+					 * actual size exceeds space_allowed. It is a little
+					 * backwards at this point as we would have already
+					 * exceeded inserted the allowed space.
+					 */
 					if (batch->space_exhausted ||
-						batch->estimated_size > pstate->space_allowed)
+						batch->estimated_size > pstate->space_allowed ||
+						batch->size > pstate->space_allowed)
 					{
 						int			parent;
+						float		frac_moved;
 
 						space_exhausted = true;
 
-						/*
-						 * Did this batch receive ALL of the tuples from its
-						 * parent batch?  That would indicate that further
-						 * repartitioning isn't going to help (the hash values
-						 * are probably all the same).
-						 */
 						parent = i % pstate->old_nbatch;
-						if (batch->ntuples == hashtable->batches[parent].shared->old_ntuples)
-							extreme_skew_detected = true;
+						frac_moved = batch->ntuples / (float) hashtable->batches[parent].shared->old_ntuples;
+
+						if (frac_moved >= MAX_RELOCATION)
+						{
+							batch->hashloop_fallback = true;
+							space_exhausted = false;
+						}
 					}
+					if (space_exhausted)
+						break;
 				}
 
-				/* Don't keep growing if it's not helping or we'd overflow. */
-				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+				/* Don't keep growing if we'd overflow. */
+				if (hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -1311,11 +1397,28 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 			{
 				size_t		tuple_size =
 				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+				tupleMetadata metadata;
 
 				/* It belongs in a later batch. */
+				ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+
+				LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+				if (batch->estimated_stripe_size + tuple_size > hashtable->parallel_state->space_allowed)
+				{
+					batch->maximum_stripe_number++;
+					batch->estimated_stripe_size = 0;
+				}
+
+				batch->estimated_stripe_size += tuple_size;
+
+				metadata.hashvalue = hashTuple->hashvalue;
+				metadata.stripe = batch->maximum_stripe_number;
+				LWLockRelease(&batch->lock);
+
 				hashtable->batches[batchno].estimated_size += tuple_size;
-				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+
+				sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 			}
 
 			/* Count this tuple. */
@@ -1363,27 +1466,41 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 	for (i = 1; i < old_nbatch; ++i)
 	{
 		MinimalTuple tuple;
-		uint32		hashvalue;
+		tupleMetadata metadata;
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &metadata.hashvalue)))
 		{
 			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
+			ParallelHashJoinBatch *batch;
 
 			/* Decide which partition it goes to in the new generation. */
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
 
 			hashtable->batches[batchno].estimated_size += tuple_size;
 			++hashtable->batches[batchno].ntuples;
 			++hashtable->batches[i].old_ntuples;
 
+			batch = hashtable->batches[batchno].shared;
+
 			/* Store the tuple its new batch. */
-			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+			LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+			if (batch->estimated_stripe_size + tuple_size > pstate->space_allowed)
+			{
+				batch->maximum_stripe_number++;
+				batch->estimated_stripe_size = 0;
+			}
+			batch->estimated_stripe_size += tuple_size;
+			metadata.stripe = batch->maximum_stripe_number;
+			LWLockRelease(&batch->lock);
+			/* Store the tuple its new batch. */
+			sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 
 			CHECK_FOR_INTERRUPTS();
 		}
@@ -1693,6 +1810,12 @@ retry:
 
 	if (batchno == 0)
 	{
+		/*
+		 * TODO: if spilling is enabled for batch 0 so that it can fall back,
+		 * we will need to stop loading batch 0 into the hashtable somewhere--
+		 * maybe here-- and switch to saving tuples to a file. Currently, this
+		 * will simply exceed the space allowed
+		 */
 		HashJoinTuple hashTuple;
 
 		/* Try to load it into memory. */
@@ -1715,10 +1838,17 @@ retry:
 	else
 	{
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+		ParallelHashJoinBatch *batch;
+		tupleMetadata metadata;
 
 		Assert(batchno > 0);
 
 		/* Try to preallocate space in the batch if necessary. */
+
+		/*
+		 * TODO: is it okay to only count the tuple when it doesn't fit in the
+		 * preallocated memory?
+		 */
 		if (hashtable->batches[batchno].preallocated < tuple_size)
 		{
 			if (!ExecParallelHashTuplePrealloc(hashtable, batchno, tuple_size))
@@ -1727,8 +1857,14 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
-					 tuple);
+		batch = hashtable->batches[batchno].shared;
+
+		metadata.hashvalue = hashvalue;
+		LWLockAcquire(&batch->lock, LW_SHARED);
+		metadata.stripe = batch->maximum_stripe_number;
+		LWLockRelease(&batch->lock);
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
 
@@ -2697,6 +2833,7 @@ ExecHashAccumInstrumentation(HashInstrumentation *instrument,
 									  hashtable->nbatch_original);
 	instrument->space_peak = Max(instrument->space_peak,
 								 hashtable->spacePeak);
+	instrument->fallback_batches_stats = hashtable->fallback_batches_stats;
 }
 
 /*
@@ -2850,6 +2987,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 	/* Check if it's time to grow batches or buckets. */
 	if (pstate->growth != PHJ_GROWTH_DISABLED)
 	{
+		ParallelHashJoinBatchAccessor batch = hashtable->batches[0];
+
 		Assert(curbatch == 0);
 		Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
 
@@ -2858,8 +2997,13 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		 * very large tuples or very low work_mem setting, we'll always allow
 		 * each backend to allocate at least one chunk.
 		 */
-		if (hashtable->batches[0].at_least_one_chunk &&
-			hashtable->batches[0].shared->size +
+
+		/*
+		 * TODO: get rid of this check for batch 0 and make it so that
+		 * batch 0 always has to keep trying to increase the number of batches
+		 */
+		if (!batch.shared->hashloop_fallback && batch.at_least_one_chunk &&
+			batch.shared->size +
 			chunk_size > pstate->space_allowed)
 		{
 			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -2891,6 +3035,11 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 
 	/* We are cleared to allocate a new chunk. */
 	chunk_shared = dsa_allocate(hashtable->area, chunk_size);
+
+	/*
+	 * TODO: if batch 0 will have stripes, need to account for this memory
+	 * there
+	 */
 	hashtable->batches[curbatch].shared->size += chunk_size;
 	hashtable->batches[curbatch].at_least_one_chunk = true;
 
@@ -2960,20 +3109,35 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
+
+		shared->hashloop_fallback = false;
+		/* TODO: is it okay to use the same tranche for this lock? */
+		LWLockInitialize(&shared->lock, LWTRANCHE_PARALLEL_HASH_JOIN);
+		shared->maximum_stripe_number = 0;
+		shared->estimated_stripe_size = 0;
 
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
 		 */
 		BarrierInit(&shared->batch_barrier, 0);
+		BarrierInit(&shared->stripe_barrier, 0);
+
+		/* Batch 0 doesn't need to be loaded. */
 		if (i == 0)
 		{
-			/* Batch 0 doesn't need to be loaded. */
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_STRIPING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
+
+			BarrierAttach(&shared->stripe_barrier);
+			while (BarrierPhase(&shared->stripe_barrier) < PHJ_STRIPE_PROBING)
+				BarrierArriveAndWait(&shared->stripe_barrier, 0);
+			BarrierDetach(&shared->stripe_barrier);
 		}
 
 		/* Initialize accessor state.  All members were zero-initialized. */
@@ -2985,7 +3149,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2995,10 +3159,13 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->sbfileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3047,8 +3214,8 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	 * It's possible for a backend to start up very late so that the whole
 	 * join is finished and the shm state for tracking batches has already
 	 * been freed by ExecHashTableDetach().  In that case we'll just leave
-	 * hashtable->batches as NULL so that ExecParallelHashJoinNewBatch() gives
-	 * up early.
+	 * hashtable->batches as NULL so that ExecParallelHashJoinAdvanceBatch()
+	 * gives up early.
 	 */
 	if (!DsaPointerIsValid(pstate->batches))
 		return;
@@ -3070,6 +3237,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
@@ -3083,6 +3251,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->sbfileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3165,6 +3334,18 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 	}
 }
 
+bool
+ExecHashTableDetachStripe(HashJoinTable hashtable)
+{
+	int			curbatch = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+
+	BarrierDetach(stripe_barrier);
+	hashtable->curstripe = -1;
+	return false;
+}
+
 /*
  * Detach from all shared resources.  If we are last to detach, clean up.
  */
@@ -3350,13 +3531,35 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 	{
 		/*
 		 * We have determined that this batch would exceed the space budget if
-		 * loaded into memory.  Command all participants to help repartition.
+		 * loaded into memory.
 		 */
-		batch->shared->space_exhausted = true;
-		pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
-		LWLockRelease(&pstate->lock);
-
-		return false;
+		/* TODO: the nested lock is a deadlock waiting to happen. */
+		LWLockAcquire(&batch->shared->lock, LW_EXCLUSIVE);
+		if (!batch->shared->hashloop_fallback)
+		{
+			/*
+			 * This batch is not marked to fall back so command all
+			 * participants to help repartition.
+			 */
+			batch->shared->space_exhausted = true;
+			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
+			LWLockRelease(&batch->shared->lock);
+			LWLockRelease(&pstate->lock);
+			return false;
+		}
+		else if (batch->shared->estimated_stripe_size + want +
+				 HASH_CHUNK_HEADER_SIZE > pstate->space_allowed)
+		{
+			/*
+			 * This batch is marked to fall back and the current (last) stripe
+			 * does not have enough space to handle the request so we must
+			 * increment the number of stripes in the batch and reset the size
+			 * of its new last stripe.
+			 */
+			batch->shared->maximum_stripe_number++;
+			batch->shared->estimated_stripe_size = 0;
+		}
+		LWLockRelease(&batch->shared->lock);
 	}
 
 	batch->at_least_one_chunk = true;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index cc8edacdd0..b87d32ad8e 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -126,7 +126,7 @@
 #define HJ_SCAN_BUCKET			3
 #define HJ_FILL_OUTER_TUPLE		4
 #define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_NEED_NEW_STRIPE      6
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +143,91 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
+static int	ExecHashJoinLoadStripe(HashJoinState *hjstate);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+static bool ExecParallelHashJoinLoadStripe(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool checkbit(HashJoinState *hjstate);
+static void set_match_bit(HashJoinState *hjstate);
 
+static pg_attribute_always_inline bool
+			IsHashloopFallback(HashJoinTable hashtable);
+
+#define UINT_BITS (sizeof(unsigned int) * CHAR_BIT)
+
+static void
+set_match_bit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	BufFile    *statusFile = hashtable->hashloop_fallback[hashtable->curbatch];
+	int			tupindex = hjstate->hj_CurNumOuterTuples - 1;
+	size_t		unit_size = sizeof(hjstate->hj_CurOuterMatchStatus);
+	off_t		offset = tupindex / UINT_BITS * unit_size;
+
+	int			fileno;
+	off_t		cursor;
+
+	BufFileTell(statusFile, &fileno, &cursor);
+
+	/* Extend the statusFile if this is stripe zero. */
+	if (hashtable->curstripe == 0)
+	{
+		for (; cursor < offset + unit_size; cursor += unit_size)
+		{
+			hjstate->hj_CurOuterMatchStatus = 0;
+			BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+		}
+	}
+
+	if (cursor != offset)
+		BufFileSeek(statusFile, 0, offset, SEEK_SET);
+
+	BufFileRead(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+	BufFileSeek(statusFile, 0, -unit_size, SEEK_CUR);
+
+	hjstate->hj_CurOuterMatchStatus |= 1U << tupindex % UINT_BITS;
+	BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+}
+
+/* return true if bit is set and false if not */
+static bool
+checkbit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *outer_match_statuses;
+
+	int			bitno = hjstate->hj_EmitOuterTupleId % UINT_BITS;
+
+	hjstate->hj_EmitOuterTupleId++;
+	outer_match_statuses = hjstate->hj_HashTable->hashloop_fallback[curbatch];
+
+	/*
+	 * if current chunk of bitmap is exhausted, read next chunk of bitmap from
+	 * outer_match_status_file
+	 */
+	if (bitno == 0)
+		BufFileRead(outer_match_statuses, &hjstate->hj_CurOuterMatchStatus,
+					sizeof(hjstate->hj_CurOuterMatchStatus));
+
+	/*
+	 * check if current tuple's match bit is set in outer match status file
+	 */
+	return hjstate->hj_CurOuterMatchStatus & (1U << bitno);
+}
+
+static bool
+IsHashloopFallback(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state)
+		return hashtable->batches[hashtable->curbatch].shared->hashloop_fallback;
+
+	if (!hashtable->hashloop_fallback)
+		return false;
+
+	return hashtable->hashloop_fallback[hashtable->curbatch];
+}
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -290,6 +371,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				hashNode->hashtable = hashtable;
 				(void) MultiExecProcNode((PlanState *) hashNode);
 
+				/*
+				 * After building the hashtable, stripe 0 of batch 0 will have
+				 * been loaded.
+				 */
+				hashtable->curstripe = 0;
+
 				/*
 				 * If the inner relation is completely empty, and we're not
 				 * doing a left outer join, we can quit without scanning the
@@ -333,12 +420,11 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 					/* Each backend should now select a batch to work on. */
 					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
 
-					continue;
+					if (!ExecParallelHashJoinNewBatch(node))
+						return NULL;
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
 				/* FALL THRU */
 
@@ -365,12 +451,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
 					}
 					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
+
+				/*
+				 * Don't reset hj_MatchedOuter after the first stripe as it
+				 * would cancel out whatever we found before
+				 */
+				if (node->hj_HashTable->curstripe == 0)
+					node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -386,9 +478,15 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * The tuple might not belong to the current batch (where
 				 * "current batch" includes the skew buckets if any).
+				 *
+				 * This should only be done once per tuple per batch. If a
+				 * batch "falls back", its inner side will be split into
+				 * stripes. Any displaced outer tuples should only be
+				 * relocated while probing the first stripe of the inner side.
 				 */
 				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO &&
+					node->hj_HashTable->curstripe == 0)
 				{
 					bool		shouldFree;
 					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
@@ -410,6 +508,13 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				/*
+				 * While probing the phantom stripe, don't increment
+				 * hj_CurNumOuterTuples or extend the bitmap
+				 */
+				if (!parallel && hashtable->curstripe != -2)
+					node->hj_CurNumOuterTuples++;
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -455,6 +560,14 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				{
 					node->hj_MatchedOuter = true;
 
+					if (HJ_FILL_OUTER(node) && IsHashloopFallback(hashtable))
+					{
+						if (parallel)
+							sb_setbit(hashtable->batches[hashtable->curbatch].sba, econtext->ecxt_outertuple->tts_tuplenum);
+						else
+							set_match_bit(node);
+					}
+
 					if (parallel)
 					{
 						/*
@@ -508,6 +621,22 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+				if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(node))
+				{
+					if (hashtable->curstripe != -2)
+						continue;
+
+					if (parallel)
+					{
+						ParallelHashJoinBatchAccessor *accessor =
+						&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+						node->hj_MatchedOuter = sb_checkbit(accessor->sba, econtext->ecxt_outertuple->tts_tuplenum);
+					}
+					else
+						node->hj_MatchedOuter = checkbit(node);
+				}
+
 				if (!node->hj_MatchedOuter &&
 					HJ_FILL_OUTER(node))
 				{
@@ -534,7 +663,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
@@ -550,19 +679,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					InstrCountFiltered2(node, 1);
 				break;
 
-			case HJ_NEED_NEW_BATCH:
+			case HJ_NEED_NEW_STRIPE:
 
 				/*
-				 * Try to advance to next batch.  Done if there are no more.
+				 * Try to advance to next stripe. Then try to advance to the
+				 * next batch if there are no more stripes in this batch. Done
+				 * if there are no more batches.
 				 */
 				if (parallel)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
+					if (!ExecParallelHashJoinLoadStripe(node) &&
+						!ExecParallelHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-aware join */
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
+					if (!ExecHashJoinLoadStripe(node) &&
+						!ExecHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-oblivious join */
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
@@ -751,6 +884,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_JoinState = HJ_BUILD_HASHTABLE;
 	hjstate->hj_MatchedOuter = false;
 	hjstate->hj_OuterNotEmpty = false;
+	hjstate->hj_CurNumOuterTuples = 0;
+	hjstate->hj_CurOuterMatchStatus = 0;
 
 	return hjstate;
 }
@@ -917,15 +1052,24 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	}
 	else if (curbatch < hashtable->nbatch)
 	{
+		tupleMetadata metadata;
 		MinimalTuple tuple;
 
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+									   &metadata);
+		*hashvalue = metadata.hashvalue;
+
 		if (tuple != NULL)
 		{
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+
+			/*
+			 * TODO: should we use tupleid instead of position in the serial
+			 * case too?
+			 */
+			hjstate->hj_OuterTupleSlot->tts_tuplenum = metadata.tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
@@ -949,24 +1093,37 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
+	BufFile    *innerFile = NULL;
+	BufFile    *outerFile = NULL;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
 
-	if (curbatch > 0)
+	/*
+	 * We no longer need the previous outer batch file; close it right away to
+	 * free disk space.
+	 */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
-		/*
-		 * We no longer need the previous outer batch file; close it right
-		 * away to free disk space.
-		 */
-		if (hashtable->outerBatchFile[curbatch])
-			BufFileClose(hashtable->outerBatchFile[curbatch]);
+		BufFileClose(hashtable->outerBatchFile[curbatch]);
 		hashtable->outerBatchFile[curbatch] = NULL;
 	}
-	else						/* we just finished the first batch */
+	if (IsHashloopFallback(hashtable))
+	{
+		BufFileClose(hashtable->hashloop_fallback[curbatch]);
+		hashtable->hashloop_fallback[curbatch] = NULL;
+	}
+
+	/*
+	 * We are surely done with the inner batch file now
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+	{
+		BufFileClose(hashtable->innerBatchFile[curbatch]);
+		hashtable->innerBatchFile[curbatch] = NULL;
+	}
+
+	if (curbatch == 0)			/* we just finished the first batch */
 	{
 		/*
 		 * Reset some of the skew optimization state variables, since we no
@@ -1030,45 +1187,68 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	hashtable->curstripe = -1;
+	hjstate->hj_CurNumOuterTuples = 0;
 
-	/*
-	 * Reload the hash table with the new inner batch (which could be empty)
-	 */
-	ExecHashTableReset(hashtable);
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+		innerFile = hashtable->innerBatchFile[curbatch];
+
+	if (innerFile && BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
 
-	innerFile = hashtable->innerBatchFile[curbatch];
+	/* Need to rewind outer when this is the first stripe of a new batch */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
+		outerFile = hashtable->outerBatchFile[curbatch];
+
+	if (outerFile && BufFileSeek(outerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	ExecHashJoinLoadStripe(hjstate);
+	return true;
+}
+
+static inline void
+InstrIncrBatchStripes(List *fallback_batches_stats, int curbatch)
+{
+	ListCell   *lc;
 
-	if (innerFile != NULL)
+	foreach(lc, fallback_batches_stats)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
+		FallbackBatchStats *fallback_batch_stats = lfirst(lc);
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
-												 &hashvalue,
-												 hjstate->hj_HashTupleSlot)))
+		if (fallback_batch_stats->batchno == curbatch)
 		{
-			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
-			 */
-			ExecHashTableInsert(hashtable, slot, hashvalue);
+			fallback_batch_stats->numstripes++;
+			break;
 		}
-
-		/*
-		 * after we build the hash table, the inner batch file is no longer
-		 * needed
-		 */
-		BufFileClose(innerFile);
-		hashtable->innerBatchFile[curbatch] = NULL;
 	}
+}
+
+/*
+ * Returns false when the inner batch file is exhausted
+ */
+static int
+ExecHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+	bool		loaded_inner = false;
+
+	if (hashtable->curstripe == -2)
+		return false;
 
 	/*
 	 * Rewind outer batch file (if present), so that we can start reading it.
+	 * TODO: This is only necessary if this is not the first stripe of the
+	 * batch
 	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
 		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
 			ereport(ERROR,
@@ -1076,9 +1256,78 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
 
-	return true;
+	hashtable->curstripe++;
+
+	if (!hashtable->innerBatchFile || !hashtable->innerBatchFile[curbatch])
+		return false;
+
+	/*
+	 * Reload the hash table with the new inner stripe
+	 */
+	ExecHashTableReset(hashtable);
+
+	while ((slot = ExecHashJoinGetSavedTuple(hjstate,
+											 hashtable->innerBatchFile[curbatch],
+											 &hashvalue,
+											 hjstate->hj_HashTupleSlot)))
+	{
+		/*
+		 * NOTE: some tuples may be sent to future batches.  Also, it is
+		 * possible for hashtable->nbatch to be increased here!
+		 */
+		uint32		hashTupleSize;
+		/*
+		 * TODO: wouldn't it be cool if this returned the size of the tuple
+		 * inserted
+		 */
+		ExecHashTableInsert(hashtable, slot, hashvalue);
+		loaded_inner = true;
+
+		if (!IsHashloopFallback(hashtable))
+			continue;
+
+		hashTupleSize = slot->tts_ops->get_minimal_tuple(slot)->t_len + HJTUPLE_OVERHEAD;
+
+		if (hashtable->spaceUsed + hashTupleSize +
+			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
+			> hashtable->spaceAllowed)
+			break;
+	}
+
+	/*
+	 * if we didn't load anything and it is a FOJ/LOJ fallback batch, we will
+	 * transition to emit unmatched outer tuples next. we want to know how
+	 * many tuples were in the batch in that case, so don't zero it out then
+	 */
+
+	/*
+	 * if we loaded anything into the hashtable or it is the phantom stripe,
+	 * must proceed to probing
+	 */
+	if (loaded_inner)
+	{
+		hjstate->hj_CurNumOuterTuples = 0;
+		InstrIncrBatchStripes(hashtable->fallback_batches_stats, curbatch);
+		return true;
+	}
+
+	if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(hjstate))
+	{
+		/*
+		 * if we didn't load anything and it is a fallback batch, we will
+		 * prepare to emit outer tuples during the phantom stripe probing
+		 */
+		hashtable->curstripe = -2;
+		hjstate->hj_EmitOuterTupleId = 0;
+		hjstate->hj_CurOuterMatchStatus = 0;
+		BufFileSeek(hashtable->hashloop_fallback[curbatch], 0, 0, SEEK_SET);
+		BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET);
+		return true;
+	}
+	return false;
 }
 
+
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
  * false if there are no more batches.
@@ -1101,10 +1350,18 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
 	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_STRIPING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_STRIPING
+	 * case). The PHJ_BATCH_STRIPING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
 	 */
+
 	if (hashtable->curbatch >= 0)
 	{
+		if (IsHashloopFallback(hashtable))
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
 		hashtable->batches[hashtable->curbatch].done = true;
 		ExecHashTableDetachBatch(hashtable);
 	}
@@ -1119,13 +1376,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		hashtable->nbatch;
 	do
 	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
 		if (!hashtable->batches[batchno].done)
 		{
-			SharedTuplestoreAccessor *inner_tuples;
 			Barrier    *batch_barrier =
 			&hashtable->batches[batchno].shared->batch_barrier;
 
@@ -1136,7 +1388,15 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
 											 WAIT_EVENT_HASH_BATCH_ELECTING))
+					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
+
+						/*
+						 * one worker needs to 0 out the read_pages of all the
+						 * participants in the new batch
+						 */
+						sts_reinitialize(hashtable->batches[batchno].inner_tuples);
+					}
 					/* Fall through. */
 
 				case PHJ_BATCH_ALLOCATING:
@@ -1145,40 +1405,15 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 										 WAIT_EVENT_HASH_BATCH_ALLOCATING);
 					/* Fall through. */
 
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOADING);
-					/* Fall through. */
+				case PHJ_BATCH_STRIPING:
 
-				case PHJ_BATCH_PROBING:
-
-					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
-					 */
 					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
+					sts_begin_parallel_scan(hashtable->batches[batchno].inner_tuples);
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_initialize_accessor(hashtable->batches[hashtable->curbatch].sba,
+											   sts_get_tuplenum(hashtable->batches[hashtable->curbatch].outer_tuples));
+					hashtable->curstripe = -1;
+					ExecParallelHashJoinLoadStripe(hjstate);
 					return true;
 
 				case PHJ_BATCH_DONE:
@@ -1203,6 +1438,224 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	return false;
 }
 
+
+
+/*
+ * Returns true if ready to probe and false if the inner is exhausted
+ * (there are no more stripes)
+ */
+bool
+ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			batchno = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+	ParallelHashJoinBatchAccessor *accessor;
+	dsa_pointer_atomic *buckets;
+
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	if (hashtable->curstripe >= 0)
+	{
+		BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_PROBING);
+	}
+	else if (hashtable->curstripe == -1)
+	{
+		int			phase = BarrierAttach(stripe_barrier);
+
+		/*
+		 * If a worker enters this phase machine on a stripe number greater
+		 * than the batch's maximum stripe number, then: 1) The batch is done,
+		 * or 2) The batch is on the phantom stripe that's used for hashloop
+		 * fallback Either way the worker can't contribute so just detach and
+		 * move on.
+		 */
+		if (PHJ_STRIPE_NUMBER(phase) > batch->maximum_stripe_number)
+			return ExecHashTableDetachStripe(hashtable);
+
+		hashtable->curstripe = PHJ_STRIPE_NUMBER(phase);
+	}
+	else if (hashtable->curstripe == -2)
+	{
+		sts_end_parallel_scan(outer_tuples);
+		/*
+		 * TODO: ideally this would go somewhere in the batch phase machine
+		 * Putting it in ExecHashTableDetachBatch didn't do the trick
+		 */
+		sb_end_read(hashtable->batches[batchno].sba);
+		return ExecHashTableDetachStripe(hashtable);
+	}
+
+	/*
+	 * The outer side is exhausted and either 1) the current stripe of the
+	 * inner side is exhausted and it is time to advance the stripe 2) the
+	 * last stripe of the inner side is exhausted and it is time to advance
+	 * the batch
+	 */
+	for (;;)
+	{
+		int			phase = BarrierPhase(stripe_barrier);
+
+		switch (PHJ_STRIPE_PHASE(phase))
+		{
+			case PHJ_STRIPE_ELECTING:
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_ELECTING))
+				{
+					sts_reinitialize(outer_tuples);
+
+					/*
+					 * set the rewound flag back to false to prepare for the
+					 * next stripe
+					 */
+					sts_reset_rewound(inner_tuples);
+				}
+
+				/* Fall through. */
+
+			case PHJ_STRIPE_RESETTING:
+				/* TODO: not needed for phantom stripe */
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_RESETTING);
+
+			case PHJ_STRIPE_LOADING:
+				{
+					MinimalTuple tuple;
+					tupleMetadata metadata;
+
+					/*
+					 * Start (or join in) loading the next stripe of inner
+					 * tuples.
+					 */
+
+					/*
+					 * I'm afraid there potential issue if a worker joins in
+					 * this phase and doesn't do the actions and resetting of
+					 * variables in sts_resume_parallel_scan. that is, if it
+					 * doesn't reset start_page and read_next_page in between
+					 * stripes. For now, call it. However, I think it might be
+					 * able to be removed.
+					 */
+
+					/*
+					 * TODO: sts_resume_parallel_scan() is overkill for stripe
+					 * 0 of each batch
+					 */
+					sts_resume_parallel_scan(inner_tuples);
+
+					while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+					{
+						/* The tuple is from a previous stripe. Skip it */
+						if (metadata.stripe < PHJ_STRIPE_NUMBER(phase))
+							continue;
+
+						/*
+						 * tuple from future. time to back out read_page. end
+						 * of stripe
+						 */
+						if (metadata.stripe > PHJ_STRIPE_NUMBER(phase))
+						{
+							sts_parallel_scan_rewind(inner_tuples);
+							continue;
+						}
+
+						ExecForceStoreMinimalTuple(tuple, hjstate->hj_HashTupleSlot, false);
+						ExecParallelHashTableInsertCurrentBatch(
+																hashtable,
+																hjstate->hj_HashTupleSlot,
+																metadata.hashvalue);
+					}
+					BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOADING);
+					/* Fall through. */
+				}
+
+			case PHJ_STRIPE_PROBING:
+
+				/*
+				 * do this again here in case a worker began the scan and then
+				 * entered after loading before probing
+				 */
+				sts_end_parallel_scan(inner_tuples);
+				sts_begin_parallel_scan(outer_tuples);
+				return true;
+
+			case PHJ_STRIPE_DONE:
+
+				if (PHJ_STRIPE_NUMBER(phase) >= batch->maximum_stripe_number)
+				{
+					/*
+					 * Handle the phantom stripe case.
+					 */
+					if (batch->hashloop_fallback && HJ_FILL_OUTER(hjstate))
+						goto fallback_stripe;
+
+					/* Return if this is the last stripe */
+					return ExecHashTableDetachStripe(hashtable);
+				}
+
+				/* this, effectively, increments the stripe number */
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOADING))
+				{
+					/*
+					 * reset inner's hashtable and recycle the existing bucket array.
+					 */
+					buckets = (dsa_pointer_atomic *)
+						dsa_get_address(hashtable->area, batch->buckets);
+
+					for (size_t i = 0; i < hashtable->nbuckets; ++i)
+						dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+				}
+
+				hashtable->curstripe++;
+				continue;
+
+			default:
+				elog(ERROR, "unexpected stripe phase %d. pid %i. batch %i.", BarrierPhase(stripe_barrier), MyProcPid, batchno);
+		}
+	}
+
+fallback_stripe:
+	accessor = &hashtable->batches[hashtable->curbatch];
+	sb_end_write(accessor->sba);
+
+	/* Ensure that only a single worker is attached to the barrier */
+	if (!BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOADING))
+		return ExecHashTableDetachStripe(hashtable);
+
+
+	/* No one except the last worker will run this code */
+	hashtable->curstripe = -2;
+
+	/*
+	 * reset inner's hashtable and recycle the existing bucket array.
+	 */
+	buckets = (dsa_pointer_atomic *)
+		dsa_get_address(hashtable->area, batch->buckets);
+
+	for (size_t i = 0; i < hashtable->nbuckets; ++i)
+		dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+
+	/*
+	 * If all workers (including this one) have finished probing the batch,
+	 * one worker is elected to Loop through the outer match status files from
+	 * all workers that were attached to this batch Combine them into one
+	 * bitmap Use the bitmap, loop through the outer batch file again, and
+	 * emit unmatched tuples All workers will detach from the batch barrier
+	 * and the last worker will clean up the hashtable. All workers except the
+	 * last worker will end their scans of the outer and inner side. The last
+	 * worker will end its scan of the inner side
+	 */
+
+	sb_combine(accessor->sba);
+	sts_reinitialize(outer_tuples);
+
+	sts_begin_parallel_scan(outer_tuples);
+
+	return true;
+}
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1372,6 +1825,9 @@ ExecReScanHashJoin(HashJoinState *node)
 	node->hj_MatchedOuter = false;
 	node->hj_FirstOuterTupleSlot = NULL;
 
+	node->hj_CurNumOuterTuples = 0;
+	node->hj_CurOuterMatchStatus = 0;
+
 	/*
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
@@ -1402,7 +1858,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	TupleTableSlot *slot;
-	uint32		hashvalue;
 	int			i;
 
 	Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1410,6 +1865,8 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
+
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1418,17 +1875,23 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 								 hjstate->hj_OuterHashKeys,
 								 true,	/* outer tuple */
 								 HJ_FILL_OUTER(hjstate),
-								 &hashvalue))
+								 &metadata.hashvalue))
 		{
 			int			batchno;
 			int			bucketno;
 			bool		shouldFree;
+			SharedTuplestoreAccessor *accessor;
+
 			MinimalTuple mintup = ExecFetchSlotMinimalTuple(slot, &shouldFree);
 
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
-			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+			accessor = hashtable->batches[batchno].outer_tuples;
+
+			/* cannot count on deterministic order of tupleids */
+			metadata.tupleid = sts_increment_ntuples(accessor);
+
+			sts_puttuple(hashtable->batches[batchno].outer_tuples, &metadata.hashvalue, mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
@@ -1494,6 +1957,7 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 
 	/* Set up the space we'll use for shared temporary files. */
 	SharedFileSetInit(&pstate->fileset, pcxt->seg);
+	SharedFileSetInit(&pstate->sbfileset, pcxt->seg);
 
 	/* Initialize the shared state in the hash node. */
 	hashNode = (HashState *) innerPlanState(state);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f8105c6eb..c1ad92930e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3780,8 +3780,17 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_ELECTING:
 			event_name = "Hash/Batch/Electing";
 			break;
-		case WAIT_EVENT_HASH_BATCH_LOADING:
-			event_name = "Hash/Batch/Loading";
+		case WAIT_EVENT_HASH_STRIPE_ELECTING:
+			event_name = "Hash/Stripe/Electing";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_RESETTING:
+			event_name = "Hash/Stripe/RESETTING";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_LOADING:
+			event_name = "Hash/Stripe/Loading";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_PROBING:
+			event_name = "Hash/Stripe/Probing";
 			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATING:
 			event_name = "Hash/Build/Allocating";
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261..f11fe85aeb 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	logtape.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedbits.c b/src/backend/utils/sort/sharedbits.c
new file mode 100644
index 0000000000..37df04844e
--- /dev/null
+++ b/src/backend/utils/sort/sharedbits.c
@@ -0,0 +1,285 @@
+#include "postgres.h"
+#include "storage/buffile.h"
+#include "utils/sharedbits.h"
+
+/*
+ * TODO: put a comment about not currently supporting parallel scan of the SharedBits
+ * To support parallel scan, need to introduce many more mechanisms
+ */
+
+/* Per-participant shared state */
+struct SharedBitsParticipant
+{
+	bool		present;
+	bool		writing;
+};
+
+/* Shared control object */
+struct SharedBits
+{
+	int			nparticipants;	/* Number of participants that can write. */
+	int64		nbits;
+	char		name[NAMEDATALEN];	/* A name for this bitstore. */
+
+	SharedBitsParticipant participants[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* backend-local state */
+struct SharedBitsAccessor
+{
+	int			participant;
+	SharedBits *bits;
+	SharedFileSet *fileset;
+	BufFile    *write_file;
+	BufFile    *combined;
+};
+
+SharedBitsAccessor *
+sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset)
+{
+	SharedBitsAccessor *accessor = palloc0(sizeof(SharedBitsAccessor));
+
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+SharedBitsAccessor *
+sb_initialize(SharedBits *sbits,
+			  int participants,
+			  int my_participant_number,
+			  SharedFileSet *fileset,
+			  char *name)
+{
+	SharedBitsAccessor *accessor;
+
+	sbits->nparticipants = participants;
+	strcpy(sbits->name, name);
+	sbits->nbits = 0;			/* TODO: maybe delete this */
+
+	accessor = palloc0(sizeof(SharedBitsAccessor));
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+/*  TODO: is "initialize_accessor" a clear enough API for this? (making the file)? */
+void
+sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits)
+{
+	char		name[MAXPGPATH];
+	uint32		num_to_write;
+
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, accessor->participant);
+
+	accessor->write_file =
+		BufFileCreateShared(accessor->fileset, name);
+
+	accessor->bits->participants[accessor->participant].present = true;
+	/* TODO: check this math. tuplenumber will be too high? */
+	num_to_write = nbits / 8 + 1;
+
+	/*
+	 * TODO: add tests that could exercise a problem with junk being written
+	 * to bitmap
+	 */
+
+	/*
+	 * TODO: is there a better way to write the bytes to the file without
+	 * calling BufFileWrite() like this? palloc()ing an undetermined number of
+	 * bytes feels like it is against the spirit of this patch to begin with,
+	 * but the many function calls seem expensive
+	 */
+	for (int i = 0; i < num_to_write; i++)
+	{
+		unsigned char byteToWrite = 0;
+
+		BufFileWrite(accessor->write_file, &byteToWrite, 1);
+	}
+
+	if (BufFileSeek(accessor->write_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+size_t
+sb_estimate(int participants)
+{
+	return offsetof(SharedBits, participants) + participants * sizeof(SharedBitsParticipant);
+}
+
+
+void
+sb_setbit(SharedBitsAccessor *accessor, uint64 bit)
+{
+	SharedBitsParticipant *const participant =
+	&accessor->bits->participants[accessor->participant];
+
+	/* TODO: use an unsigned int instead of a byte */
+	unsigned char current_outer_byte;
+
+	Assert(accessor->write_file);
+
+	if (!participant->writing)
+	{
+		participant->writing = true;
+	}
+
+	BufFileSeek(accessor->write_file, 0, bit / 8, SEEK_SET);
+	BufFileRead(accessor->write_file, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (bit % 8);
+
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_CUR);
+	BufFileWrite(accessor->write_file, &current_outer_byte, 1);
+}
+
+bool
+sb_checkbit(SharedBitsAccessor *accessor, uint32 n)
+{
+	bool		match;
+	uint32		bytenum = n / 8;
+	unsigned char bit = n % 8;
+	unsigned char byte_to_check = 0;
+
+	Assert(accessor->combined);
+
+	/* seek to byte to check */
+	if (BufFileSeek(accessor->combined,
+					0,
+					bytenum,
+					SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not rewind shared outer temporary file: %m")));
+	/* read byte containing ntuple bit */
+	if (BufFileRead(accessor->combined, &byte_to_check, 1) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	/* if bit is set */
+	match = ((byte_to_check) >> bit) & 1;
+
+	return match;
+}
+
+BufFile *
+sb_combine(SharedBitsAccessor *accessor)
+{
+	/*
+	 * TODO: this tries to close an outer match status file for each
+	 * participant in the tuplestore. technically, only participants in the
+	 * barrier could have outer match status files, however, all but one
+	 * participant continue on and detach from the barrier so we won't have a
+	 * reliable way to close only files for those attached to the barrier
+	 */
+	BufFile   **statuses;
+	BufFile    *combined_bitmap_file;
+	int			statuses_length;
+
+	int			nbparticipants = 0;
+
+	for (int l = 0; l < accessor->bits->nparticipants; l++)
+	{
+		SharedBitsParticipant participant = accessor->bits->participants[l];
+
+		if (participant.present)
+		{
+			Assert(!participant.writing);
+			nbparticipants++;
+		}
+	}
+	statuses = palloc(sizeof(BufFile *) * nbparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	statuses_length = 0;
+
+	for (int i = 0; i < accessor->bits->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+		BufFile    *file;
+
+		/* TODO: make a function that will do this */
+		snprintf(bitmap_filename, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, i);
+
+		if (!accessor->bits->participants[i].present)
+			continue;
+		file = BufFileOpenShared(accessor->fileset, bitmap_filename);
+
+		Assert(file);
+
+		statuses[statuses_length++] = file;
+	}
+
+	combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)	/* make it while not EOF */
+	{
+		/*
+		 * TODO: make this use an unsigned int instead of a byte so it isn't
+		 * so slow
+		 */
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	accessor->combined = combined_bitmap_file;
+	return combined_bitmap_file;
+}
+
+void
+sb_end_write(SharedBitsAccessor *sba)
+{
+	SharedBitsParticipant
+			   *const participant = &sba->bits->participants[sba->participant];
+
+	participant->writing = false;
+
+	/*
+	 * TODO: this should not be needed if flow is correct. need to fix that
+	 * and get rid of this check
+	 */
+	if (sba->write_file)
+		BufFileClose(sba->write_file);
+	sba->write_file = NULL;
+}
+
+void
+sb_end_read(SharedBitsAccessor *accessor)
+{
+	if (accessor->combined == NULL)
+		return;
+
+	BufFileClose(accessor->combined);
+	accessor->combined = NULL;
+}
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index c3ab494a45..0e3b3de2b6 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -52,6 +52,7 @@ typedef struct SharedTuplestoreParticipant
 {
 	LWLock		lock;
 	BlockNumber read_page;		/* Page number for next read. */
+	bool		rewound;
 	BlockNumber npages;			/* Number of pages written. */
 	bool		writing;		/* Used only for assertions. */
 } SharedTuplestoreParticipant;
@@ -60,6 +61,7 @@ typedef struct SharedTuplestoreParticipant
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;	/* Number of tuples in this tuplestore. */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -85,6 +87,8 @@ struct SharedTuplestoreAccessor
 	char	   *read_buffer;	/* A buffer for loading tuples. */
 	size_t		read_buffer_size;
 	BlockNumber read_next_page; /* Lowest block we'll consider reading. */
+	BlockNumber start_page;		/* page to reset p->read_page to if back out
+								 * required */
 
 	/* State for writing. */
 	SharedTuplestoreChunk *write_chunk; /* Buffer for writing. */
@@ -137,6 +141,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -158,6 +163,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 		LWLockInitialize(&sts->participants[i].lock,
 						 LWTRANCHE_SHARED_TUPLESTORE);
 		sts->participants[i].read_page = 0;
+		sts->participants[i].rewound = false;
 		sts->participants[i].writing = false;
 	}
 
@@ -277,6 +283,45 @@ sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor)
 	accessor->read_participant = accessor->participant;
 	accessor->read_file = NULL;
 	accessor->read_next_page = 0;
+	accessor->start_page = 0;
+}
+
+void
+sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor)
+{
+	int			i PG_USED_FOR_ASSERTS_ONLY;
+	SharedTuplestoreParticipant *p;
+
+	/* End any existing scan that was in progress. */
+	sts_end_parallel_scan(accessor);
+
+	/*
+	 * Any backend that might have written into this shared tuplestore must
+	 * have called sts_end_write(), so that all buffers are flushed and the
+	 * files have stopped growing.
+	 */
+	for (i = 0; i < accessor->sts->nparticipants; ++i)
+		Assert(!accessor->sts->participants[i].writing);
+
+	/*
+	 * We will start out reading the file that THIS backend wrote.  There may
+	 * be some caching locality advantage to that.
+	 */
+
+	/*
+	 * TODO: does this still apply in the multi-stripe case? It seems like if
+	 * a participant file is exhausted for the current stripe it might be
+	 * better to remember that
+	 */
+	accessor->read_participant = accessor->participant;
+	accessor->read_file = NULL;
+	p = &accessor->sts->participants[accessor->read_participant];
+
+	/* TODO: find a better solution than this for resuming the parallel scan */
+	LWLockAcquire(&p->lock, LW_SHARED);
+	accessor->start_page = p->read_page;
+	LWLockRelease(&p->lock);
+	accessor->read_next_page = 0;
 }
 
 /*
@@ -295,6 +340,7 @@ sts_end_parallel_scan(SharedTuplestoreAccessor *accessor)
 		BufFileClose(accessor->read_file);
 		accessor->read_file = NULL;
 	}
+	accessor->start_page = 0;
 }
 
 /*
@@ -531,7 +577,13 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	for (;;)
 	{
 		/* Can we read more tuples from the current chunk? */
-		if (accessor->read_ntuples < accessor->read_ntuples_available)
+		/*
+		 * Added a check for accessor->read_file being present here, as it
+		 * became relevant for adaptive hashjoin. Not sure if this has other
+		 * consequences for correctness
+		 */
+
+		if (accessor->read_ntuples < accessor->read_ntuples_available && accessor->read_file)
 			return sts_read_tuple(accessor, meta_data);
 
 		/* Find the location of a new chunk to read. */
@@ -541,7 +593,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 		/* We can skip directly past overflow pages we know about. */
 		if (p->read_page < accessor->read_next_page)
 			p->read_page = accessor->read_next_page;
-		eof = p->read_page >= p->npages;
+		eof = p->read_page >= p->npages || p->rewound;
 		if (!eof)
 		{
 			/* Claim the next chunk. */
@@ -549,9 +601,22 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			/* Advance the read head for the next reader. */
 			p->read_page += STS_CHUNK_PAGES;
 			accessor->read_next_page = p->read_page;
+
+			/*
+			 * initialize start_page to the read_page this participant will
+			 * start reading from
+			 */
+			accessor->start_page = read_page;
 		}
 		LWLockRelease(&p->lock);
 
+		if (!eof)
+		{
+			char		name[MAXPGPATH];
+
+			sts_filename(name, accessor, accessor->read_participant);
+		}
+
 		if (!eof)
 		{
 			SharedTuplestoreChunk chunk_header;
@@ -613,6 +678,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			if (accessor->read_participant == accessor->participant)
 				break;
 			accessor->read_next_page = 0;
+			accessor->start_page = 0;
 
 			/* Go around again, so we can get a chunk from this file. */
 		}
@@ -621,6 +687,48 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+void
+sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor)
+{
+	SharedTuplestoreParticipant *p =
+	&accessor->sts->participants[accessor->read_participant];
+
+	/*
+	 * Only set the read_page back to the start of the sts_chunk this worker
+	 * was reading if some other worker has not already done so. It could be
+	 * the case that this worker saw a tuple from a future stripe and another
+	 * worker did also in its sts_chunk and it already set read_page to its
+	 * start_page If so, we want to set read_page to the lowest value to
+	 * ensure that we read all tuples from the stripe (don't miss tuples)
+	 */
+	LWLockAcquire(&p->lock, LW_EXCLUSIVE);
+	p->read_page = Min(p->read_page, accessor->start_page);
+	p->rewound = true;
+	LWLockRelease(&p->lock);
+
+	accessor->read_ntuples_available = 0;
+	accessor->read_next_page = 0;
+}
+
+void
+sts_reset_rewound(SharedTuplestoreAccessor *accessor)
+{
+	for (int i = 0; i < accessor->sts->nparticipants; ++i)
+		accessor->sts->participants[i].rewound = false;
+}
+
+uint32
+sts_increment_ntuples(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ba661d32a6..0ba9d856c8 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -46,6 +46,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		settings;		/* print modified settings */
+	bool		usage;			/* print memory usage */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
 	int			indent;			/* current indentation level */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 79b634e8ed..9ffcd84806 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -152,6 +153,7 @@ typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
+	Barrier		stripe_barrier; /* synchronization for stripes */
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
@@ -160,6 +162,17 @@ typedef struct ParallelHashJoinBatch
 	size_t		old_ntuples;	/* number of tuples before repartitioning */
 	bool		space_exhausted;
 
+	/* Adaptive HashJoin */
+
+	/*
+	 * after finishing build phase, hashloop_fallback cannot change, and does
+	 * not require a lock to read
+	 */
+	bool		hashloop_fallback;
+	int			maximum_stripe_number;
+	size_t		estimated_stripe_size;	/* size of last stripe in batch */
+	LWLock		lock;
+
 	/*
 	 * Variable-sized SharedTuplestore objects follow this struct in memory.
 	 * See the accessor macros below.
@@ -177,10 +190,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -207,6 +227,7 @@ typedef struct ParallelHashJoinBatchAccessor
 	bool		done;			/* flag to remember that a batch is done */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -251,6 +272,7 @@ typedef struct ParallelHashJoinState
 	pg_atomic_uint32 distributor;	/* counter for load balancing */
 
 	SharedFileSet fileset;		/* space for shared temporary files */
+	SharedFileSet sbfileset;
 } ParallelHashJoinState;
 
 /* The phases for building batches, used by build_barrier. */
@@ -263,9 +285,17 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
-#define PHJ_BATCH_DONE					4
+#define PHJ_BATCH_STRIPING				2
+#define PHJ_BATCH_DONE					3
+
+/* The phases for probing each stripe of each batch used with stripe barriers */
+#define PHJ_STRIPE_ELECTING				0
+#define PHJ_STRIPE_RESETTING			1
+#define PHJ_STRIPE_LOADING				2
+#define PHJ_STRIPE_PROBING				3
+#define PHJ_STRIPE_DONE				    4
+#define PHJ_STRIPE_NUMBER(n)            ((n) / 5)
+#define PHJ_STRIPE_PHASE(n)             ((n) % 5)
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
@@ -313,8 +343,6 @@ typedef struct HashJoinTableData
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
 
-	bool		growEnabled;	/* flag to shut off nbatch increases */
-
 	double		totalTuples;	/* # tuples obtained from inner plan */
 	double		partialTuples;	/* # tuples obtained from inner plan by me */
 	double		skewTuples;		/* # tuples inserted into skew tuples */
@@ -329,6 +357,13 @@ typedef struct HashJoinTableData
 	BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
 	BufFile   **outerBatchFile; /* buffered virtual temp file per batch */
 
+	/*
+	 * Adaptive hashjoin variables
+	 */
+	BufFile   **hashloop_fallback;	/* outer match status files if fall back */
+	List	   *fallback_batches_stats; /* per hashjoin batch statistics */
+	int			curstripe;		/* current stripe #; 0 on 1st pass, -2 on phantom stripe */
+
 	/*
 	 * Info about the datatype-specific hash functions for the datatypes being
 	 * hashed. These are arrays of the same length as the number of hash join
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a97562e7a4..e72bd5702a 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "nodes/pg_list.h"
 
 
 typedef struct BufferUsage
@@ -39,6 +40,12 @@ typedef struct WalUsage
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
+typedef struct FallbackBatchStats
+{
+	int			batchno;
+	int			numstripes;
+} FallbackBatchStats;
+
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
 typedef enum InstrumentOption
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 64d2ce693c..f85308738b 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -31,6 +31,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+extern bool ExecHashTableDetachStripe(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5ab..0c0d87d1d3 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tts_tuplenum;	/* a tuple id for use when ctid cannot be used */
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,6 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
+	slot->tts_tuplenum = 0;		/* TODO: should this be done elsewhere? */
 
 	return slot;
 }
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4fee043bb2..41a4133c3a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1957,6 +1957,10 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+	/* Adaptive Hashjoin variables */
+	int			hj_CurNumOuterTuples;	/* number of outer tuples in a batch */
+	unsigned int hj_CurOuterMatchStatus;
+	int			hj_EmitOuterTupleId;
 } HashJoinState;
 
 
@@ -2359,6 +2363,7 @@ typedef struct HashInstrumentation
 	int			nbatch;			/* number of batches at end of execution */
 	int			nbatch_original;	/* planned number of batches */
 	Size		space_peak;		/* peak memory usage in bytes */
+	List	   *fallback_batches_stats; /* per hashjoin batch stats */
 } HashInstrumentation;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..9ebdeeeb8a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -857,7 +857,10 @@ typedef enum
 	WAIT_EVENT_EXECUTE_GATHER,
 	WAIT_EVENT_HASH_BATCH_ALLOCATING,
 	WAIT_EVENT_HASH_BATCH_ELECTING,
-	WAIT_EVENT_HASH_BATCH_LOADING,
+	WAIT_EVENT_HASH_STRIPE_ELECTING,
+	WAIT_EVENT_HASH_STRIPE_RESETTING,
+	WAIT_EVENT_HASH_STRIPE_LOADING,
+	WAIT_EVENT_HASH_STRIPE_PROBING,
 	WAIT_EVENT_HASH_BUILD_ALLOCATING,
 	WAIT_EVENT_HASH_BUILD_ELECTING,
 	WAIT_EVENT_HASH_BUILD_HASHING_INNER,
diff --git a/src/include/utils/sharedbits.h b/src/include/utils/sharedbits.h
new file mode 100644
index 0000000000..de43279de8
--- /dev/null
+++ b/src/include/utils/sharedbits.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * sharedbits.h
+ *	  Simple mechanism for sharing bits between backends.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/sharedbits.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHAREDBITS_H
+#define SHAREDBITS_H
+
+#include "storage/sharedfileset.h"
+
+struct SharedBits;
+typedef struct SharedBits SharedBits;
+
+struct SharedBitsParticipant;
+typedef struct SharedBitsParticipant SharedBitsParticipant;
+
+struct SharedBitsAccessor;
+typedef struct SharedBitsAccessor SharedBitsAccessor;
+
+extern SharedBitsAccessor *sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset);
+extern SharedBitsAccessor *sb_initialize(SharedBits *sbits, int participants, int my_participant_number, SharedFileSet *fileset, char *name);
+extern void sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits);
+extern size_t sb_estimate(int participants);
+
+extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bit);
+extern bool sb_checkbit(SharedBitsAccessor *accessor, uint32 n);
+extern BufFile *sb_combine(SharedBitsAccessor *accessor);
+
+extern void sb_end_write(SharedBitsAccessor *sba);
+extern void sb_end_read(SharedBitsAccessor *accessor);
+
+#endif							/* SHAREDBITS_H */
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9754504cc5..99aead8a4a 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,17 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			stripe;		/* stripe number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -49,6 +60,8 @@ extern void sts_reinitialize(SharedTuplestoreAccessor *accessor);
 
 extern void sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor);
 
+extern void sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor);
+
 extern void sts_end_parallel_scan(SharedTuplestoreAccessor *accessor);
 
 extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
@@ -58,4 +71,10 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+extern void sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor);
+
+extern void sts_reset_rewound(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_increment_ntuples(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 3a91c144a2..98a90a85e4 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -443,7 +443,7 @@ $$
 $$);
  original | final 
 ----------+-------
-        1 |     2
+        1 |     4
 (1 row)
 
 rollback to settings;
@@ -478,7 +478,7 @@ $$
 $$);
  original | final 
 ----------+-------
-        1 |     2
+        1 |     4
 (1 row)
 
 rollback to settings;
@@ -1013,3 +1013,944 @@ WHERE
 (1 row)
 
 ROLLBACK;
+-- Serial Adaptive Hash Join
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+ANALYZE probeside, hashside_wide;
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Left Join (actual rows=215 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | btrim | id | hash |                 btrim                  
+------+-------+----+------+----------------------------------------
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    3 |       |  3 |    3 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+      |       |  1 |    1 | unmatched inner tuple in first stripe
+      |       |  1 |    1 | unmatched inner tuple in last stripe
+      |       |  1 |    1 | unmatched inner tuple in middle stripe
+(214 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Right Join (actual rows=214 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash |                 btrim                  
+------+-----------------------+----+------+----------------------------------------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+      |                       |  1 |    1 | unmatched inner tuple in first stripe
+      |                       |  1 |    1 | unmatched inner tuple in last stripe
+      |                       |  1 |    1 | unmatched inner tuple in middle stripe
+(218 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Full Join (actual rows=218 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+/*
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+*/
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Anti Join (actual rows=4 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash |         btrim         
+------+-----------------------
+    1 | unmatched outer tuple
+    2 | 
+    5 | 
+    6 | unmatched outer tuple
+(4 rows)
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | batchno |      btrim      | id | ctid  | hash | btrim 
+------+---------+-----------------+----+-------+------+-------
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 | unmatched outer |    |       |      | 
+(118 rows)
+
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 68c1a8c7b6..1f70300d02 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -538,3 +538,130 @@ WHERE
     AND hjtest_1.a <> hjtest_2.b;
 
 ROLLBACK;
+
+-- Serial Adaptive Hash Join
+
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+
+ANALYZE probeside, hashside_wide;
+
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+
+/*
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+*/
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
-- 
2.20.1

#54

Melanie Plageman

melanieplageman@gmail.com

over 5 years ago

In reply to: Melanie Plageman (#53)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

I've attached a rebased patch which includes the "provisionally detach"
deadlock hazard fix approach as well as addresses some of the following
feedback from Jeff Davis provided off-list:

Can you add some high-level comments that describe the algorithm and
what the terms mean?

I added to the large comment at the top of nodeHashjoin.c. I've also
added comments to a few of the new members in some structs. Plus I've
added some in-line comments to assist the reviewer that may or may not
be overkill in a final version.

Can you add some comments to describe what's happening when a batch is
entering fallback mode?

...

Can you add some comments describing tuple relocation?

...

Can you describe somewhere what all the bits for outer matches are for?

All three done.

Also, we kept the batch 0 spilling patch David Kimura authored [1]/messages/by-id/CAHnPFjQiYN83NjQ4KvjX19Wti==uzyw8D24va56zJKzOt+B51A@mail.gmail.com
separate so it could be discussed separately because we still had some
questions.
It would be great to discuss those, however, keeping them separate might
be more confusing -- I'm not sure.

[1]: /messages/by-id/CAHnPFjQiYN83NjQ4KvjX19Wti==uzyw8D24va56zJKzOt+B51A@mail.gmail.com
/messages/by-id/CAHnPFjQiYN83NjQ4KvjX19Wti==uzyw8D24va56zJKzOt+B51A@mail.gmail.com

Attachments:

v8-0001-Implement-Adaptive-Hashjoin.patchapplication/octet-stream; name=v8-0001-Implement-Adaptive-Hashjoin.patchDownload

From d9859f157235e8305430066a968900ed7d34244e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 May 2020 17:17:05 -0700
Subject: [PATCH v8] Implement Adaptive Hashjoin

If the inner side tuples of a hashjoin will not fit in memory, the
hashjoin can be executed in multiple batches. If the statistics on the
inner side relation are accurate, planner chooses a multi-batch
strategy and sets the number of batches.
The query executor measures the real size of the hashtable and increases
the number of batches if the hashtable grows too large.

The number of batches is always a power of two, so an increase in the
number of batches doubles it.

Serial hashjoin measures batch size lazily -- waiting until it is
loading a batch to determine if it will fit in memory.

Parallel hashjoin, on the other hand, completes all changes to the
number of batches during the build phase. If it doubles the number of
batches, it dumps all the tuples out, reassigns them to batches,
measures each batch, and checks that it will fit in the space allowed.

In both cases, the executor currently makes a best effort. If a
particular batch won't fit in memory, and, upon changing the number of
batches none of the tuples move to a new batch, the executor disables
growth in the number of batches globally. After growth is disabled, all
batches that would have previously triggered an increase in the number
of batches instead exceed the space allowed.

There is no mechanism to perform a hashjoin within memory constraints if
a run of tuples hash to the same batch. Also, hashjoin will continue to
double the number of batches if *some* tuples move each time -- even if
the batch will never fit in memory -- resulting in an explosion in the
number of batches (affecting performance negatively for multiple
reasons).

Adaptive hashjoin is a mechanism to process a run of inner side tuples
with join keys which hash to the same batch in a manner that is
efficient and respects the space allowed.

When an offending batch causes the number of batches to be doubled and
some percentage of the tuples would not move to a new batch, that batch
can be marked to "fall back". This mechanism replaces serial hashjoin's
"grow_enabled" flag and replaces part of the functionality of parallel
hashjoin's "growth = PHJ_GROWTH_DISABLED" flag. However, instead of
disabling growth in the number of batches for all batches, it only
prevents this batch from causing another increase in the number of
batches.

When the inner side of this batch is loaded into memory, stripes of
arbitrary tuples totaling work_mem in size are loaded into the
hashtable. After probing this stripe, the outer side batch is rewound
and the next stripe is loaded. Each stripe of inner is probed until all
tuples have been processed.

Tuples that match are emitted (depending on the join semantics of the
particular join type) during probing of a stripe. In order to make
left outer join work, unmatched tuples cannot be emitted NULL-extended
until all stripes have been probed. To address this, a bitmap is created
with a bit for each tuple of the outer side. If a tuple on the outer
side matches a tuple from the inner, the corresponding bit is set. At
the end of probing all stripes, the executor scans the bitmap and emits
unmatched outer tuples.

Note that there is still a separate patch to make Batch 0 fallback to
make it easier to talk about separately.

TODOs:
- Fix semi-join
- Stripe instrumentation for parallel adaptive hashjoin
- Experiment with different fallback threshholds
  (currently hardcoded to 80% but parameterizable)
- Assorted TODOs in the code

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
Co-authored-by: David Kimura <dkimura@pivotal.io>
---
 src/backend/commands/explain.c            |  43 +-
 src/backend/executor/nodeHash.c           | 330 ++++++--
 src/backend/executor/nodeHashjoin.c       | 777 +++++++++++++++---
 src/backend/postmaster/pgstat.c           |  13 +-
 src/backend/storage/ipc/barrier.c         |   2 +-
 src/backend/utils/sort/Makefile           |   1 +
 src/backend/utils/sort/sharedbits.c       | 285 +++++++
 src/backend/utils/sort/sharedtuplestore.c | 112 ++-
 src/include/commands/explain.h            |   1 +
 src/include/executor/hashjoin.h           |  76 +-
 src/include/executor/instrument.h         |   7 +
 src/include/executor/nodeHash.h           |   1 +
 src/include/executor/tuptable.h           |   2 +
 src/include/nodes/execnodes.h             |   5 +
 src/include/pgstat.h                      |   5 +-
 src/include/utils/sharedbits.h            |  39 +
 src/include/utils/sharedtuplestore.h      |  19 +
 src/test/regress/expected/join_hash.out   | 945 +++++++++++++++++++++-
 src/test/regress/sql/join_hash.sql        | 127 +++
 19 files changed, 2617 insertions(+), 173 deletions(-)
 create mode 100644 src/backend/utils/sort/sharedbits.c
 create mode 100644 src/include/utils/sharedbits.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index efd7201d61..76f6da0688 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -184,6 +184,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			es->wal = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "settings") == 0)
 			es->settings = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "usage") == 0)
+			es->usage = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "timing") == 0)
 		{
 			timing_set = true;
@@ -312,6 +314,7 @@ NewExplainState(void)
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
+	es->usage = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
@@ -3022,22 +3025,50 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
 				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)",
 							 hinstrument.nbuckets,
 							 hinstrument.nbuckets_original,
 							 hinstrument.nbatch,
-							 hinstrument.nbatch_original,
-							 spacePeakKb);
+							 hinstrument.nbatch_original);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Batch: %d  Stripes: %d\n", fbs->batchno, fbs->numstripes);
+			}
 		}
 		else
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
-							 hinstrument.nbuckets, hinstrument.nbatch,
-							 spacePeakKb);
+							 "Buckets: %d  Batches: %d",
+							 hinstrument.nbuckets, hinstrument.nbatch);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batch: %d  Stripes: %d\n",
+								 fbs->batchno,
+								 fbs->numstripes);
+			}
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 45b342011f..03fcc5c8bb 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -80,7 +80,6 @@ static bool ExecParallelHashTuplePrealloc(HashJoinTable hashtable,
 static void ExecParallelHashMergeCounters(HashJoinTable hashtable);
 static void ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable);
 
-
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -321,6 +320,27 @@ MultiExecParallelHash(HashState *node)
 				 * skew).
 				 */
 				pstate->growth = PHJ_GROWTH_DISABLED;
+
+				/*
+				 * In the current design, batch 0 cannot fall back. That
+				 * behavior is an artifact of the existing design where batch
+				 * 0 fills the initial hash table and as an optimization it
+				 * doesn't need a batch file. But, there is no real reason
+				 * that batch 0 shouldn't be allowed to spill.
+				 *
+				 * Consider a hash table where majority of tuples with
+				 * hashvalue 0. These tuples will never relocate no matter how
+				 * many batches exist. If you cannot exceed work_mem, then you
+				 * will be stuck infinitely trying to double the number of
+				 * batches in order to accommodate the tuples that can only
+				 * ever be in batch 0. So, we allow it to be set to fall back
+				 * during the build phase to avoid excessive batch increases
+				 * but we don't check it when loading the actual tuples, so we
+				 * may exceed space_allowed. We set it back to false here so
+				 * that it isn't true during any of the checks that may happen
+				 * during probing.
+				 */
+				hashtable->batches[0].shared->hashloop_fallback = false;
 			}
 	}
 
@@ -495,12 +515,14 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->curbatch = 0;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
-	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
 	hashtable->partialTuples = 0;
 	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
+	hashtable->hashloop_fallback = NULL;
+	hashtable->fallback_batches_stats = NULL;
+	hashtable->curstripe = -1;
 	hashtable->spaceUsed = 0;
 	hashtable->spacePeak = 0;
 	hashtable->spaceAllowed = space_allowed;
@@ -572,6 +594,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
@@ -866,6 +890,8 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 				BufFileClose(hashtable->innerBatchFile[i]);
 			if (hashtable->outerBatchFile[i])
 				BufFileClose(hashtable->outerBatchFile[i]);
+			if (hashtable->hashloop_fallback[i])
+				BufFileClose(hashtable->hashloop_fallback[i]);
 		}
 	}
 
@@ -876,6 +902,18 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/*
+ * Threshhold for tuple relocation during batch split for parallel and serial
+ * hashjoin.
+ * While growing the number of batches, for the batch which triggered the growth,
+ * if more than MAX_RELOCATION % of its tuples move to its child batch, then
+ * it likely has skewed data and so the child batch (the new home to the skewed
+ * tuples) will be marked as a "fallback" batch and processed using the hashloop
+ * join algorithm. The reverse is true as well: if more than MAX_RELOCATION
+ * remain in the parent, it too should be marked to "fallback".
+ */
+#define MAX_RELOCATION 0.8
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
@@ -886,14 +924,18 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 {
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
+	int			childbatch;
 	int			nbatch;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
 	HashMemoryChunk oldchunks;
+	int			curbatch_outgoing_tuples;
+	int			childbatch_outgoing_tuples;
+	int			target_batch;
+	FallbackBatchStats *fallback_batch_stats;
 
-	/* do nothing if we've decided to shut off growth */
-	if (!hashtable->growEnabled)
+	if (hashtable->hashloop_fallback && hashtable->hashloop_fallback[curbatch])
 		return;
 
 	/* safety check to avoid overflow */
@@ -917,6 +959,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* time to establish the temp tablespaces, too */
 		PrepareTempTablespaces();
 	}
@@ -927,10 +971,14 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			repalloc(hashtable->innerBatchFile, nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			repalloc(hashtable->outerBatchFile, nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			repalloc(hashtable->hashloop_fallback, nbatch * sizeof(BufFile *));
 		MemSet(hashtable->innerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
 		MemSet(hashtable->outerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
+		MemSet(hashtable->hashloop_fallback + oldnbatch, 0,
+			   (nbatch - oldnbatch) * sizeof(BufFile *));
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -942,6 +990,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * no longer of the current batch.
 	 */
 	ninmemory = nfreed = 0;
+	curbatch_outgoing_tuples = childbatch_outgoing_tuples = 0;
+	childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;
 
 	/* If know we need to resize nbuckets, we can do it while rebatching. */
 	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
@@ -999,6 +1049,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* and add it back to the appropriate bucket */
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
+				curbatch_outgoing_tuples++;
 			}
 			else
 			{
@@ -1010,6 +1061,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 
 				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
+
+				/*
+				 * TODO: what to do about tuples that don't go to the child
+				 * batch or stay in the current batch? (this is why we are
+				 * counting tuples to child and curbatch with two diff
+				 * variables in case the tuples go to a batch that isn't the
+				 * child)
+				 */
+				if (batchno == childbatch)
+					childbatch_outgoing_tuples++;
 			}
 
 			/* next tuple in this chunk */
@@ -1030,21 +1091,39 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 #endif
 
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
-	 * further expansion of nbatch.  This situation implies that we have
-	 * enough tuples of identical hashvalues to overflow spaceAllowed.
-	 * Increasing nbatch will not fix it since there's no way to subdivide the
-	 * group any more finely. We have to just gut it out and hope the server
-	 * has enough RAM.
+	 * For now we do not support fallback in batch 0 as it is a special case
+	 * and assumed to fit in hashtable.
+	 */
+	if (curbatch == 0)
+		return;
+
+	/*
+	 * The same batch should not be marked to fall back more than once
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
-	{
-		hashtable->growEnabled = false;
 #ifdef HJDEBUG
-		printf("Hashjoin %p: disabling further increase of nbatch\n",
-			   hashtable);
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("childbatch %i targeted to fallback.", childbatch);
+	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	}
+	/*
+	 * If too many tuples remain in the parent or too many tuples migrate to the
+	 * child, there is likely skew and continuing to increase the number of batches
+	 * will not help. Mark the batch which contains the skewed tuples to be
+	 * processed with block nested hashloop join.
+	 */
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && childbatch > 0)
+		target_batch = childbatch;
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION && curbatch > 0)
+		target_batch = curbatch;
+	else
+		return;
+	hashtable->hashloop_fallback[target_batch] = BufFileCreateTemp(false);
+
+	fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = target_batch;
+	fallback_batch_stats->numstripes = 0;
+	hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
 }
 
 /*
@@ -1213,7 +1292,6 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 									 WAIT_EVENT_HASH_GROW_BATCHES_DECIDE))
 			{
 				bool		space_exhausted = false;
-				bool		extreme_skew_detected = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
@@ -1224,27 +1302,56 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				{
 					ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
 
+					/*
+					 * All batches were just created anew during
+					 * repartitioning
+					 */
+					Assert(!batch->hashloop_fallback);
+
+					/*
+					 * At the time of repartitioning, each batch updates its
+					 * estimated_size to reflect the size of the batch file on
+					 * disk. It is also updated when increasing preallocated
+					 * space in ExecParallelHashTuplePrealloc().  However,
+					 * batch 0 does not store anything on disk so it has no
+					 * estimated_size.
+					 *
+					 * We still want to allow batch 0 to trigger batch growth.
+					 * In order to do that, for batch 0 check whether the
+					 * actual size exceeds space_allowed. It is a little
+					 * backwards at this point as we would have already
+					 * exceeded inserted the allowed space.
+					 */
 					if (batch->space_exhausted ||
-						batch->estimated_size > pstate->space_allowed)
+						batch->estimated_size > pstate->space_allowed ||
+						batch->size > pstate->space_allowed)
 					{
 						int			parent;
+						float		frac_moved;
 
 						space_exhausted = true;
 
+						parent = i % pstate->old_nbatch;
+						frac_moved = batch->ntuples / (float) hashtable->batches[parent].shared->old_ntuples;
+
 						/*
-						 * Did this batch receive ALL of the tuples from its
-						 * parent batch?  That would indicate that further
-						 * repartitioning isn't going to help (the hash values
-						 * are probably all the same).
+						 * If too many tuples remain in the parent or too many tuples migrate to the
+						 * child, there is likely skew and continuing to increase the number of batches
+						 * will not help. Mark the batch which contains the skewed tuples to be
+						 * processed with block nested hashloop join.
 						 */
-						parent = i % pstate->old_nbatch;
-						if (batch->ntuples == hashtable->batches[parent].shared->old_ntuples)
-							extreme_skew_detected = true;
+						if (frac_moved >= MAX_RELOCATION)
+						{
+							batch->hashloop_fallback = true;
+							space_exhausted = false;
+						}
 					}
+					if (space_exhausted)
+						break;
 				}
 
-				/* Don't keep growing if it's not helping or we'd overflow. */
-				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+				/* Don't keep growing if we'd overflow. */
+				if (hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -1311,11 +1418,28 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 			{
 				size_t		tuple_size =
 				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+				tupleMetadata metadata;
 
 				/* It belongs in a later batch. */
+				ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+
+				LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+				if (batch->estimated_stripe_size + tuple_size > hashtable->parallel_state->space_allowed)
+				{
+					batch->maximum_stripe_number++;
+					batch->estimated_stripe_size = 0;
+				}
+
+				batch->estimated_stripe_size += tuple_size;
+
+				metadata.hashvalue = hashTuple->hashvalue;
+				metadata.stripe = batch->maximum_stripe_number;
+				LWLockRelease(&batch->lock);
+
 				hashtable->batches[batchno].estimated_size += tuple_size;
-				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+
+				sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 			}
 
 			/* Count this tuple. */
@@ -1363,27 +1487,41 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 	for (i = 1; i < old_nbatch; ++i)
 	{
 		MinimalTuple tuple;
-		uint32		hashvalue;
+		tupleMetadata metadata;
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &metadata.hashvalue)))
 		{
 			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
+			ParallelHashJoinBatch *batch;
 
 			/* Decide which partition it goes to in the new generation. */
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
 
 			hashtable->batches[batchno].estimated_size += tuple_size;
 			++hashtable->batches[batchno].ntuples;
 			++hashtable->batches[i].old_ntuples;
 
+			batch = hashtable->batches[batchno].shared;
+
 			/* Store the tuple its new batch. */
-			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+			LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+			if (batch->estimated_stripe_size + tuple_size > pstate->space_allowed)
+			{
+				batch->maximum_stripe_number++;
+				batch->estimated_stripe_size = 0;
+			}
+			batch->estimated_stripe_size += tuple_size;
+			metadata.stripe = batch->maximum_stripe_number;
+			LWLockRelease(&batch->lock);
+			/* Store the tuple its new batch. */
+			sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 
 			CHECK_FOR_INTERRUPTS();
 		}
@@ -1693,6 +1831,12 @@ retry:
 
 	if (batchno == 0)
 	{
+		/*
+		 * TODO: if spilling is enabled for batch 0 so that it can fall back,
+		 * we will need to stop loading batch 0 into the hashtable somewhere--
+		 * maybe here-- and switch to saving tuples to a file. Currently, this
+		 * will simply exceed the space allowed
+		 */
 		HashJoinTuple hashTuple;
 
 		/* Try to load it into memory. */
@@ -1715,10 +1859,17 @@ retry:
 	else
 	{
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+		ParallelHashJoinBatch *batch;
+		tupleMetadata metadata;
 
 		Assert(batchno > 0);
 
 		/* Try to preallocate space in the batch if necessary. */
+
+		/*
+		 * TODO: is it okay to only count the tuple when it doesn't fit in the
+		 * preallocated memory?
+		 */
 		if (hashtable->batches[batchno].preallocated < tuple_size)
 		{
 			if (!ExecParallelHashTuplePrealloc(hashtable, batchno, tuple_size))
@@ -1727,8 +1878,14 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
-					 tuple);
+		batch = hashtable->batches[batchno].shared;
+
+		metadata.hashvalue = hashvalue;
+		LWLockAcquire(&batch->lock, LW_SHARED);
+		metadata.stripe = batch->maximum_stripe_number;
+		LWLockRelease(&batch->lock);
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
 
@@ -2697,6 +2854,7 @@ ExecHashAccumInstrumentation(HashInstrumentation *instrument,
 									  hashtable->nbatch_original);
 	instrument->space_peak = Max(instrument->space_peak,
 								 hashtable->spacePeak);
+	instrument->fallback_batches_stats = hashtable->fallback_batches_stats;
 }
 
 /*
@@ -2850,6 +3008,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 	/* Check if it's time to grow batches or buckets. */
 	if (pstate->growth != PHJ_GROWTH_DISABLED)
 	{
+		ParallelHashJoinBatchAccessor batch = hashtable->batches[0];
+
 		Assert(curbatch == 0);
 		Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
 
@@ -2858,8 +3018,13 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		 * very large tuples or very low work_mem setting, we'll always allow
 		 * each backend to allocate at least one chunk.
 		 */
-		if (hashtable->batches[0].at_least_one_chunk &&
-			hashtable->batches[0].shared->size +
+
+		/*
+		 * TODO: get rid of this check for batch 0 and make it so that
+		 * batch 0 always has to keep trying to increase the number of batches
+		 */
+		if (!batch.shared->hashloop_fallback && batch.at_least_one_chunk &&
+			batch.shared->size +
 			chunk_size > pstate->space_allowed)
 		{
 			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -2891,6 +3056,11 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 
 	/* We are cleared to allocate a new chunk. */
 	chunk_shared = dsa_allocate(hashtable->area, chunk_size);
+
+	/*
+	 * TODO: if batch 0 will have stripes, need to account for this memory
+	 * there
+	 */
 	hashtable->batches[curbatch].shared->size += chunk_size;
 	hashtable->batches[curbatch].at_least_one_chunk = true;
 
@@ -2960,21 +3130,39 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
+
+		shared->hashloop_fallback = false;
+		/* TODO: is it okay to use the same tranche for this lock? */
+		LWLockInitialize(&shared->lock, LWTRANCHE_PARALLEL_HASH_JOIN);
+		shared->maximum_stripe_number = 0;
+		shared->estimated_stripe_size = 0;
 
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
 		 */
 		BarrierInit(&shared->batch_barrier, 0);
+		BarrierInit(&shared->stripe_barrier, 0);
+
+		/* Batch 0 doesn't need to be loaded. */
 		if (i == 0)
 		{
-			/* Batch 0 doesn't need to be loaded. */
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_STRIPING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
+
+			BarrierAttach(&shared->stripe_barrier);
+			while (BarrierPhase(&shared->stripe_barrier) < PHJ_STRIPE_PROBING)
+				BarrierArriveAndWait(&shared->stripe_barrier, 0);
+			BarrierDetach(&shared->stripe_barrier);
 		}
+		accessor->last_participating_stripe_phase = PHJ_STRIPE_INVALID_PHASE;
+		/* why isn't done initialized here ? */
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 
 		/* Initialize accessor state.  All members were zero-initialized. */
 		accessor->shared = shared;
@@ -2985,7 +3173,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2995,10 +3183,13 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->sbfileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3047,8 +3238,8 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	 * It's possible for a backend to start up very late so that the whole
 	 * join is finished and the shm state for tracking batches has already
 	 * been freed by ExecHashTableDetach().  In that case we'll just leave
-	 * hashtable->batches as NULL so that ExecParallelHashJoinNewBatch() gives
-	 * up early.
+	 * hashtable->batches as NULL so that ExecParallelHashJoinAdvanceBatch()
+	 * gives up early.
 	 */
 	if (!DsaPointerIsValid(pstate->batches))
 		return;
@@ -3070,10 +3261,12 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
-		accessor->done = false;
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
+		accessor->last_participating_stripe_phase = PHJ_STRIPE_INVALID_PHASE;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
@@ -3083,6 +3276,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->sbfileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3165,6 +3359,18 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 	}
 }
 
+bool
+ExecHashTableDetachStripe(HashJoinTable hashtable)
+{
+	int			curbatch = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+
+	BarrierDetach(stripe_barrier);
+	hashtable->curstripe = -1;
+	return false;
+}
+
 /*
  * Detach from all shared resources.  If we are last to detach, clean up.
  */
@@ -3350,13 +3556,35 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 	{
 		/*
 		 * We have determined that this batch would exceed the space budget if
-		 * loaded into memory.  Command all participants to help repartition.
+		 * loaded into memory.
 		 */
-		batch->shared->space_exhausted = true;
-		pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
-		LWLockRelease(&pstate->lock);
-
-		return false;
+		/* TODO: the nested lock is a deadlock waiting to happen. */
+		LWLockAcquire(&batch->shared->lock, LW_EXCLUSIVE);
+		if (!batch->shared->hashloop_fallback)
+		{
+			/*
+			 * This batch is not marked to fall back so command all
+			 * participants to help repartition.
+			 */
+			batch->shared->space_exhausted = true;
+			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
+			LWLockRelease(&batch->shared->lock);
+			LWLockRelease(&pstate->lock);
+			return false;
+		}
+		else if (batch->shared->estimated_stripe_size + want +
+				 HASH_CHUNK_HEADER_SIZE > pstate->space_allowed)
+		{
+			/*
+			 * This batch is marked to fall back and the current (last) stripe
+			 * does not have enough space to handle the request so we must
+			 * increment the number of stripes in the batch and reset the size
+			 * of its new last stripe.
+			 */
+			batch->shared->maximum_stripe_number++;
+			batch->shared->estimated_stripe_size = 0;
+		}
+		LWLockRelease(&batch->shared->lock);
 	}
 
 	batch->at_least_one_chunk = true;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2cdc38a601..0228b32d5a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -92,6 +92,27 @@
  * work_mem of all participants to create a large shared hash table.  If that
  * turns out either at planning or execution time to be impossible then we
  * fall back to regular work_mem sized hash tables.
+ * If a given batch causes the number of batches to be doubled and data skew
+ * causes too few or too many tuples to be relocated to the child of this batch,
+ * the batch which is now home to the skewed tuples is marked as a "fallback"
+ * batch. This means that it will be processed using multiple loops --
+ * each loop probing an arbitrary stripe of tuples from this batch
+ * which fit in work_mem or combined work_mem.
+ * This batch is no longer permitted to cause growth in the number of batches.
+ *
+ * When the inner side of a fallback batch is loaded into memory, stripes of
+ * arbitrary tuples totaling work_mem or combined work_mem in size are loaded
+ * into the hashtable. After probing this stripe, the outer side batch is
+ * rewound and the next stripe is loaded. Each stripe of the inner batch is
+ * probed until all tuples from that batch have been processed.
+ *
+ * Tuples that match are emitted (depending on the join semantics of the
+ * particular join type) during probing of the stripe. However, in order to make
+ * left outer join work, unmatched tuples cannot be emitted NULL-extended until
+ * all stripes have been probed. To address this, a bitmap is created with a bit
+ * for each tuple of the outer side. If a tuple on the outer side matches a
+ * tuple from the inner, the corresponding bit is set. At the end of probing all
+ * stripes, the executor scans the bitmap and emits unmatched outer tuples.
  *
  * To avoid deadlocks, we never wait for any barrier unless it is known that
  * all other backends attached to it are actively executing the node or have
@@ -126,7 +147,7 @@
 #define HJ_SCAN_BUCKET			3
 #define HJ_FILL_OUTER_TUPLE		4
 #define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_NEED_NEW_STRIPE      6
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +164,91 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
+static int	ExecHashJoinLoadStripe(HashJoinState *hjstate);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+static bool ExecParallelHashJoinLoadStripe(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool checkbit(HashJoinState *hjstate);
+static void set_match_bit(HashJoinState *hjstate);
+
+static pg_attribute_always_inline bool
+			IsHashloopFallback(HashJoinTable hashtable);
+
+#define UINT_BITS (sizeof(unsigned int) * CHAR_BIT)
+
+static void
+set_match_bit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	BufFile    *statusFile = hashtable->hashloop_fallback[hashtable->curbatch];
+	int			tupindex = hjstate->hj_CurNumOuterTuples - 1;
+	size_t		unit_size = sizeof(hjstate->hj_CurOuterMatchStatus);
+	off_t		offset = tupindex / UINT_BITS * unit_size;
+
+	int			fileno;
+	off_t		cursor;
+
+	BufFileTell(statusFile, &fileno, &cursor);
+
+	/* Extend the statusFile if this is stripe zero. */
+	if (hashtable->curstripe == 0)
+	{
+		for (; cursor < offset + unit_size; cursor += unit_size)
+		{
+			hjstate->hj_CurOuterMatchStatus = 0;
+			BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+		}
+	}
+
+	if (cursor != offset)
+		BufFileSeek(statusFile, 0, offset, SEEK_SET);
+
+	BufFileRead(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+	BufFileSeek(statusFile, 0, -unit_size, SEEK_CUR);
+
+	hjstate->hj_CurOuterMatchStatus |= 1U << tupindex % UINT_BITS;
+	BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+}
 
+/* return true if bit is set and false if not */
+static bool
+checkbit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *outer_match_statuses;
+
+	int			bitno = hjstate->hj_EmitOuterTupleId % UINT_BITS;
+
+	hjstate->hj_EmitOuterTupleId++;
+	outer_match_statuses = hjstate->hj_HashTable->hashloop_fallback[curbatch];
+
+	/*
+	 * if current chunk of bitmap is exhausted, read next chunk of bitmap from
+	 * outer_match_status_file
+	 */
+	if (bitno == 0)
+		BufFileRead(outer_match_statuses, &hjstate->hj_CurOuterMatchStatus,
+					sizeof(hjstate->hj_CurOuterMatchStatus));
+
+	/*
+	 * check if current tuple's match bit is set in outer match status file
+	 */
+	return hjstate->hj_CurOuterMatchStatus & (1U << bitno);
+}
+
+static bool
+IsHashloopFallback(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state)
+		return hashtable->batches[hashtable->curbatch].shared->hashloop_fallback;
+
+	if (!hashtable->hashloop_fallback)
+		return false;
+
+	return hashtable->hashloop_fallback[hashtable->curbatch];
+}
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -290,6 +392,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				hashNode->hashtable = hashtable;
 				(void) MultiExecProcNode((PlanState *) hashNode);
 
+				/*
+				 * After building the hashtable, stripe 0 of batch 0 will have
+				 * been loaded.
+				 */
+				hashtable->curstripe = 0;
+
 				/*
 				 * If the inner relation is completely empty, and we're not
 				 * doing a left outer join, we can quit without scanning the
@@ -333,12 +441,11 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 					/* Each backend should now select a batch to work on. */
 					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
 
-					continue;
+					if (!ExecParallelHashJoinNewBatch(node))
+						return NULL;
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
 				/* FALL THRU */
 
@@ -365,12 +472,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
 					}
 					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
+
+				/*
+				 * Don't reset hj_MatchedOuter after the first stripe as it
+				 * would cancel out whatever we found before
+				 */
+				if (node->hj_HashTable->curstripe == 0)
+					node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -386,9 +499,15 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * The tuple might not belong to the current batch (where
 				 * "current batch" includes the skew buckets if any).
+				 *
+				 * This should only be done once per tuple per batch. If a
+				 * batch "falls back", its inner side will be split into
+				 * stripes. Any displaced outer tuples should only be
+				 * relocated while probing the first stripe of the inner side.
 				 */
 				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO &&
+					node->hj_HashTable->curstripe == 0)
 				{
 					bool		shouldFree;
 					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
@@ -410,6 +529,13 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				/*
+				 * While probing the phantom stripe, don't increment
+				 * hj_CurNumOuterTuples or extend the bitmap
+				 */
+				if (!parallel && hashtable->curstripe != -2)
+					node->hj_CurNumOuterTuples++;
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -455,6 +581,25 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				{
 					node->hj_MatchedOuter = true;
 
+					if (HJ_FILL_OUTER(node) && IsHashloopFallback(hashtable))
+					{
+						/*
+						 * Each bit corresponds to a single tuple. Setting the
+						 * match bit keeps track of which tuples were matched
+						 * for batches which are using the block nested hashloop
+						 * fallback method. It persists this match status across
+						 * multiple stripes of tuples, each of which is loaded
+						 * into the hashtable and probed. The outer match status
+						 * file is the cumulative match status of outer tuples
+						 * for a given batch across all stripes of that inner
+						 * side batch.
+						 */
+						if (parallel)
+							sb_setbit(hashtable->batches[hashtable->curbatch].sba, econtext->ecxt_outertuple->tts_tuplenum);
+						else
+							set_match_bit(node);
+					}
+
 					if (parallel)
 					{
 						/*
@@ -508,6 +653,22 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+				if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(node))
+				{
+					if (hashtable->curstripe != -2)
+						continue;
+
+					if (parallel)
+					{
+						ParallelHashJoinBatchAccessor *accessor =
+						&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+						node->hj_MatchedOuter = sb_checkbit(accessor->sba, econtext->ecxt_outertuple->tts_tuplenum);
+					}
+					else
+						node->hj_MatchedOuter = checkbit(node);
+				}
+
 				if (!node->hj_MatchedOuter &&
 					HJ_FILL_OUTER(node))
 				{
@@ -534,7 +695,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
@@ -550,19 +711,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					InstrCountFiltered2(node, 1);
 				break;
 
-			case HJ_NEED_NEW_BATCH:
+			case HJ_NEED_NEW_STRIPE:
 
 				/*
-				 * Try to advance to next batch.  Done if there are no more.
+				 * Try to advance to next stripe. Then try to advance to the
+				 * next batch if there are no more stripes in this batch. Done
+				 * if there are no more batches.
 				 */
 				if (parallel)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
+					if (!ExecParallelHashJoinLoadStripe(node) &&
+						!ExecParallelHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-aware join */
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
+					if (!ExecHashJoinLoadStripe(node) &&
+						!ExecHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-oblivious join */
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
@@ -751,6 +916,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_JoinState = HJ_BUILD_HASHTABLE;
 	hjstate->hj_MatchedOuter = false;
 	hjstate->hj_OuterNotEmpty = false;
+	hjstate->hj_CurNumOuterTuples = 0;
+	hjstate->hj_CurOuterMatchStatus = 0;
 
 	return hjstate;
 }
@@ -917,15 +1084,24 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	}
 	else if (curbatch < hashtable->nbatch)
 	{
+		tupleMetadata metadata;
 		MinimalTuple tuple;
 
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+									   &metadata);
+		*hashvalue = metadata.hashvalue;
+
 		if (tuple != NULL)
 		{
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+
+			/*
+			 * TODO: should we use tupleid instead of position in the serial
+			 * case too?
+			 */
+			hjstate->hj_OuterTupleSlot->tts_tuplenum = metadata.tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
@@ -949,24 +1125,37 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
+	BufFile    *innerFile = NULL;
+	BufFile    *outerFile = NULL;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
 
-	if (curbatch > 0)
+	/*
+	 * We no longer need the previous outer batch file; close it right away to
+	 * free disk space.
+	 */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
-		/*
-		 * We no longer need the previous outer batch file; close it right
-		 * away to free disk space.
-		 */
-		if (hashtable->outerBatchFile[curbatch])
-			BufFileClose(hashtable->outerBatchFile[curbatch]);
+		BufFileClose(hashtable->outerBatchFile[curbatch]);
 		hashtable->outerBatchFile[curbatch] = NULL;
 	}
-	else						/* we just finished the first batch */
+	if (IsHashloopFallback(hashtable))
+	{
+		BufFileClose(hashtable->hashloop_fallback[curbatch]);
+		hashtable->hashloop_fallback[curbatch] = NULL;
+	}
+
+	/*
+	 * We are surely done with the inner batch file now
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+	{
+		BufFileClose(hashtable->innerBatchFile[curbatch]);
+		hashtable->innerBatchFile[curbatch] = NULL;
+	}
+
+	if (curbatch == 0)			/* we just finished the first batch */
 	{
 		/*
 		 * Reset some of the skew optimization state variables, since we no
@@ -1030,45 +1219,68 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	hashtable->curstripe = -1;
+	hjstate->hj_CurNumOuterTuples = 0;
 
-	/*
-	 * Reload the hash table with the new inner batch (which could be empty)
-	 */
-	ExecHashTableReset(hashtable);
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+		innerFile = hashtable->innerBatchFile[curbatch];
+
+	if (innerFile && BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	/* Need to rewind outer when this is the first stripe of a new batch */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
+		outerFile = hashtable->outerBatchFile[curbatch];
+
+	if (outerFile && BufFileSeek(outerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
 
-	innerFile = hashtable->innerBatchFile[curbatch];
+	ExecHashJoinLoadStripe(hjstate);
+	return true;
+}
+
+static inline void
+InstrIncrBatchStripes(List *fallback_batches_stats, int curbatch)
+{
+	ListCell   *lc;
 
-	if (innerFile != NULL)
+	foreach(lc, fallback_batches_stats)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
+		FallbackBatchStats *fallback_batch_stats = lfirst(lc);
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
-												 &hashvalue,
-												 hjstate->hj_HashTupleSlot)))
+		if (fallback_batch_stats->batchno == curbatch)
 		{
-			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
-			 */
-			ExecHashTableInsert(hashtable, slot, hashvalue);
+			fallback_batch_stats->numstripes++;
+			break;
 		}
-
-		/*
-		 * after we build the hash table, the inner batch file is no longer
-		 * needed
-		 */
-		BufFileClose(innerFile);
-		hashtable->innerBatchFile[curbatch] = NULL;
 	}
+}
+
+/*
+ * Returns false when the inner batch file is exhausted
+ */
+static int
+ExecHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+	bool		loaded_inner = false;
+
+	if (hashtable->curstripe == -2)
+		return false;
 
 	/*
 	 * Rewind outer batch file (if present), so that we can start reading it.
+	 * TODO: This is only necessary if this is not the first stripe of the
+	 * batch
 	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
 		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
 			ereport(ERROR,
@@ -1076,9 +1288,78 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
 
-	return true;
+	hashtable->curstripe++;
+
+	if (!hashtable->innerBatchFile || !hashtable->innerBatchFile[curbatch])
+		return false;
+
+	/*
+	 * Reload the hash table with the new inner stripe
+	 */
+	ExecHashTableReset(hashtable);
+
+	while ((slot = ExecHashJoinGetSavedTuple(hjstate,
+											 hashtable->innerBatchFile[curbatch],
+											 &hashvalue,
+											 hjstate->hj_HashTupleSlot)))
+	{
+		/*
+		 * NOTE: some tuples may be sent to future batches.  Also, it is
+		 * possible for hashtable->nbatch to be increased here!
+		 */
+		uint32		hashTupleSize;
+		/*
+		 * TODO: wouldn't it be cool if this returned the size of the tuple
+		 * inserted
+		 */
+		ExecHashTableInsert(hashtable, slot, hashvalue);
+		loaded_inner = true;
+
+		if (!IsHashloopFallback(hashtable))
+			continue;
+
+		hashTupleSize = slot->tts_ops->get_minimal_tuple(slot)->t_len + HJTUPLE_OVERHEAD;
+
+		if (hashtable->spaceUsed + hashTupleSize +
+			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
+			> hashtable->spaceAllowed)
+			break;
+	}
+
+	/*
+	 * if we didn't load anything and it is a FOJ/LOJ fallback batch, we will
+	 * transition to emit unmatched outer tuples next. we want to know how
+	 * many tuples were in the batch in that case, so don't zero it out then
+	 */
+
+	/*
+	 * if we loaded anything into the hashtable or it is the phantom stripe,
+	 * must proceed to probing
+	 */
+	if (loaded_inner)
+	{
+		hjstate->hj_CurNumOuterTuples = 0;
+		InstrIncrBatchStripes(hashtable->fallback_batches_stats, curbatch);
+		return true;
+	}
+
+	if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(hjstate))
+	{
+		/*
+		 * if we didn't load anything and it is a fallback batch, we will
+		 * prepare to emit outer tuples during the phantom stripe probing
+		 */
+		hashtable->curstripe = -2;
+		hjstate->hj_EmitOuterTupleId = 0;
+		hjstate->hj_CurOuterMatchStatus = 0;
+		BufFileSeek(hashtable->hashloop_fallback[curbatch], 0, 0, SEEK_SET);
+		BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET);
+		return true;
+	}
+	return false;
 }
 
+
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
  * false if there are no more batches.
@@ -1101,11 +1382,35 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
 	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_STRIPING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_STRIPING
+	 * case). The PHJ_BATCH_STRIPING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
 	 */
+
 	if (hashtable->curbatch >= 0)
 	{
-		hashtable->batches[hashtable->curbatch].done = true;
+		ParallelHashJoinBatchAccessor *batch_accessor = &hashtable->batches[hashtable->curbatch];
+		if (IsHashloopFallback(hashtable))
+		{
+			/*
+			 * If this worker just finished a stripe in a batch in which it was
+			 * not the last participant, it will have saved the phase at the time
+			 * that it detached. Set the "done" flag to 0 to indicate it is only
+			 * provisionally done with this batch, giving it a chance to return
+			 * later and participate if the batch is making progress without
+			 * risking deadlock.
+			 */
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+			if (batch_accessor->last_participating_stripe_phase > PHJ_STRIPE_INVALID_PHASE)
+				batch_accessor->done = PHJ_BATCH_ACCESSOR_PROVISIONALLY_DONE;
+			else
+				batch_accessor->done = PHJ_BATCH_ACCESSOR_DONE;
+		}
+		else
+			batch_accessor->done = PHJ_BATCH_ACCESSOR_DONE;
 		ExecHashTableDetachBatch(hashtable);
 	}
 
@@ -1119,13 +1424,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		hashtable->nbatch;
 	do
 	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
-		if (!hashtable->batches[batchno].done)
+		if (hashtable->batches[batchno].done != PHJ_BATCH_ACCESSOR_DONE)
 		{
-			SharedTuplestoreAccessor *inner_tuples;
 			Barrier    *batch_barrier =
 			&hashtable->batches[batchno].shared->batch_barrier;
 
@@ -1135,51 +1435,47 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 
 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
-											 WAIT_EVENT_HASH_BATCH_ELECT))
+					                         WAIT_EVENT_HASH_BATCH_ELECT))
+					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
+
+						/*
+						 * one worker needs to 0 out the read_pages of all the
+						 * participants in the new batch
+						 */
+						sts_reinitialize(hashtable->batches[batchno].inner_tuples);
+					}
 					/* Fall through. */
 
 				case PHJ_BATCH_ALLOCATING:
 					/* Wait for allocation to complete. */
 					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_ALLOCATE);
+					                     WAIT_EVENT_HASH_BATCH_ALLOCATE);
 					/* Fall through. */
 
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOAD);
-					/* Fall through. */
-
-				case PHJ_BATCH_PROBING:
+				case PHJ_BATCH_STRIPING:
 
+					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
+					sts_begin_parallel_scan(hashtable->batches[batchno].inner_tuples);
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_initialize_accessor(hashtable->batches[hashtable->curbatch].sba,
+											   sts_get_tuplenum(hashtable->batches[hashtable->curbatch].outer_tuples));
+					hashtable->curstripe = -1;
+					if (ExecParallelHashJoinLoadStripe(hjstate))
+						return true;
 					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
+					 * ExecParallelHashJoinLoadStripe() will return false from
+					 * here when no more work can be done by this worker on
+					 * this batch. Until further optimized, this worker will
+					 * have detached from the stripe_barrier and should close
+					 * its outer match statuses bitmap and then detach from the
+					 * batch. In order to reuse the code below, fall through,
+					 * even though the phase will not have been advanced
 					 */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
-					return true;
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_end_write(hashtable->batches[batchno].sba);
+
+					/* Fall through. */
 
 				case PHJ_BATCH_DONE:
 
@@ -1188,7 +1484,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * remain).
 					 */
 					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
+					hashtable->batches[batchno].done = PHJ_BATCH_ACCESSOR_DONE;
 					hashtable->curbatch = -1;
 					break;
 
@@ -1203,6 +1499,274 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	return false;
 }
 
+
+
+/*
+ * Returns true if ready to probe and false if the inner is exhausted
+ * (there are no more stripes)
+ */
+bool
+ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			batchno = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+	ParallelHashJoinBatchAccessor *accessor;
+	dsa_pointer_atomic *buckets;
+
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	if (hashtable->curstripe >= 0)
+	{
+		/*
+		 * After finishing with participating in a stripe, if a worker is the
+		 * only one working on a batch, it will continue working on it.
+		 * However, if a worker is not the only worker working on a batch, it
+		 * would risk deadlock if it waits on the barrier. Instead, it saves
+		 * the current stripe phase and move on. Later, when it comes back to
+		 * this batch, if the stripe phase hasn't advanced from when it last
+		 * participated, it will mark the batch done and never return. If the
+		 * stripe barrier has advanced, then, it will participate again in the
+		 * batch.
+		 *
+		 * It would be more efficient if workers did not detach from the stripe
+		 * barrier and close their outer match status bitmaps after failing to
+		 * be the last worker. When they rejoin, they will have to create new
+		 * bitmaps and re-attach to the stripe barrier.
+		 *
+		 * Originally, the patch had workers keep their bitmaps open, however,
+		 * there were some synchronization problems with workers having outer
+		 * match status bitmaps for multiple batches open at the same time.
+		 */
+		if (!BarrierArriveAndDetach(stripe_barrier))
+		{
+			hashtable->batches[batchno].last_participating_stripe_phase = BarrierPhase(stripe_barrier);
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+			hashtable->curstripe = -1;
+			return false;
+		}
+
+		/*
+		 * This isn't a race condition if no other workers can stay attached to
+		 * this barrier in the intervening time. Basically, if you attach to a
+		 * stripe barrier in the PHJ_STRIPE_DONE phase,
+		 * detach immediately and move on.
+		 */
+		BarrierAttach(stripe_barrier);
+	}
+	else if (hashtable->curstripe == -1)
+	{
+		ParallelHashJoinBatchAccessor *batch_accessor = &hashtable->batches[batchno];
+		int			phase;
+
+		phase = BarrierAttach(stripe_barrier);
+
+		/*
+		 * If the phase hasn't advanced since the last time this worker
+		 * checked, detach and return to pick another batch. Only check this
+		 * if the worker has worked on this batch before. Workers are not permitted
+		 * to join after the batch has progressed past its first stripe.
+		 */
+		if (batch_accessor->done == PHJ_BATCH_ACCESSOR_PROVISIONALLY_DONE &&
+			batch_accessor->last_participating_stripe_phase == phase)
+			return ExecHashTableDetachStripe(hashtable);
+
+		/*
+		 * If a worker enters this phase machine on a stripe number greater
+		 * than the batch's maximum stripe number, then: 1) The batch is done,
+		 * or 2) The batch is on the phantom stripe that's used for hashloop
+		 * fallback Either way the worker can't contribute so just detach and
+		 * move on.
+		 */
+
+		if (PHJ_STRIPE_NUMBER(phase) > batch->maximum_stripe_number ||
+			PHJ_STRIPE_PHASE(phase) == PHJ_STRIPE_DONE)
+			return ExecHashTableDetachStripe(hashtable);
+	}
+	else if (hashtable->curstripe == -2)
+	{
+		sts_end_parallel_scan(outer_tuples);
+		/*
+		 * TODO: ideally this would go somewhere in the batch phase machine
+		 * Putting it in ExecHashTableDetachBatch didn't do the trick
+		 */
+		sb_end_read(hashtable->batches[batchno].sba);
+		return ExecHashTableDetachStripe(hashtable);
+	}
+
+	hashtable->curstripe = PHJ_STRIPE_NUMBER(BarrierPhase(stripe_barrier));
+
+	/*
+	 * The outer side is exhausted and either 1) the current stripe of the
+	 * inner side is exhausted and it is time to advance the stripe 2) the
+	 * last stripe of the inner side is exhausted and it is time to advance
+	 * the batch
+	 */
+	for (;;)
+	{
+		int			phase = BarrierPhase(stripe_barrier);
+
+		switch (PHJ_STRIPE_PHASE(phase))
+		{
+			case PHJ_STRIPE_ELECTING:
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_ELECT))
+				{
+					sts_reinitialize(outer_tuples);
+
+					/*
+					 * set the rewound flag back to false to prepare for the
+					 * next stripe
+					 */
+					sts_reset_rewound(inner_tuples);
+				}
+
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_RESETTING:
+				/* TODO: not needed for phantom stripe */
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_RESET);
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_LOADING:
+				{
+					MinimalTuple tuple;
+					tupleMetadata metadata;
+
+					/*
+					 * Start (or join in) loading the next stripe of inner
+					 * tuples.
+					 */
+
+					/*
+					 * I'm afraid there potential issue if a worker joins in
+					 * this phase and doesn't do the actions and resetting of
+					 * variables in sts_resume_parallel_scan. that is, if it
+					 * doesn't reset start_page and read_next_page in between
+					 * stripes. For now, call it. However, I think it might be
+					 * able to be removed.
+					 */
+
+					/*
+					 * TODO: sts_resume_parallel_scan() is overkill for stripe
+					 * 0 of each batch
+					 */
+					sts_resume_parallel_scan(inner_tuples);
+
+					while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+					{
+						/* The tuple is from a previous stripe. Skip it */
+						if (metadata.stripe < PHJ_STRIPE_NUMBER(phase))
+							continue;
+
+						/*
+						 * tuple from future. time to back out read_page. end
+						 * of stripe
+						 */
+						if (metadata.stripe > PHJ_STRIPE_NUMBER(phase))
+						{
+							sts_parallel_scan_rewind(inner_tuples);
+							continue;
+						}
+
+						ExecForceStoreMinimalTuple(tuple, hjstate->hj_HashTupleSlot, false);
+						ExecParallelHashTableInsertCurrentBatch(
+																hashtable,
+																hjstate->hj_HashTupleSlot,
+																metadata.hashvalue);
+					}
+					BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD);
+				}
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_PROBING:
+
+				/*
+				 * do this again here in case a worker began the scan and then
+				 * entered after loading before probing
+				 */
+				sts_end_parallel_scan(inner_tuples);
+				sts_begin_parallel_scan(outer_tuples);
+				return true;
+
+			case PHJ_STRIPE_DONE:
+
+				if (PHJ_STRIPE_NUMBER(phase) >= batch->maximum_stripe_number)
+				{
+					/*
+					 * Handle the phantom stripe case.
+					 */
+					if (batch->hashloop_fallback && HJ_FILL_OUTER(hjstate))
+						goto fallback_stripe;
+
+					/* Return if this is the last stripe */
+					return ExecHashTableDetachStripe(hashtable);
+				}
+
+				/* this, effectively, increments the stripe number */
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+				{
+					/*
+					 * reset inner's hashtable and recycle the existing bucket array.
+					 */
+					buckets = (dsa_pointer_atomic *)
+						dsa_get_address(hashtable->area, batch->buckets);
+
+					for (size_t i = 0; i < hashtable->nbuckets; ++i)
+						dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+				}
+
+				hashtable->curstripe++;
+				continue;
+
+			default:
+				elog(ERROR, "unexpected stripe phase %d. pid %i. batch %i.", BarrierPhase(stripe_barrier), MyProcPid, batchno);
+		}
+	}
+
+fallback_stripe:
+	accessor = &hashtable->batches[hashtable->curbatch];
+	sb_end_write(accessor->sba);
+
+	/* Ensure that only a single worker is attached to the barrier */
+	if (!BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+		return ExecHashTableDetachStripe(hashtable);
+
+
+	/* No one except the last worker will run this code */
+	hashtable->curstripe = -2;
+
+	/*
+	 * reset inner's hashtable and recycle the existing bucket array.
+	 */
+	buckets = (dsa_pointer_atomic *)
+		dsa_get_address(hashtable->area, batch->buckets);
+
+	for (size_t i = 0; i < hashtable->nbuckets; ++i)
+		dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+
+	/*
+	 * If all workers (including this one) have finished probing the batch,
+	 * one worker is elected to Loop through the outer match status files from
+	 * all workers that were attached to this batch Combine them into one
+	 * bitmap Use the bitmap, loop through the outer batch file again, and
+	 * emit unmatched tuples All workers will detach from the batch barrier
+	 * and the last worker will clean up the hashtable. All workers except the
+	 * last worker will end their scans of the outer and inner side. The last
+	 * worker will end its scan of the inner side
+	 */
+
+	sb_combine(accessor->sba);
+	sts_reinitialize(outer_tuples);
+
+	sts_begin_parallel_scan(outer_tuples);
+
+	return true;
+}
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1372,6 +1936,9 @@ ExecReScanHashJoin(HashJoinState *node)
 	node->hj_MatchedOuter = false;
 	node->hj_FirstOuterTupleSlot = NULL;
 
+	node->hj_CurNumOuterTuples = 0;
+	node->hj_CurOuterMatchStatus = 0;
+
 	/*
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
@@ -1402,7 +1969,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	TupleTableSlot *slot;
-	uint32		hashvalue;
 	int			i;
 
 	Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1410,6 +1976,8 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
+
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1418,17 +1986,23 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 								 hjstate->hj_OuterHashKeys,
 								 true,	/* outer tuple */
 								 HJ_FILL_OUTER(hjstate),
-								 &hashvalue))
+								 &metadata.hashvalue))
 		{
 			int			batchno;
 			int			bucketno;
 			bool		shouldFree;
+			SharedTuplestoreAccessor *accessor;
+
 			MinimalTuple mintup = ExecFetchSlotMinimalTuple(slot, &shouldFree);
 
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
-			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+			accessor = hashtable->batches[batchno].outer_tuples;
+
+			/* cannot count on deterministic order of tupleids */
+			metadata.tupleid = sts_increment_ntuples(accessor);
+
+			sts_puttuple(hashtable->batches[batchno].outer_tuples, &metadata.hashvalue, mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
@@ -1494,6 +2068,7 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 
 	/* Set up the space we'll use for shared temporary files. */
 	SharedFileSetInit(&pstate->fileset, pcxt->seg);
+	SharedFileSetInit(&pstate->sbfileset, pcxt->seg);
 
 	/* Initialize the shared state in the hash node. */
 	hashNode = (HashState *) innerPlanState(state);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..49174f1690 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3783,8 +3783,17 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_ELECT:
 			event_name = "HashBatchElect";
 			break;
-		case WAIT_EVENT_HASH_BATCH_LOAD:
-			event_name = "HashBatchLoad";
+		case WAIT_EVENT_HASH_STRIPE_ELECT:
+			event_name = "HashStripeElect";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_RESET:
+			event_name = "HashStripeReset";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_LOAD:
+			event_name = "HashStripeLoad";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_PROBE:
+			event_name = "HashStripeProbe";
 			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATE:
 			event_name = "HashBuildAllocate";
diff --git a/src/backend/storage/ipc/barrier.c b/src/backend/storage/ipc/barrier.c
index 3e200e02cc..2bfd7e6052 100644
--- a/src/backend/storage/ipc/barrier.c
+++ b/src/backend/storage/ipc/barrier.c
@@ -308,4 +308,4 @@ BarrierDetachImpl(Barrier *barrier, bool arrive)
 		ConditionVariableBroadcast(&barrier->condition_variable);
 
 	return last;
-}
+}
\ No newline at end of file
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261..f11fe85aeb 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	logtape.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedbits.c b/src/backend/utils/sort/sharedbits.c
new file mode 100644
index 0000000000..37df04844e
--- /dev/null
+++ b/src/backend/utils/sort/sharedbits.c
@@ -0,0 +1,285 @@
+#include "postgres.h"
+#include "storage/buffile.h"
+#include "utils/sharedbits.h"
+
+/*
+ * TODO: put a comment about not currently supporting parallel scan of the SharedBits
+ * To support parallel scan, need to introduce many more mechanisms
+ */
+
+/* Per-participant shared state */
+struct SharedBitsParticipant
+{
+	bool		present;
+	bool		writing;
+};
+
+/* Shared control object */
+struct SharedBits
+{
+	int			nparticipants;	/* Number of participants that can write. */
+	int64		nbits;
+	char		name[NAMEDATALEN];	/* A name for this bitstore. */
+
+	SharedBitsParticipant participants[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* backend-local state */
+struct SharedBitsAccessor
+{
+	int			participant;
+	SharedBits *bits;
+	SharedFileSet *fileset;
+	BufFile    *write_file;
+	BufFile    *combined;
+};
+
+SharedBitsAccessor *
+sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset)
+{
+	SharedBitsAccessor *accessor = palloc0(sizeof(SharedBitsAccessor));
+
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+SharedBitsAccessor *
+sb_initialize(SharedBits *sbits,
+			  int participants,
+			  int my_participant_number,
+			  SharedFileSet *fileset,
+			  char *name)
+{
+	SharedBitsAccessor *accessor;
+
+	sbits->nparticipants = participants;
+	strcpy(sbits->name, name);
+	sbits->nbits = 0;			/* TODO: maybe delete this */
+
+	accessor = palloc0(sizeof(SharedBitsAccessor));
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+/*  TODO: is "initialize_accessor" a clear enough API for this? (making the file)? */
+void
+sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits)
+{
+	char		name[MAXPGPATH];
+	uint32		num_to_write;
+
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, accessor->participant);
+
+	accessor->write_file =
+		BufFileCreateShared(accessor->fileset, name);
+
+	accessor->bits->participants[accessor->participant].present = true;
+	/* TODO: check this math. tuplenumber will be too high? */
+	num_to_write = nbits / 8 + 1;
+
+	/*
+	 * TODO: add tests that could exercise a problem with junk being written
+	 * to bitmap
+	 */
+
+	/*
+	 * TODO: is there a better way to write the bytes to the file without
+	 * calling BufFileWrite() like this? palloc()ing an undetermined number of
+	 * bytes feels like it is against the spirit of this patch to begin with,
+	 * but the many function calls seem expensive
+	 */
+	for (int i = 0; i < num_to_write; i++)
+	{
+		unsigned char byteToWrite = 0;
+
+		BufFileWrite(accessor->write_file, &byteToWrite, 1);
+	}
+
+	if (BufFileSeek(accessor->write_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+size_t
+sb_estimate(int participants)
+{
+	return offsetof(SharedBits, participants) + participants * sizeof(SharedBitsParticipant);
+}
+
+
+void
+sb_setbit(SharedBitsAccessor *accessor, uint64 bit)
+{
+	SharedBitsParticipant *const participant =
+	&accessor->bits->participants[accessor->participant];
+
+	/* TODO: use an unsigned int instead of a byte */
+	unsigned char current_outer_byte;
+
+	Assert(accessor->write_file);
+
+	if (!participant->writing)
+	{
+		participant->writing = true;
+	}
+
+	BufFileSeek(accessor->write_file, 0, bit / 8, SEEK_SET);
+	BufFileRead(accessor->write_file, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (bit % 8);
+
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_CUR);
+	BufFileWrite(accessor->write_file, &current_outer_byte, 1);
+}
+
+bool
+sb_checkbit(SharedBitsAccessor *accessor, uint32 n)
+{
+	bool		match;
+	uint32		bytenum = n / 8;
+	unsigned char bit = n % 8;
+	unsigned char byte_to_check = 0;
+
+	Assert(accessor->combined);
+
+	/* seek to byte to check */
+	if (BufFileSeek(accessor->combined,
+					0,
+					bytenum,
+					SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not rewind shared outer temporary file: %m")));
+	/* read byte containing ntuple bit */
+	if (BufFileRead(accessor->combined, &byte_to_check, 1) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	/* if bit is set */
+	match = ((byte_to_check) >> bit) & 1;
+
+	return match;
+}
+
+BufFile *
+sb_combine(SharedBitsAccessor *accessor)
+{
+	/*
+	 * TODO: this tries to close an outer match status file for each
+	 * participant in the tuplestore. technically, only participants in the
+	 * barrier could have outer match status files, however, all but one
+	 * participant continue on and detach from the barrier so we won't have a
+	 * reliable way to close only files for those attached to the barrier
+	 */
+	BufFile   **statuses;
+	BufFile    *combined_bitmap_file;
+	int			statuses_length;
+
+	int			nbparticipants = 0;
+
+	for (int l = 0; l < accessor->bits->nparticipants; l++)
+	{
+		SharedBitsParticipant participant = accessor->bits->participants[l];
+
+		if (participant.present)
+		{
+			Assert(!participant.writing);
+			nbparticipants++;
+		}
+	}
+	statuses = palloc(sizeof(BufFile *) * nbparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	statuses_length = 0;
+
+	for (int i = 0; i < accessor->bits->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+		BufFile    *file;
+
+		/* TODO: make a function that will do this */
+		snprintf(bitmap_filename, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, i);
+
+		if (!accessor->bits->participants[i].present)
+			continue;
+		file = BufFileOpenShared(accessor->fileset, bitmap_filename);
+
+		Assert(file);
+
+		statuses[statuses_length++] = file;
+	}
+
+	combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)	/* make it while not EOF */
+	{
+		/*
+		 * TODO: make this use an unsigned int instead of a byte so it isn't
+		 * so slow
+		 */
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	accessor->combined = combined_bitmap_file;
+	return combined_bitmap_file;
+}
+
+void
+sb_end_write(SharedBitsAccessor *sba)
+{
+	SharedBitsParticipant
+			   *const participant = &sba->bits->participants[sba->participant];
+
+	participant->writing = false;
+
+	/*
+	 * TODO: this should not be needed if flow is correct. need to fix that
+	 * and get rid of this check
+	 */
+	if (sba->write_file)
+		BufFileClose(sba->write_file);
+	sba->write_file = NULL;
+}
+
+void
+sb_end_read(SharedBitsAccessor *accessor)
+{
+	if (accessor->combined == NULL)
+		return;
+
+	BufFileClose(accessor->combined);
+	accessor->combined = NULL;
+}
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index c3ab494a45..0e3b3de2b6 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -52,6 +52,7 @@ typedef struct SharedTuplestoreParticipant
 {
 	LWLock		lock;
 	BlockNumber read_page;		/* Page number for next read. */
+	bool		rewound;
 	BlockNumber npages;			/* Number of pages written. */
 	bool		writing;		/* Used only for assertions. */
 } SharedTuplestoreParticipant;
@@ -60,6 +61,7 @@ typedef struct SharedTuplestoreParticipant
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;	/* Number of tuples in this tuplestore. */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -85,6 +87,8 @@ struct SharedTuplestoreAccessor
 	char	   *read_buffer;	/* A buffer for loading tuples. */
 	size_t		read_buffer_size;
 	BlockNumber read_next_page; /* Lowest block we'll consider reading. */
+	BlockNumber start_page;		/* page to reset p->read_page to if back out
+								 * required */
 
 	/* State for writing. */
 	SharedTuplestoreChunk *write_chunk; /* Buffer for writing. */
@@ -137,6 +141,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -158,6 +163,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 		LWLockInitialize(&sts->participants[i].lock,
 						 LWTRANCHE_SHARED_TUPLESTORE);
 		sts->participants[i].read_page = 0;
+		sts->participants[i].rewound = false;
 		sts->participants[i].writing = false;
 	}
 
@@ -277,6 +283,45 @@ sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor)
 	accessor->read_participant = accessor->participant;
 	accessor->read_file = NULL;
 	accessor->read_next_page = 0;
+	accessor->start_page = 0;
+}
+
+void
+sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor)
+{
+	int			i PG_USED_FOR_ASSERTS_ONLY;
+	SharedTuplestoreParticipant *p;
+
+	/* End any existing scan that was in progress. */
+	sts_end_parallel_scan(accessor);
+
+	/*
+	 * Any backend that might have written into this shared tuplestore must
+	 * have called sts_end_write(), so that all buffers are flushed and the
+	 * files have stopped growing.
+	 */
+	for (i = 0; i < accessor->sts->nparticipants; ++i)
+		Assert(!accessor->sts->participants[i].writing);
+
+	/*
+	 * We will start out reading the file that THIS backend wrote.  There may
+	 * be some caching locality advantage to that.
+	 */
+
+	/*
+	 * TODO: does this still apply in the multi-stripe case? It seems like if
+	 * a participant file is exhausted for the current stripe it might be
+	 * better to remember that
+	 */
+	accessor->read_participant = accessor->participant;
+	accessor->read_file = NULL;
+	p = &accessor->sts->participants[accessor->read_participant];
+
+	/* TODO: find a better solution than this for resuming the parallel scan */
+	LWLockAcquire(&p->lock, LW_SHARED);
+	accessor->start_page = p->read_page;
+	LWLockRelease(&p->lock);
+	accessor->read_next_page = 0;
 }
 
 /*
@@ -295,6 +340,7 @@ sts_end_parallel_scan(SharedTuplestoreAccessor *accessor)
 		BufFileClose(accessor->read_file);
 		accessor->read_file = NULL;
 	}
+	accessor->start_page = 0;
 }
 
 /*
@@ -531,7 +577,13 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	for (;;)
 	{
 		/* Can we read more tuples from the current chunk? */
-		if (accessor->read_ntuples < accessor->read_ntuples_available)
+		/*
+		 * Added a check for accessor->read_file being present here, as it
+		 * became relevant for adaptive hashjoin. Not sure if this has other
+		 * consequences for correctness
+		 */
+
+		if (accessor->read_ntuples < accessor->read_ntuples_available && accessor->read_file)
 			return sts_read_tuple(accessor, meta_data);
 
 		/* Find the location of a new chunk to read. */
@@ -541,7 +593,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 		/* We can skip directly past overflow pages we know about. */
 		if (p->read_page < accessor->read_next_page)
 			p->read_page = accessor->read_next_page;
-		eof = p->read_page >= p->npages;
+		eof = p->read_page >= p->npages || p->rewound;
 		if (!eof)
 		{
 			/* Claim the next chunk. */
@@ -549,9 +601,22 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			/* Advance the read head for the next reader. */
 			p->read_page += STS_CHUNK_PAGES;
 			accessor->read_next_page = p->read_page;
+
+			/*
+			 * initialize start_page to the read_page this participant will
+			 * start reading from
+			 */
+			accessor->start_page = read_page;
 		}
 		LWLockRelease(&p->lock);
 
+		if (!eof)
+		{
+			char		name[MAXPGPATH];
+
+			sts_filename(name, accessor, accessor->read_participant);
+		}
+
 		if (!eof)
 		{
 			SharedTuplestoreChunk chunk_header;
@@ -613,6 +678,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			if (accessor->read_participant == accessor->participant)
 				break;
 			accessor->read_next_page = 0;
+			accessor->start_page = 0;
 
 			/* Go around again, so we can get a chunk from this file. */
 		}
@@ -621,6 +687,48 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+void
+sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor)
+{
+	SharedTuplestoreParticipant *p =
+	&accessor->sts->participants[accessor->read_participant];
+
+	/*
+	 * Only set the read_page back to the start of the sts_chunk this worker
+	 * was reading if some other worker has not already done so. It could be
+	 * the case that this worker saw a tuple from a future stripe and another
+	 * worker did also in its sts_chunk and it already set read_page to its
+	 * start_page If so, we want to set read_page to the lowest value to
+	 * ensure that we read all tuples from the stripe (don't miss tuples)
+	 */
+	LWLockAcquire(&p->lock, LW_EXCLUSIVE);
+	p->read_page = Min(p->read_page, accessor->start_page);
+	p->rewound = true;
+	LWLockRelease(&p->lock);
+
+	accessor->read_ntuples_available = 0;
+	accessor->read_next_page = 0;
+}
+
+void
+sts_reset_rewound(SharedTuplestoreAccessor *accessor)
+{
+	for (int i = 0; i < accessor->sts->nparticipants; ++i)
+		accessor->sts->participants[i].rewound = false;
+}
+
+uint32
+sts_increment_ntuples(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ba661d32a6..0ba9d856c8 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -46,6 +46,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		settings;		/* print modified settings */
+	bool		usage;			/* print memory usage */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
 	int			indent;			/* current indentation level */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 79b634e8ed..e1c8d78e60 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -152,6 +153,7 @@ typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
+	Barrier		stripe_barrier; /* synchronization for stripes */
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
@@ -160,6 +162,17 @@ typedef struct ParallelHashJoinBatch
 	size_t		old_ntuples;	/* number of tuples before repartitioning */
 	bool		space_exhausted;
 
+	/* Adaptive HashJoin */
+
+	/*
+	 * after finishing build phase, hashloop_fallback cannot change, and does
+	 * not require a lock to read
+	 */
+	bool		hashloop_fallback;
+	int			maximum_stripe_number; /* the number of stripes in the batch */
+	size_t		estimated_stripe_size;	/* size of last stripe in batch */
+	LWLock		lock;
+
 	/*
 	 * Variable-sized SharedTuplestore objects follow this struct in memory.
 	 * See the accessor macros below.
@@ -177,10 +190,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -204,9 +224,18 @@ typedef struct ParallelHashJoinBatchAccessor
 	size_t		old_ntuples;	/* how many tuples before repartitioning? */
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
-	bool		done;			/* flag to remember that a batch is done */
+	int			done;			/* flag to remember that a batch is done */
+	/* -1 for not done, 0 for tentatively done, 1 for done */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
+	/*
+	 * All participants except the last worker working on a batch which has
+	 * fallen back to hashloop processing save the stripe barrier phase and
+	 * detach to avoid the deadlock hazard of waiting on a barrier after
+	 * tuples have been emitted.
+	 */
+	int			last_participating_stripe_phase;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -227,6 +256,19 @@ typedef enum ParallelHashGrowth
 	PHJ_GROWTH_DISABLED
 } ParallelHashGrowth;
 
+typedef enum ParallelHashJoinBatchAccessorStatus
+{
+	/* No more useful work can be done on this batch by this worker */
+	PHJ_BATCH_ACCESSOR_DONE,
+	/*
+	 * This worker has probed a stripe of this batch but wasn't the last worker,
+	 * so it will check back later to see if it can work on this batch again
+	 */
+	PHJ_BATCH_ACCESSOR_PROVISIONALLY_DONE,
+	/* The worker has not yet checked this batch to see if it can do useful work */
+	PHJ_BATCH_ACCESSOR_NOT_DONE
+} ParallelHashJoinBatchAccessorStatus;
+
 /*
  * The shared state used to coordinate a Parallel Hash Join.  This is stored
  * in the DSM segment.
@@ -251,6 +293,7 @@ typedef struct ParallelHashJoinState
 	pg_atomic_uint32 distributor;	/* counter for load balancing */
 
 	SharedFileSet fileset;		/* space for shared temporary files */
+	SharedFileSet sbfileset;
 } ParallelHashJoinState;
 
 /* The phases for building batches, used by build_barrier. */
@@ -263,9 +306,18 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
-#define PHJ_BATCH_DONE					4
+#define PHJ_BATCH_STRIPING				2
+#define PHJ_BATCH_DONE					3
+
+/* The phases for probing each stripe of each batch used with stripe barriers */
+#define PHJ_STRIPE_INVALID_PHASE        -1
+#define PHJ_STRIPE_ELECTING				0
+#define PHJ_STRIPE_RESETTING			1
+#define PHJ_STRIPE_LOADING				2
+#define PHJ_STRIPE_PROBING				3
+#define PHJ_STRIPE_DONE				    4
+#define PHJ_STRIPE_NUMBER(n)            ((n) / 5)
+#define PHJ_STRIPE_PHASE(n)             ((n) % 5)
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
@@ -313,8 +365,6 @@ typedef struct HashJoinTableData
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
 
-	bool		growEnabled;	/* flag to shut off nbatch increases */
-
 	double		totalTuples;	/* # tuples obtained from inner plan */
 	double		partialTuples;	/* # tuples obtained from inner plan by me */
 	double		skewTuples;		/* # tuples inserted into skew tuples */
@@ -329,6 +379,18 @@ typedef struct HashJoinTableData
 	BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
 	BufFile   **outerBatchFile; /* buffered virtual temp file per batch */
 
+	/*
+	 * Adaptive hashjoin variables
+	 */
+	BufFile   **hashloop_fallback;	/* outer match status files if fall back */
+	List	   *fallback_batches_stats; /* per hashjoin batch statistics */
+
+	/*
+	 * current stripe #; 0 during 1st pass, -1 when detached -2 on phantom
+	 * stripe
+	 */
+	int			curstripe;
+
 	/*
 	 * Info about the datatype-specific hash functions for the datatypes being
 	 * hashed. These are arrays of the same length as the number of hash join
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a97562e7a4..e72bd5702a 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "nodes/pg_list.h"
 
 
 typedef struct BufferUsage
@@ -39,6 +40,12 @@ typedef struct WalUsage
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
+typedef struct FallbackBatchStats
+{
+	int			batchno;
+	int			numstripes;
+} FallbackBatchStats;
+
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
 typedef enum InstrumentOption
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 64d2ce693c..f85308738b 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -31,6 +31,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+extern bool ExecHashTableDetachStripe(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5ab..0c0d87d1d3 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tts_tuplenum;	/* a tuple id for use when ctid cannot be used */
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,6 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
+	slot->tts_tuplenum = 0;		/* TODO: should this be done elsewhere? */
 
 	return slot;
 }
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98e0072b8a..015879934c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1957,6 +1957,10 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+	/* Adaptive Hashjoin variables */
+	int			hj_CurNumOuterTuples;	/* number of outer tuples in a batch */
+	unsigned int hj_CurOuterMatchStatus;
+	int			hj_EmitOuterTupleId;
 } HashJoinState;
 
 
@@ -2359,6 +2363,7 @@ typedef struct HashInstrumentation
 	int			nbatch;			/* number of batches at end of execution */
 	int			nbatch_original;	/* planned number of batches */
 	Size		space_peak;		/* peak memory usage in bytes */
+	List	   *fallback_batches_stats; /* per hashjoin batch stats */
 } HashInstrumentation;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..df8010f832 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -855,7 +855,10 @@ typedef enum
 	WAIT_EVENT_EXECUTE_GATHER,
 	WAIT_EVENT_HASH_BATCH_ALLOCATE,
 	WAIT_EVENT_HASH_BATCH_ELECT,
-	WAIT_EVENT_HASH_BATCH_LOAD,
+	WAIT_EVENT_HASH_STRIPE_ELECT,
+	WAIT_EVENT_HASH_STRIPE_RESET,
+	WAIT_EVENT_HASH_STRIPE_LOAD,
+	WAIT_EVENT_HASH_STRIPE_PROBE,
 	WAIT_EVENT_HASH_BUILD_ALLOCATE,
 	WAIT_EVENT_HASH_BUILD_ELECT,
 	WAIT_EVENT_HASH_BUILD_HASH_INNER,
diff --git a/src/include/utils/sharedbits.h b/src/include/utils/sharedbits.h
new file mode 100644
index 0000000000..de43279de8
--- /dev/null
+++ b/src/include/utils/sharedbits.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * sharedbits.h
+ *	  Simple mechanism for sharing bits between backends.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/sharedbits.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHAREDBITS_H
+#define SHAREDBITS_H
+
+#include "storage/sharedfileset.h"
+
+struct SharedBits;
+typedef struct SharedBits SharedBits;
+
+struct SharedBitsParticipant;
+typedef struct SharedBitsParticipant SharedBitsParticipant;
+
+struct SharedBitsAccessor;
+typedef struct SharedBitsAccessor SharedBitsAccessor;
+
+extern SharedBitsAccessor *sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset);
+extern SharedBitsAccessor *sb_initialize(SharedBits *sbits, int participants, int my_participant_number, SharedFileSet *fileset, char *name);
+extern void sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits);
+extern size_t sb_estimate(int participants);
+
+extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bit);
+extern bool sb_checkbit(SharedBitsAccessor *accessor, uint32 n);
+extern BufFile *sb_combine(SharedBitsAccessor *accessor);
+
+extern void sb_end_write(SharedBitsAccessor *sba);
+extern void sb_end_read(SharedBitsAccessor *accessor);
+
+#endif							/* SHAREDBITS_H */
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9754504cc5..99aead8a4a 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,17 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			stripe;		/* stripe number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -49,6 +60,8 @@ extern void sts_reinitialize(SharedTuplestoreAccessor *accessor);
 
 extern void sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor);
 
+extern void sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor);
+
 extern void sts_end_parallel_scan(SharedTuplestoreAccessor *accessor);
 
 extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
@@ -58,4 +71,10 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+extern void sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor);
+
+extern void sts_reset_rewound(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_increment_ntuples(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 3a91c144a2..98a90a85e4 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -443,7 +443,7 @@ $$
 $$);
  original | final 
 ----------+-------
-        1 |     2
+        1 |     4
 (1 row)
 
 rollback to settings;
@@ -478,7 +478,7 @@ $$
 $$);
  original | final 
 ----------+-------
-        1 |     2
+        1 |     4
 (1 row)
 
 rollback to settings;
@@ -1013,3 +1013,944 @@ WHERE
 (1 row)
 
 ROLLBACK;
+-- Serial Adaptive Hash Join
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+ANALYZE probeside, hashside_wide;
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Left Join (actual rows=215 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | btrim | id | hash |                 btrim                  
+------+-------+----+------+----------------------------------------
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    3 |       |  3 |    3 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+      |       |  1 |    1 | unmatched inner tuple in first stripe
+      |       |  1 |    1 | unmatched inner tuple in last stripe
+      |       |  1 |    1 | unmatched inner tuple in middle stripe
+(214 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Right Join (actual rows=214 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash |                 btrim                  
+------+-----------------------+----+------+----------------------------------------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+      |                       |  1 |    1 | unmatched inner tuple in first stripe
+      |                       |  1 |    1 | unmatched inner tuple in last stripe
+      |                       |  1 |    1 | unmatched inner tuple in middle stripe
+(218 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Full Join (actual rows=218 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+/*
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+*/
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Anti Join (actual rows=4 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash |         btrim         
+------+-----------------------
+    1 | unmatched outer tuple
+    2 | 
+    5 | 
+    6 | unmatched outer tuple
+(4 rows)
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | batchno |      btrim      | id | ctid  | hash | btrim 
+------+---------+-----------------+----+-------+------+-------
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (0,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (1,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (2,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (3,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (4,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (5,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (6,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (7,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 |                 |  1 | (8,1) |    0 | 
+    0 |       0 | unmatched outer |    |       |      | 
+(118 rows)
+
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 68c1a8c7b6..1f70300d02 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -538,3 +538,130 @@ WHERE
     AND hjtest_1.a <> hjtest_2.b;
 
 ROLLBACK;
+
+-- Serial Adaptive Hash Join
+
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+
+ANALYZE probeside, hashside_wide;
+
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+
+/*
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+*/
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
-- 
2.20.1

#55

Melanie Plageman

melanieplageman@gmail.com

over 5 years ago

In reply to: Melanie Plageman (#54)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Wed, May 27, 2020 at 7:25 PM Melanie Plageman <melanieplageman@gmail.com>
wrote:

I've attached a rebased patch which includes the "provisionally detach"
deadlock hazard fix approach

Alas, the "provisional detach" logic proved incorrect (see last point in
the list of changes included in the patch at bottom).

Also, we kept the batch 0 spilling patch David Kimura authored [1]
separate so it could be discussed separately because we still had some
questions.

The serial batch 0 spilling is in the attached patch. Parallel batch 0
spilling is still in a separate batch that David Kimura is working on.

I've attached a rebased and updated patch with a few fixes:

- semi-join fallback works now
- serial batch 0 spilling in main patch
- added instrumentation for stripes to the parallel case
- SharedBits uses same SharedFileset as SharedTuplestore
- reverted the optimization to allow workers to re-attach to a batch and
help out with stripes if they are sure they pose no deadlock risk

For the last point, I discovered a pretty glaring problem with this
optimization: I did not include the bitmap created by a worker while
working on its first participating stripe in the final combined bitmap.
I only was combining the last bitmap file each worker worked on.

I had the workers make new bitmaps for each time that they attached to
the batch and participated because having them keep an open file
tracking information for a batch they are no longer attached to on the
chance that they might return and work on that batch was a
synchronization nightmare. It was difficult to figure out when to close
the file if they never returned and hard to make sure that the combining
worker is actually combining all the files from all participants who
were ever active.

I am sure I can hack around those, but I think we need a better solution
overall. After reverting those changes, loading and probing of stripes
after stripe 0 is serial. This is not only sub-optimal, it also means
that all the synchronization variables and code complexity around
coordinating work on fallback batches is practically wasted.
So, they have to be able to collaborate on stripes after the first
stripe. This version of the patch has correct results and no deadlock
hazard, however, it lacks parallelism on stripes after stripe 0.
I am looking for ideas on how to address the deadlock hazard more
efficiently.

The next big TODOs are:
- come up with a better solution to the potential tuple emitting/barrier
waiting deadlock issue
- parallel batch 0 spilling complete

--
Melanie Plageman

Attachments:

v9-0001-Implement-Adaptive-Hashjoin.patchtext/x-patch; charset=US-ASCII; name=v9-0001-Implement-Adaptive-Hashjoin.patchDownload

From c2f1b7f4316cedb2c11f698e60e62d487d4de943 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 8 Jun 2020 17:01:28 -0700
Subject: [PATCH v9] Implement Adaptive Hashjoin

If the inner side tuples of a hashjoin will not fit in memory, the
hashjoin can be executed in multiple batches. If the statistics on the
inner side relation are accurate, planner chooses a multi-batch
strategy and sets the number of batches.
The query executor measures the real size of the hashtable and increases
the number of batches if the hashtable grows too large.

The number of batches is always a power of two, so an increase in the
number of batches doubles it.

Serial hashjoin measures batch size lazily -- waiting until it is
loading a batch to determine if it will fit in memory.

Parallel hashjoin, on the other hand, completes all changes to the
number of batches during the build phase. If it doubles the number of
batches, it dumps all the tuples out, reassigns them to batches,
measures each batch, and checks that it will fit in the space allowed.

In both cases, the executor currently makes a best effort. If a
particular batch won't fit in memory, and, upon changing the number of
batches none of the tuples move to a new batch, the executor disables
growth in the number of batches globally. After growth is disabled, all
batches that would have previously triggered an increase in the number
of batches instead exceed the space allowed.

There is no mechanism to perform a hashjoin within memory constraints if
a run of tuples hash to the same batch. Also, hashjoin will continue to
double the number of batches if *some* tuples move each time -- even if
the batch will never fit in memory -- resulting in an explosion in the
number of batches (affecting performance negatively for multiple
reasons).

Adaptive hashjoin is a mechanism to process a run of inner side tuples
with join keys which hash to the same batch in a manner that is
efficient and respects the space allowed.

When an offending batch causes the number of batches to be doubled and
some percentage of the tuples would not move to a new batch, that batch
can be marked to "fall back". This mechanism replaces serial hashjoin's
"grow_enabled" flag and replaces part of the functionality of parallel
hashjoin's "growth = PHJ_GROWTH_DISABLED" flag. However, instead of
disabling growth in the number of batches for all batches, it only
prevents this batch from causing another increase in the number of
batches.

When the inner side of this batch is loaded into memory, stripes of
arbitrary tuples totaling work_mem in size are loaded into the
hashtable. After probing this stripe, the outer side batch is rewound
and the next stripe is loaded. Each stripe of inner is probed until all
tuples have been processed.

Tuples that match are emitted (depending on the join semantics of the
particular join type) during probing of a stripe. In order to make
left outer join work, unmatched tuples cannot be emitted NULL-extended
until all stripes have been probed. To address this, a bitmap is created
with a bit for each tuple of the outer side. If a tuple on the outer
side matches a tuple from the inner, the corresponding bit is set. At
the end of probing all stripes, the executor scans the bitmap and emits
unmatched outer tuples.

Batch 0 falls back for serial hashjoin but does not yet fall back for
parallel hashjoin. David Kimura is working on a separate patch for this.

TODOs:
- Better solution to deadlock hazard with waiting on a barrier after
  emitting tuples
- Experiment with different fallback threshholds
  (currently hardcoded to 80% but parameterizable)
- Improve stripe instrumentation implementation for serial and parallel
- Assorted TODOs in the code

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
Co-authored-by: David Kimura <dkimura@pivotal.io>
---
 src/backend/commands/explain.c            |   45 +-
 src/backend/executor/nodeHash.c           |  387 +++++-
 src/backend/executor/nodeHashjoin.c       |  787 +++++++++--
 src/backend/postmaster/pgstat.c           |   13 +-
 src/backend/utils/sort/Makefile           |    1 +
 src/backend/utils/sort/sharedbits.c       |  285 ++++
 src/backend/utils/sort/sharedtuplestore.c |  112 +-
 src/include/commands/explain.h            |    1 +
 src/include/executor/hashjoin.h           |   86 +-
 src/include/executor/instrument.h         |    7 +
 src/include/executor/nodeHash.h           |    1 +
 src/include/executor/tuptable.h           |    2 +
 src/include/nodes/execnodes.h             |    5 +
 src/include/pgstat.h                      |    5 +-
 src/include/utils/sharedbits.h            |   39 +
 src/include/utils/sharedtuplestore.h      |   19 +
 src/test/regress/expected/join_hash.out   | 1451 +++++++++++++++++++++
 src/test/regress/sql/join_hash.sql        |  146 +++
 18 files changed, 3220 insertions(+), 172 deletions(-)
 create mode 100644 src/backend/utils/sort/sharedbits.c
 create mode 100644 src/include/utils/sharedbits.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index efd7201d61..26b3664b4b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -184,6 +184,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			es->wal = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "settings") == 0)
 			es->settings = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "usage") == 0)
+			es->usage = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "timing") == 0)
 		{
 			timing_set = true;
@@ -312,6 +314,7 @@ NewExplainState(void)
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
+	es->usage = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
@@ -2999,6 +3002,8 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 											  worker_hi->nbatch_original);
 			hinstrument.space_peak = Max(hinstrument.space_peak,
 										 worker_hi->space_peak);
+			if (!hinstrument.fallback_batches_stats && worker_hi->fallback_batches_stats)
+				hinstrument.fallback_batches_stats = worker_hi->fallback_batches_stats;
 		}
 	}
 
@@ -3022,22 +3027,50 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
 				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)",
 							 hinstrument.nbuckets,
 							 hinstrument.nbuckets_original,
 							 hinstrument.nbatch,
-							 hinstrument.nbatch_original,
-							 spacePeakKb);
+							 hinstrument.nbatch_original);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Batch: %d  Stripes: %d\n", fbs->batchno, fbs->numstripes);
+			}
 		}
 		else
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
-							 hinstrument.nbuckets, hinstrument.nbatch,
-							 spacePeakKb);
+							 "Buckets: %d  Batches: %d",
+							 hinstrument.nbuckets, hinstrument.nbatch);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batch: %d  Stripes: %d\n",
+								 fbs->batchno,
+								 fbs->numstripes);
+			}
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 45b342011f..2fc3de982a 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -80,7 +80,6 @@ static bool ExecParallelHashTuplePrealloc(HashJoinTable hashtable,
 static void ExecParallelHashMergeCounters(HashJoinTable hashtable);
 static void ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable);
 
-
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -183,13 +182,52 @@ MultiExecPrivateHash(HashState *node)
 			}
 			else
 			{
-				/* Not subject to skew optimization, so insert normally */
-				ExecHashTableInsert(hashtable, slot, hashvalue);
+				/*
+				 * Not subject to skew optimization, so either insert normally
+				 * or save to batch file if it belongs to another stripe
+				 */
+				int			bucketno;
+				int			batchno;
+				bool		shouldFree;
+				MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+										  &bucketno, &batchno);
+
+				/*
+				 * If we set batch 0 to fallback on the previous tuple Save
+				 * the tuples in this batch which will not fit in the
+				 * hashtable should I be checking that hashtable->curstripe !=
+				 * 0?
+				 */
+				if (hashtable->hashloop_fallback && hashtable->hashloop_fallback[0])
+					ExecHashJoinSaveTuple(tuple,
+										  hashvalue,
+										  &hashtable->innerBatchFile[batchno]);
+				else
+					ExecHashTableInsert(hashtable, slot, hashvalue);
+
+				if (shouldFree)
+					heap_free_minimal_tuple(tuple);
 			}
 			hashtable->totalTuples += 1;
 		}
 	}
 
+	/*
+	 * If batch 0 fell back, rewind the inner side file where we saved the
+	 * tuples which did not fit in memory to prepare it for loading upon
+	 * finishing probing stripe 0 of batch 0
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[0])
+	{
+		if (BufFileSeek(hashtable->innerBatchFile[0], 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
+	}
+
+
 	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
 	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
 		ExecHashIncreaseNumBuckets(hashtable);
@@ -321,6 +359,40 @@ MultiExecParallelHash(HashState *node)
 				 * skew).
 				 */
 				pstate->growth = PHJ_GROWTH_DISABLED;
+
+				/*
+				 * In the current design, batch 0 cannot fall back. That
+				 * behavior is an artifact of the existing design where batch
+				 * 0 fills the initial hash table and as an optimization it
+				 * doesn't need a batch file. But, there is no real reason
+				 * that batch 0 shouldn't be allowed to spill.
+				 *
+				 * Consider a hash table where majority of tuples with
+				 * hashvalue 0. These tuples will never relocate no matter how
+				 * many batches exist. If you cannot exceed work_mem, then you
+				 * will be stuck infinitely trying to double the number of
+				 * batches in order to accommodate the tuples that can only
+				 * ever be in batch 0. So, we allow it to be set to fall back
+				 * during the build phase to avoid excessive batch increases
+				 * but we don't check it when loading the actual tuples, so we
+				 * may exceed space_allowed. We set it back to false here so
+				 * that it isn't true during any of the checks that may happen
+				 * during probing.
+				 */
+				hashtable->batches[0].shared->hashloop_fallback = false;
+
+				for (i = 0; i < hashtable->nbatch; ++i)
+				{
+					FallbackBatchStats *fallback_batch_stats;
+					ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
+
+					if (!batch->hashloop_fallback)
+						continue;
+					fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+					fallback_batch_stats->batchno = i;
+					fallback_batch_stats->numstripes = batch->maximum_stripe_number + 1;
+					hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
+				}
 			}
 	}
 
@@ -495,12 +567,14 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->curbatch = 0;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
-	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
 	hashtable->partialTuples = 0;
 	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
+	hashtable->hashloop_fallback = NULL;
+	hashtable->fallback_batches_stats = NULL;
+	hashtable->curstripe = STRIPE_DETACHED;
 	hashtable->spaceUsed = 0;
 	hashtable->spacePeak = 0;
 	hashtable->spaceAllowed = space_allowed;
@@ -572,6 +646,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
@@ -866,6 +942,8 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 				BufFileClose(hashtable->innerBatchFile[i]);
 			if (hashtable->outerBatchFile[i])
 				BufFileClose(hashtable->outerBatchFile[i]);
+			if (hashtable->hashloop_fallback[i])
+				BufFileClose(hashtable->hashloop_fallback[i]);
 		}
 	}
 
@@ -876,6 +954,18 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/*
+ * Threshhold for tuple relocation during batch split for parallel and serial
+ * hashjoin.
+ * While growing the number of batches, for the batch which triggered the growth,
+ * if more than MAX_RELOCATION % of its tuples move to its child batch, then
+ * it likely has skewed data and so the child batch (the new home to the skewed
+ * tuples) will be marked as a "fallback" batch and processed using the hashloop
+ * join algorithm. The reverse is true as well: if more than MAX_RELOCATION
+ * remain in the parent, it too should be marked to "fallback".
+ */
+#define MAX_RELOCATION 0.8
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
@@ -886,14 +976,18 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 {
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
+	int			childbatch;
 	int			nbatch;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
 	HashMemoryChunk oldchunks;
+	int			curbatch_outgoing_tuples;
+	int			childbatch_outgoing_tuples;
+	int			target_batch;
+	FallbackBatchStats *fallback_batch_stats;
 
-	/* do nothing if we've decided to shut off growth */
-	if (!hashtable->growEnabled)
+	if (hashtable->hashloop_fallback && hashtable->hashloop_fallback[curbatch])
 		return;
 
 	/* safety check to avoid overflow */
@@ -917,6 +1011,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* time to establish the temp tablespaces, too */
 		PrepareTempTablespaces();
 	}
@@ -927,10 +1023,14 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			repalloc(hashtable->innerBatchFile, nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			repalloc(hashtable->outerBatchFile, nbatch * sizeof(BufFile *));
+		hashtable->hashloop_fallback = (BufFile **)
+			repalloc(hashtable->hashloop_fallback, nbatch * sizeof(BufFile *));
 		MemSet(hashtable->innerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
 		MemSet(hashtable->outerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
+		MemSet(hashtable->hashloop_fallback + oldnbatch, 0,
+			   (nbatch - oldnbatch) * sizeof(BufFile *));
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -942,6 +1042,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * no longer of the current batch.
 	 */
 	ninmemory = nfreed = 0;
+	curbatch_outgoing_tuples = childbatch_outgoing_tuples = 0;
+	childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;
 
 	/* If know we need to resize nbuckets, we can do it while rebatching. */
 	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
@@ -987,8 +1089,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			ninmemory++;
 			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
-
-			if (batchno == curbatch)
+			if (batchno == curbatch && (curbatch != 0 || hashtable->spaceUsed < hashtable->spaceAllowed))
 			{
 				/* keep tuple in memory - copy it into the new chunk */
 				HashJoinTuple copyTuple;
@@ -999,17 +1100,28 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* and add it back to the appropriate bucket */
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
+				curbatch_outgoing_tuples++;
 			}
 			else
 			{
 				/* dump it out */
-				Assert(batchno > curbatch);
+				Assert(batchno >= curbatch);
 				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
 									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
 
 				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
+
+				/*
+				 * TODO: what to do about tuples that don't go to the child
+				 * batch or stay in the current batch? (this is why we are
+				 * counting tuples to child and curbatch with two diff
+				 * variables in case the tuples go to a batch that isn't the
+				 * child)
+				 */
+				if (batchno == childbatch)
+					childbatch_outgoing_tuples++;
 			}
 
 			/* next tuple in this chunk */
@@ -1029,22 +1141,35 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		   hashtable, nfreed, ninmemory, hashtable->spaceUsed);
 #endif
 
+
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
-	 * further expansion of nbatch.  This situation implies that we have
-	 * enough tuples of identical hashvalues to overflow spaceAllowed.
-	 * Increasing nbatch will not fix it since there's no way to subdivide the
-	 * group any more finely. We have to just gut it out and hope the server
-	 * has enough RAM.
+	 * The same batch should not be marked to fall back more than once
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
-	{
-		hashtable->growEnabled = false;
 #ifdef HJDEBUG
-		printf("Hashjoin %p: disabling further increase of nbatch\n",
-			   hashtable);
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("childbatch %i targeted to fallback.", childbatch);
+	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	}
+
+	/*
+	 * If too many tuples remain in the parent or too many tuples migrate to
+	 * the child, there is likely skew and continuing to increase the number
+	 * of batches will not help. Mark the batch which contains the skewed
+	 * tuples to be processed with block nested hashloop join.
+	 */
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
+		target_batch = childbatch;
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
+		target_batch = curbatch;
+	else
+		return;
+	hashtable->hashloop_fallback[target_batch] = BufFileCreateTemp(false);
+
+	fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = target_batch;
+	fallback_batch_stats->numstripes = 0;
+	hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
 }
 
 /*
@@ -1213,7 +1338,6 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 									 WAIT_EVENT_HASH_GROW_BATCHES_DECIDE))
 			{
 				bool		space_exhausted = false;
-				bool		extreme_skew_detected = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
@@ -1224,27 +1348,58 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				{
 					ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
 
+					/*
+					 * All batches were just created anew during
+					 * repartitioning
+					 */
+					Assert(!batch->hashloop_fallback);
+
+					/*
+					 * At the time of repartitioning, each batch updates its
+					 * estimated_size to reflect the size of the batch file on
+					 * disk. It is also updated when increasing preallocated
+					 * space in ExecParallelHashTuplePrealloc().  However,
+					 * batch 0 does not store anything on disk so it has no
+					 * estimated_size.
+					 *
+					 * We still want to allow batch 0 to trigger batch growth.
+					 * In order to do that, for batch 0 check whether the
+					 * actual size exceeds space_allowed. It is a little
+					 * backwards at this point as we would have already
+					 * exceeded inserted the allowed space.
+					 */
 					if (batch->space_exhausted ||
-						batch->estimated_size > pstate->space_allowed)
+						batch->estimated_size > pstate->space_allowed ||
+						batch->size > pstate->space_allowed)
 					{
 						int			parent;
+						float		frac_moved;
 
 						space_exhausted = true;
 
+						parent = i % pstate->old_nbatch;
+						frac_moved = batch->ntuples / (float) hashtable->batches[parent].shared->old_ntuples;
+
 						/*
-						 * Did this batch receive ALL of the tuples from its
-						 * parent batch?  That would indicate that further
-						 * repartitioning isn't going to help (the hash values
-						 * are probably all the same).
+						 * If too many tuples remain in the parent or too many
+						 * tuples migrate to the child, there is likely skew
+						 * and continuing to increase the number of batches
+						 * will not help. Mark the batch which contains the
+						 * skewed tuples to be processed with block nested
+						 * hashloop join.
 						 */
-						parent = i % pstate->old_nbatch;
-						if (batch->ntuples == hashtable->batches[parent].shared->old_ntuples)
-							extreme_skew_detected = true;
+						if (frac_moved >= MAX_RELOCATION)
+						{
+							batch->hashloop_fallback = true;
+							space_exhausted = false;
+						}
 					}
+					if (space_exhausted)
+						break;
 				}
 
-				/* Don't keep growing if it's not helping or we'd overflow. */
-				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+				/* Don't keep growing if we'd overflow. */
+				if (hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -1311,11 +1466,28 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 			{
 				size_t		tuple_size =
 				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+				tupleMetadata metadata;
 
 				/* It belongs in a later batch. */
+				ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+
+				LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+				if (batch->estimated_stripe_size + tuple_size > hashtable->parallel_state->space_allowed)
+				{
+					batch->maximum_stripe_number++;
+					batch->estimated_stripe_size = 0;
+				}
+
+				batch->estimated_stripe_size += tuple_size;
+
+				metadata.hashvalue = hashTuple->hashvalue;
+				metadata.stripe = batch->maximum_stripe_number;
+				LWLockRelease(&batch->lock);
+
 				hashtable->batches[batchno].estimated_size += tuple_size;
-				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+
+				sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 			}
 
 			/* Count this tuple. */
@@ -1363,27 +1535,41 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 	for (i = 1; i < old_nbatch; ++i)
 	{
 		MinimalTuple tuple;
-		uint32		hashvalue;
+		tupleMetadata metadata;
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &metadata.hashvalue)))
 		{
 			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
+			ParallelHashJoinBatch *batch;
 
 			/* Decide which partition it goes to in the new generation. */
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
 
 			hashtable->batches[batchno].estimated_size += tuple_size;
 			++hashtable->batches[batchno].ntuples;
 			++hashtable->batches[i].old_ntuples;
 
+			batch = hashtable->batches[batchno].shared;
+
+			/* Store the tuple its new batch. */
+			LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+			if (batch->estimated_stripe_size + tuple_size > pstate->space_allowed)
+			{
+				batch->maximum_stripe_number++;
+				batch->estimated_stripe_size = 0;
+			}
+			batch->estimated_stripe_size += tuple_size;
+			metadata.stripe = batch->maximum_stripe_number;
+			LWLockRelease(&batch->lock);
 			/* Store the tuple its new batch. */
-			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+			sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 
 			CHECK_FOR_INTERRUPTS();
 		}
@@ -1693,6 +1879,12 @@ retry:
 
 	if (batchno == 0)
 	{
+		/*
+		 * TODO: if spilling is enabled for batch 0 so that it can fall back,
+		 * we will need to stop loading batch 0 into the hashtable somewhere--
+		 * maybe here-- and switch to saving tuples to a file. Currently, this
+		 * will simply exceed the space allowed
+		 */
 		HashJoinTuple hashTuple;
 
 		/* Try to load it into memory. */
@@ -1715,10 +1907,17 @@ retry:
 	else
 	{
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+		ParallelHashJoinBatch *batch;
+		tupleMetadata metadata;
 
 		Assert(batchno > 0);
 
 		/* Try to preallocate space in the batch if necessary. */
+
+		/*
+		 * TODO: is it okay to only count the tuple when it doesn't fit in the
+		 * preallocated memory?
+		 */
 		if (hashtable->batches[batchno].preallocated < tuple_size)
 		{
 			if (!ExecParallelHashTuplePrealloc(hashtable, batchno, tuple_size))
@@ -1727,8 +1926,14 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
-					 tuple);
+		batch = hashtable->batches[batchno].shared;
+
+		metadata.hashvalue = hashvalue;
+		LWLockAcquire(&batch->lock, LW_SHARED);
+		metadata.stripe = batch->maximum_stripe_number;
+		LWLockRelease(&batch->lock);
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
 
@@ -2697,6 +2902,7 @@ ExecHashAccumInstrumentation(HashInstrumentation *instrument,
 									  hashtable->nbatch_original);
 	instrument->space_peak = Max(instrument->space_peak,
 								 hashtable->spacePeak);
+	instrument->fallback_batches_stats = hashtable->fallback_batches_stats;
 }
 
 /*
@@ -2850,6 +3056,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 	/* Check if it's time to grow batches or buckets. */
 	if (pstate->growth != PHJ_GROWTH_DISABLED)
 	{
+		ParallelHashJoinBatchAccessor batch = hashtable->batches[0];
+
 		Assert(curbatch == 0);
 		Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
 
@@ -2858,8 +3066,13 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		 * very large tuples or very low work_mem setting, we'll always allow
 		 * each backend to allocate at least one chunk.
 		 */
-		if (hashtable->batches[0].at_least_one_chunk &&
-			hashtable->batches[0].shared->size +
+
+		/*
+		 * TODO: get rid of this check for batch 0 and make it so that batch 0
+		 * always has to keep trying to increase the number of batches
+		 */
+		if (!batch.shared->hashloop_fallback && batch.at_least_one_chunk &&
+			batch.shared->size +
 			chunk_size > pstate->space_allowed)
 		{
 			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -2891,6 +3104,11 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 
 	/* We are cleared to allocate a new chunk. */
 	chunk_shared = dsa_allocate(hashtable->area, chunk_size);
+
+	/*
+	 * TODO: if batch 0 will have stripes, need to account for this memory
+	 * there
+	 */
 	hashtable->batches[curbatch].shared->size += chunk_size;
 	hashtable->batches[curbatch].at_least_one_chunk = true;
 
@@ -2960,21 +3178,38 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
+
+		shared->hashloop_fallback = false;
+		/* TODO: is it okay to use the same tranche for this lock? */
+		LWLockInitialize(&shared->lock, LWTRANCHE_PARALLEL_HASH_JOIN);
+		shared->maximum_stripe_number = 0;
+		shared->estimated_stripe_size = 0;
 
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
 		 */
 		BarrierInit(&shared->batch_barrier, 0);
+		BarrierInit(&shared->stripe_barrier, 0);
+
+		/* Batch 0 doesn't need to be loaded. */
 		if (i == 0)
 		{
-			/* Batch 0 doesn't need to be loaded. */
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_STRIPING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
+
+			BarrierAttach(&shared->stripe_barrier);
+			while (BarrierPhase(&shared->stripe_barrier) < PHJ_STRIPE_PROBING)
+				BarrierArriveAndWait(&shared->stripe_barrier, 0);
+			BarrierDetach(&shared->stripe_barrier);
 		}
+		/* why isn't done initialized here ? */
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 
 		/* Initialize accessor state.  All members were zero-initialized. */
 		accessor->shared = shared;
@@ -2985,7 +3220,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2995,10 +3230,14 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		/* Use the same SharedFileset for the SharedTupleStore and SharedBits */
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->fileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3047,8 +3286,8 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	 * It's possible for a backend to start up very late so that the whole
 	 * join is finished and the shm state for tracking batches has already
 	 * been freed by ExecHashTableDetach().  In that case we'll just leave
-	 * hashtable->batches as NULL so that ExecParallelHashJoinNewBatch() gives
-	 * up early.
+	 * hashtable->batches as NULL so that ExecParallelHashJoinAdvanceBatch()
+	 * gives up early.
 	 */
 	if (!DsaPointerIsValid(pstate->batches))
 		return;
@@ -3070,10 +3309,11 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
-		accessor->done = false;
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
@@ -3083,6 +3323,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->fileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3165,6 +3406,18 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 	}
 }
 
+bool
+ExecHashTableDetachStripe(HashJoinTable hashtable)
+{
+	int			curbatch = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+
+	BarrierDetach(stripe_barrier);
+	hashtable->curstripe = STRIPE_DETACHED;
+	return false;
+}
+
 /*
  * Detach from all shared resources.  If we are last to detach, clean up.
  */
@@ -3350,13 +3603,35 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 	{
 		/*
 		 * We have determined that this batch would exceed the space budget if
-		 * loaded into memory.  Command all participants to help repartition.
+		 * loaded into memory.
 		 */
-		batch->shared->space_exhausted = true;
-		pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
-		LWLockRelease(&pstate->lock);
-
-		return false;
+		/* TODO: the nested lock is a deadlock waiting to happen. */
+		LWLockAcquire(&batch->shared->lock, LW_EXCLUSIVE);
+		if (!batch->shared->hashloop_fallback)
+		{
+			/*
+			 * This batch is not marked to fall back so command all
+			 * participants to help repartition.
+			 */
+			batch->shared->space_exhausted = true;
+			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
+			LWLockRelease(&batch->shared->lock);
+			LWLockRelease(&pstate->lock);
+			return false;
+		}
+		else if (batch->shared->estimated_stripe_size + want +
+				 HASH_CHUNK_HEADER_SIZE > pstate->space_allowed)
+		{
+			/*
+			 * This batch is marked to fall back and the current (last) stripe
+			 * does not have enough space to handle the request so we must
+			 * increment the number of stripes in the batch and reset the size
+			 * of its new last stripe.
+			 */
+			batch->shared->maximum_stripe_number++;
+			batch->shared->estimated_stripe_size = 0;
+		}
+		LWLockRelease(&batch->shared->lock);
 	}
 
 	batch->at_least_one_chunk = true;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2cdc38a601..cbb0edfed0 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -92,6 +92,27 @@
  * work_mem of all participants to create a large shared hash table.  If that
  * turns out either at planning or execution time to be impossible then we
  * fall back to regular work_mem sized hash tables.
+ * If a given batch causes the number of batches to be doubled and data skew
+ * causes too few or too many tuples to be relocated to the child of this batch,
+ * the batch which is now home to the skewed tuples is marked as a "fallback"
+ * batch. This means that it will be processed using multiple loops --
+ * each loop probing an arbitrary stripe of tuples from this batch
+ * which fit in work_mem or combined work_mem.
+ * This batch is no longer permitted to cause growth in the number of batches.
+ *
+ * When the inner side of a fallback batch is loaded into memory, stripes of
+ * arbitrary tuples totaling work_mem or combined work_mem in size are loaded
+ * into the hashtable. After probing this stripe, the outer side batch is
+ * rewound and the next stripe is loaded. Each stripe of the inner batch is
+ * probed until all tuples from that batch have been processed.
+ *
+ * Tuples that match are emitted (depending on the join semantics of the
+ * particular join type) during probing of the stripe. However, in order to make
+ * left outer join work, unmatched tuples cannot be emitted NULL-extended until
+ * all stripes have been probed. To address this, a bitmap is created with a bit
+ * for each tuple of the outer side. If a tuple on the outer side matches a
+ * tuple from the inner, the corresponding bit is set. At the end of probing all
+ * stripes, the executor scans the bitmap and emits unmatched outer tuples.
  *
  * To avoid deadlocks, we never wait for any barrier unless it is known that
  * all other backends attached to it are actively executing the node or have
@@ -126,7 +147,7 @@
 #define HJ_SCAN_BUCKET			3
 #define HJ_FILL_OUTER_TUPLE		4
 #define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_NEED_NEW_STRIPE      6
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +164,91 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
+static int	ExecHashJoinLoadStripe(HashJoinState *hjstate);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+static bool ExecParallelHashJoinLoadStripe(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool checkbit(HashJoinState *hjstate);
+static void set_match_bit(HashJoinState *hjstate);
+
+static pg_attribute_always_inline bool
+			IsHashloopFallback(HashJoinTable hashtable);
+
+#define UINT_BITS (sizeof(unsigned int) * CHAR_BIT)
+
+static void
+set_match_bit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	BufFile    *statusFile = hashtable->hashloop_fallback[hashtable->curbatch];
+	int			tupindex = hjstate->hj_CurNumOuterTuples - 1;
+	size_t		unit_size = sizeof(hjstate->hj_CurOuterMatchStatus);
+	off_t		offset = tupindex / UINT_BITS * unit_size;
+
+	int			fileno;
+	off_t		cursor;
+
+	BufFileTell(statusFile, &fileno, &cursor);
 
+	/* Extend the statusFile if this is stripe zero. */
+	if (hashtable->curstripe == 0)
+	{
+		for (; cursor < offset + unit_size; cursor += unit_size)
+		{
+			hjstate->hj_CurOuterMatchStatus = 0;
+			BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+		}
+	}
+
+	if (cursor != offset)
+		BufFileSeek(statusFile, 0, offset, SEEK_SET);
+
+	BufFileRead(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+	BufFileSeek(statusFile, 0, -unit_size, SEEK_CUR);
+
+	hjstate->hj_CurOuterMatchStatus |= 1U << tupindex % UINT_BITS;
+	BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+}
+
+/* return true if bit is set and false if not */
+static bool
+checkbit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *outer_match_statuses;
+
+	int			bitno = hjstate->hj_EmitOuterTupleId % UINT_BITS;
+
+	hjstate->hj_EmitOuterTupleId++;
+	outer_match_statuses = hjstate->hj_HashTable->hashloop_fallback[curbatch];
+
+	/*
+	 * if current chunk of bitmap is exhausted, read next chunk of bitmap from
+	 * outer_match_status_file
+	 */
+	if (bitno == 0)
+		BufFileRead(outer_match_statuses, &hjstate->hj_CurOuterMatchStatus,
+					sizeof(hjstate->hj_CurOuterMatchStatus));
+
+	/*
+	 * check if current tuple's match bit is set in outer match status file
+	 */
+	return hjstate->hj_CurOuterMatchStatus & (1U << bitno);
+}
+
+static bool
+IsHashloopFallback(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state)
+		return hashtable->batches[hashtable->curbatch].shared->hashloop_fallback;
+
+	if (!hashtable->hashloop_fallback)
+		return false;
+
+	return hashtable->hashloop_fallback[hashtable->curbatch];
+}
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -290,6 +392,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				hashNode->hashtable = hashtable;
 				(void) MultiExecProcNode((PlanState *) hashNode);
 
+				/*
+				 * After building the hashtable, stripe 0 of batch 0 will have
+				 * been loaded.
+				 */
+				hashtable->curstripe = 0;
+
 				/*
 				 * If the inner relation is completely empty, and we're not
 				 * doing a left outer join, we can quit without scanning the
@@ -333,12 +441,11 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 					/* Each backend should now select a batch to work on. */
 					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
 
-					continue;
+					if (!ExecParallelHashJoinNewBatch(node))
+						return NULL;
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
 				/* FALL THRU */
 
@@ -365,12 +472,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
 					}
 					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
+
+				/*
+				 * Don't reset hj_MatchedOuter after the first stripe as it
+				 * would cancel out whatever we found before
+				 */
+				if (node->hj_HashTable->curstripe == 0)
+					node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -386,9 +499,15 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * The tuple might not belong to the current batch (where
 				 * "current batch" includes the skew buckets if any).
+				 *
+				 * This should only be done once per tuple per batch. If a
+				 * batch "falls back", its inner side will be split into
+				 * stripes. Any displaced outer tuples should only be
+				 * relocated while probing the first stripe of the inner side.
 				 */
 				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO &&
+					node->hj_HashTable->curstripe == 0)
 				{
 					bool		shouldFree;
 					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
@@ -410,6 +529,32 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				if (batchno == 0 && node->hj_HashTable->curstripe == 0 && IsHashloopFallback(hashtable))
+				{
+					bool		shouldFree;
+					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
+																	  &shouldFree);
+
+					/*
+					 * Need to save this outer tuple to a batch since batch 0
+					 * is fallback and we must later rewind.
+					 */
+					Assert(parallel_state == NULL);
+					ExecHashJoinSaveTuple(mintuple, hashvalue,
+										  &hashtable->outerBatchFile[batchno]);
+
+					if (shouldFree)
+						heap_free_minimal_tuple(mintuple);
+				}
+
+
+				/*
+				 * While probing the phantom stripe, don't increment
+				 * hj_CurNumOuterTuples or extend the bitmap
+				 */
+				if (!parallel && hashtable->curstripe != PHANTOM_STRIPE)
+					node->hj_CurNumOuterTuples++;
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -455,6 +600,25 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				{
 					node->hj_MatchedOuter = true;
 
+					if (HJ_FILL_OUTER(node) && IsHashloopFallback(hashtable))
+					{
+						/*
+						 * Each bit corresponds to a single tuple. Setting the
+						 * match bit keeps track of which tuples were matched
+						 * for batches which are using the block nested
+						 * hashloop fallback method. It persists this match
+						 * status across multiple stripes of tuples, each of
+						 * which is loaded into the hashtable and probed. The
+						 * outer match status file is the cumulative match
+						 * status of outer tuples for a given batch across all
+						 * stripes of that inner side batch.
+						 */
+						if (parallel)
+							sb_setbit(hashtable->batches[hashtable->curbatch].sba, econtext->ecxt_outertuple->tts_tuplenum);
+						else
+							set_match_bit(node);
+					}
+
 					if (parallel)
 					{
 						/*
@@ -488,8 +652,17 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					 * continue with next outer tuple.
 					 */
 					if (node->js.single_match)
+					{
 						node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+						/*
+						 * Only consider returning the tuple while on the
+						 * first stripe.
+						 */
+						if (node->hj_HashTable->curstripe != 0)
+							continue;
+					}
+
 					if (otherqual == NULL || ExecQual(otherqual, econtext))
 						return ExecProject(node->js.ps.ps_ProjInfo);
 					else
@@ -508,6 +681,22 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+				if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(node))
+				{
+					if (hashtable->curstripe != PHANTOM_STRIPE)
+						continue;
+
+					if (parallel)
+					{
+						ParallelHashJoinBatchAccessor *accessor =
+						&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+						node->hj_MatchedOuter = sb_checkbit(accessor->sba, econtext->ecxt_outertuple->tts_tuplenum);
+					}
+					else
+						node->hj_MatchedOuter = checkbit(node);
+				}
+
 				if (!node->hj_MatchedOuter &&
 					HJ_FILL_OUTER(node))
 				{
@@ -534,7 +723,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
@@ -550,19 +739,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					InstrCountFiltered2(node, 1);
 				break;
 
-			case HJ_NEED_NEW_BATCH:
+			case HJ_NEED_NEW_STRIPE:
 
 				/*
-				 * Try to advance to next batch.  Done if there are no more.
+				 * Try to advance to next stripe. Then try to advance to the
+				 * next batch if there are no more stripes in this batch. Done
+				 * if there are no more batches.
 				 */
 				if (parallel)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
+					if (!ExecParallelHashJoinLoadStripe(node) &&
+						!ExecParallelHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-aware join */
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
+					if (!ExecHashJoinLoadStripe(node) &&
+						!ExecHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-oblivious join */
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
@@ -751,6 +944,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_JoinState = HJ_BUILD_HASHTABLE;
 	hjstate->hj_MatchedOuter = false;
 	hjstate->hj_OuterNotEmpty = false;
+	hjstate->hj_CurNumOuterTuples = 0;
+	hjstate->hj_CurOuterMatchStatus = 0;
 
 	return hjstate;
 }
@@ -917,15 +1112,24 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	}
 	else if (curbatch < hashtable->nbatch)
 	{
+		tupleMetadata metadata;
 		MinimalTuple tuple;
 
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+									   &metadata);
+		*hashvalue = metadata.hashvalue;
+
 		if (tuple != NULL)
 		{
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+
+			/*
+			 * TODO: should we use tupleid instead of position in the serial
+			 * case too?
+			 */
+			hjstate->hj_OuterTupleSlot->tts_tuplenum = metadata.tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
@@ -949,24 +1153,37 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
+	BufFile    *innerFile = NULL;
+	BufFile    *outerFile = NULL;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
 
-	if (curbatch > 0)
+	/*
+	 * We no longer need the previous outer batch file; close it right away to
+	 * free disk space.
+	 */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
-		/*
-		 * We no longer need the previous outer batch file; close it right
-		 * away to free disk space.
-		 */
-		if (hashtable->outerBatchFile[curbatch])
-			BufFileClose(hashtable->outerBatchFile[curbatch]);
+		BufFileClose(hashtable->outerBatchFile[curbatch]);
 		hashtable->outerBatchFile[curbatch] = NULL;
 	}
-	else						/* we just finished the first batch */
+	if (IsHashloopFallback(hashtable))
+	{
+		BufFileClose(hashtable->hashloop_fallback[curbatch]);
+		hashtable->hashloop_fallback[curbatch] = NULL;
+	}
+
+	/*
+	 * We are surely done with the inner batch file now
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+	{
+		BufFileClose(hashtable->innerBatchFile[curbatch]);
+		hashtable->innerBatchFile[curbatch] = NULL;
+	}
+
+	if (curbatch == 0)			/* we just finished the first batch */
 	{
 		/*
 		 * Reset some of the skew optimization state variables, since we no
@@ -1030,45 +1247,68 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	hashtable->curstripe = STRIPE_DETACHED;
+	hjstate->hj_CurNumOuterTuples = 0;
 
-	/*
-	 * Reload the hash table with the new inner batch (which could be empty)
-	 */
-	ExecHashTableReset(hashtable);
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+		innerFile = hashtable->innerBatchFile[curbatch];
 
-	innerFile = hashtable->innerBatchFile[curbatch];
+	if (innerFile && BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	/* Need to rewind outer when this is the first stripe of a new batch */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
+		outerFile = hashtable->outerBatchFile[curbatch];
+
+	if (outerFile && BufFileSeek(outerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
 
-	if (innerFile != NULL)
+	ExecHashJoinLoadStripe(hjstate);
+	return true;
+}
+
+static inline void
+InstrIncrBatchStripes(List *fallback_batches_stats, int curbatch)
+{
+	ListCell   *lc;
+
+	foreach(lc, fallback_batches_stats)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file: %m")));
+		FallbackBatchStats *fallback_batch_stats = lfirst(lc);
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
-												 &hashvalue,
-												 hjstate->hj_HashTupleSlot)))
+		if (fallback_batch_stats->batchno == curbatch)
 		{
-			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
-			 */
-			ExecHashTableInsert(hashtable, slot, hashvalue);
+			fallback_batch_stats->numstripes++;
+			break;
 		}
-
-		/*
-		 * after we build the hash table, the inner batch file is no longer
-		 * needed
-		 */
-		BufFileClose(innerFile);
-		hashtable->innerBatchFile[curbatch] = NULL;
 	}
+}
+
+/*
+ * Returns false when the inner batch file is exhausted
+ */
+static int
+ExecHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+	bool		loaded_inner = false;
+
+	if (hashtable->curstripe == PHANTOM_STRIPE)
+		return false;
 
 	/*
 	 * Rewind outer batch file (if present), so that we can start reading it.
+	 * TODO: This is only necessary if this is not the first stripe of the
+	 * batch
 	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
 		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
 			ereport(ERROR,
@@ -1076,9 +1316,79 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
 
-	return true;
+	hashtable->curstripe++;
+
+	if (!hashtable->innerBatchFile || !hashtable->innerBatchFile[curbatch])
+		return false;
+
+	/*
+	 * Reload the hash table with the new inner stripe
+	 */
+	ExecHashTableReset(hashtable);
+
+	while ((slot = ExecHashJoinGetSavedTuple(hjstate,
+											 hashtable->innerBatchFile[curbatch],
+											 &hashvalue,
+											 hjstate->hj_HashTupleSlot)))
+	{
+		/*
+		 * NOTE: some tuples may be sent to future batches.  Also, it is
+		 * possible for hashtable->nbatch to be increased here!
+		 */
+		uint32		hashTupleSize;
+
+		/*
+		 * TODO: wouldn't it be cool if this returned the size of the tuple
+		 * inserted
+		 */
+		ExecHashTableInsert(hashtable, slot, hashvalue);
+		loaded_inner = true;
+
+		if (!IsHashloopFallback(hashtable))
+			continue;
+
+		hashTupleSize = slot->tts_ops->get_minimal_tuple(slot)->t_len + HJTUPLE_OVERHEAD;
+
+		if (hashtable->spaceUsed + hashTupleSize +
+			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
+			> hashtable->spaceAllowed)
+			break;
+	}
+
+	/*
+	 * if we didn't load anything and it is a FOJ/LOJ fallback batch, we will
+	 * transition to emit unmatched outer tuples next. we want to know how
+	 * many tuples were in the batch in that case, so don't zero it out then
+	 */
+
+	/*
+	 * if we loaded anything into the hashtable or it is the phantom stripe,
+	 * must proceed to probing
+	 */
+	if (loaded_inner)
+	{
+		hjstate->hj_CurNumOuterTuples = 0;
+		InstrIncrBatchStripes(hashtable->fallback_batches_stats, curbatch);
+		return true;
+	}
+
+	if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(hjstate))
+	{
+		/*
+		 * if we didn't load anything and it is a fallback batch, we will
+		 * prepare to emit outer tuples during the phantom stripe probing
+		 */
+		hashtable->curstripe = PHANTOM_STRIPE;
+		hjstate->hj_EmitOuterTupleId = 0;
+		hjstate->hj_CurOuterMatchStatus = 0;
+		BufFileSeek(hashtable->hashloop_fallback[curbatch], 0, 0, SEEK_SET);
+		BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET);
+		return true;
+	}
+	return false;
 }
 
+
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
  * false if there are no more batches.
@@ -1101,11 +1411,21 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
 	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_STRIPING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_STRIPING
+	 * case). The PHJ_BATCH_STRIPING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
 	 */
+
 	if (hashtable->curbatch >= 0)
 	{
-		hashtable->batches[hashtable->curbatch].done = true;
+		ParallelHashJoinBatchAccessor *batch_accessor = &hashtable->batches[hashtable->curbatch];
+
+		if (IsHashloopFallback(hashtable))
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+		batch_accessor->done = PHJ_BATCH_ACCESSOR_DONE;
 		ExecHashTableDetachBatch(hashtable);
 	}
 
@@ -1119,13 +1439,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		hashtable->nbatch;
 	do
 	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
-		if (!hashtable->batches[batchno].done)
+		if (hashtable->batches[batchno].done != PHJ_BATCH_ACCESSOR_DONE)
 		{
-			SharedTuplestoreAccessor *inner_tuples;
 			Barrier    *batch_barrier =
 			&hashtable->batches[batchno].shared->batch_barrier;
 
@@ -1136,7 +1451,15 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
 											 WAIT_EVENT_HASH_BATCH_ELECT))
+					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
+
+						/*
+						 * one worker needs to 0 out the read_pages of all the
+						 * participants in the new batch
+						 */
+						sts_reinitialize(hashtable->batches[batchno].inner_tuples);
+					}
 					/* Fall through. */
 
 				case PHJ_BATCH_ALLOCATING:
@@ -1145,41 +1468,31 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 										 WAIT_EVENT_HASH_BATCH_ALLOCATE);
 					/* Fall through. */
 
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOAD);
-					/* Fall through. */
+				case PHJ_BATCH_STRIPING:
 
-				case PHJ_BATCH_PROBING:
+					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
+					sts_begin_parallel_scan(hashtable->batches[batchno].inner_tuples);
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_initialize_accessor(hashtable->batches[hashtable->curbatch].sba,
+											   sts_get_tuplenum(hashtable->batches[hashtable->curbatch].outer_tuples));
+					hashtable->curstripe = STRIPE_DETACHED;
+					if (ExecParallelHashJoinLoadStripe(hjstate))
+						return true;
 
 					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
+					 * ExecParallelHashJoinLoadStripe() will return false from
+					 * here when no more work can be done by this worker on
+					 * this batch. Until further optimized, this worker will
+					 * have detached from the stripe_barrier and should close
+					 * its outer match statuses bitmap and then detach from
+					 * the batch. In order to reuse the code below, fall
+					 * through, even though the phase will not have been
+					 * advanced
 					 */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
-					return true;
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_end_write(hashtable->batches[batchno].sba);
+
+					/* Fall through. */
 
 				case PHJ_BATCH_DONE:
 
@@ -1188,7 +1501,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * remain).
 					 */
 					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
+					hashtable->batches[batchno].done = PHJ_BATCH_ACCESSOR_DONE;
 					hashtable->curbatch = -1;
 					break;
 
@@ -1203,6 +1516,274 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	return false;
 }
 
+
+
+/*
+ * Returns true if ready to probe and false if the inner is exhausted
+ * (there are no more stripes)
+ */
+bool
+ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			batchno = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+	ParallelHashJoinBatchAccessor *accessor;
+	dsa_pointer_atomic *buckets;
+
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	if (hashtable->curstripe >= 0)
+	{
+		/*
+		 * If a worker is already attached to a stripe, wait until all
+		 * participants have finished probing and detach. The last worker,
+		 * however, can re-attach to the stripe_barrier and proceed to load
+		 * and probe the other stripes
+		 */
+		/*
+		 * After finishing with participating in a stripe, if a worker is the
+		 * only one working on a batch, it will continue working on it.
+		 * However, if a worker is not the only worker working on a batch, it
+		 * would risk deadlock if it waits on the barrier. Instead, it will
+		 * detach from the stripe, and, eventually the batch.
+		 *
+		 * This means all stripes after the first stripe will be executed
+		 * serially. TODO: allow workers to provisionally detach from the
+		 * batch and reattach later if there is still work to be done. I had a
+		 * patch that did this. Workers who were not the last worker saved the
+		 * state of the stripe barrier upon detaching and then mark the batch
+		 * as "provisionally" done (not done). Later, when the worker comes
+		 * back to the batch in the batch phase machine, if the batch is not
+		 * complete and the phase has advanced since the worker was last
+		 * participating, then the worker can join back in. This had problems.
+		 * There were synchronization issues with workers having multiple
+		 * outer match status bitmap files open at the same time, so, I had
+		 * workers close their bitmap and make a new one the next time they
+		 * joined in. This didn't work with the current code because the
+		 * original outer match status bitmap file that the worker had created
+		 * while probing stripe 1 did not get combined into the combined
+		 * bitmap This could be specifically fixed, but I think it is better
+		 * to address the lack of parallel execution for stripes after stripe
+		 * 0 more holistically.
+		 */
+		if (!BarrierArriveAndDetach(stripe_barrier))
+		{
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+			hashtable->curstripe = STRIPE_DETACHED;
+			return false;
+		}
+
+		/*
+		 * This isn't a race condition if no other workers can stay attached
+		 * to this barrier in the intervening time. Basically, if you attach
+		 * to a stripe barrier in the PHJ_STRIPE_DONE phase, detach
+		 * immediately and move on.
+		 */
+		BarrierAttach(stripe_barrier);
+	}
+	else if (hashtable->curstripe == STRIPE_DETACHED)
+	{
+		int			phase = BarrierAttach(stripe_barrier);
+
+		/*
+		 * If a worker enters this phase machine on a stripe number greater
+		 * than the batch's maximum stripe number, then: 1) The batch is done,
+		 * or 2) The batch is on the phantom stripe that's used for hashloop
+		 * fallback Either way the worker can't contribute so just detach and
+		 * move on.
+		 */
+
+		if (PHJ_STRIPE_NUMBER(phase) > batch->maximum_stripe_number ||
+			PHJ_STRIPE_PHASE(phase) == PHJ_STRIPE_DONE)
+			return ExecHashTableDetachStripe(hashtable);
+	}
+	else if (hashtable->curstripe == PHANTOM_STRIPE)
+	{
+		sts_end_parallel_scan(outer_tuples);
+
+		/*
+		 * TODO: ideally this would go somewhere in the batch phase machine
+		 * Putting it in ExecHashTableDetachBatch didn't do the trick
+		 */
+		sb_end_read(hashtable->batches[batchno].sba);
+		return ExecHashTableDetachStripe(hashtable);
+	}
+
+	hashtable->curstripe = PHJ_STRIPE_NUMBER(BarrierPhase(stripe_barrier));
+
+	/*
+	 * The outer side is exhausted and either 1) the current stripe of the
+	 * inner side is exhausted and it is time to advance the stripe 2) the
+	 * last stripe of the inner side is exhausted and it is time to advance
+	 * the batch
+	 */
+	for (;;)
+	{
+		int			phase = BarrierPhase(stripe_barrier);
+
+		switch (PHJ_STRIPE_PHASE(phase))
+		{
+			case PHJ_STRIPE_ELECTING:
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_ELECT))
+				{
+					sts_reinitialize(outer_tuples);
+
+					/*
+					 * set the rewound flag back to false to prepare for the
+					 * next stripe
+					 */
+					sts_reset_rewound(inner_tuples);
+				}
+
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_RESETTING:
+				/* TODO: not needed for phantom stripe */
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_RESET);
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_LOADING:
+				{
+					MinimalTuple tuple;
+					tupleMetadata metadata;
+
+					/*
+					 * Start (or join in) loading the next stripe of inner
+					 * tuples.
+					 */
+
+					/*
+					 * I'm afraid there potential issue if a worker joins in
+					 * this phase and doesn't do the actions and resetting of
+					 * variables in sts_resume_parallel_scan. that is, if it
+					 * doesn't reset start_page and read_next_page in between
+					 * stripes. For now, call it. However, I think it might be
+					 * able to be removed.
+					 */
+
+					/*
+					 * TODO: sts_resume_parallel_scan() is overkill for stripe
+					 * 0 of each batch
+					 */
+					sts_resume_parallel_scan(inner_tuples);
+
+					while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+					{
+						/* The tuple is from a previous stripe. Skip it */
+						if (metadata.stripe < PHJ_STRIPE_NUMBER(phase))
+							continue;
+
+						/*
+						 * tuple from future. time to back out read_page. end
+						 * of stripe
+						 */
+						if (metadata.stripe > PHJ_STRIPE_NUMBER(phase))
+						{
+							sts_parallel_scan_rewind(inner_tuples);
+							continue;
+						}
+
+						ExecForceStoreMinimalTuple(tuple, hjstate->hj_HashTupleSlot, false);
+						ExecParallelHashTableInsertCurrentBatch(
+																hashtable,
+																hjstate->hj_HashTupleSlot,
+																metadata.hashvalue);
+					}
+					BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD);
+				}
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_PROBING:
+
+				/*
+				 * do this again here in case a worker began the scan and then
+				 * entered after loading before probing
+				 */
+				sts_end_parallel_scan(inner_tuples);
+				sts_begin_parallel_scan(outer_tuples);
+				return true;
+
+			case PHJ_STRIPE_DONE:
+
+				if (PHJ_STRIPE_NUMBER(phase) >= batch->maximum_stripe_number)
+				{
+					/*
+					 * Handle the phantom stripe case.
+					 */
+					if (batch->hashloop_fallback && HJ_FILL_OUTER(hjstate))
+						goto fallback_stripe;
+
+					/* Return if this is the last stripe */
+					return ExecHashTableDetachStripe(hashtable);
+				}
+
+				/* this, effectively, increments the stripe number */
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+				{
+					/*
+					 * reset inner's hashtable and recycle the existing bucket
+					 * array.
+					 */
+					buckets = (dsa_pointer_atomic *)
+						dsa_get_address(hashtable->area, batch->buckets);
+
+					for (size_t i = 0; i < hashtable->nbuckets; ++i)
+						dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+				}
+
+				hashtable->curstripe++;
+				continue;
+
+			default:
+				elog(ERROR, "unexpected stripe phase %d. pid %i. batch %i.", BarrierPhase(stripe_barrier), MyProcPid, batchno);
+		}
+	}
+
+fallback_stripe:
+	accessor = &hashtable->batches[hashtable->curbatch];
+	sb_end_write(accessor->sba);
+
+	/* Ensure that only a single worker is attached to the barrier */
+	if (!BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+		return ExecHashTableDetachStripe(hashtable);
+
+
+	/* No one except the last worker will run this code */
+	hashtable->curstripe = PHANTOM_STRIPE;
+
+	/*
+	 * reset inner's hashtable and recycle the existing bucket array.
+	 */
+	buckets = (dsa_pointer_atomic *)
+		dsa_get_address(hashtable->area, batch->buckets);
+
+	for (size_t i = 0; i < hashtable->nbuckets; ++i)
+		dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+
+	/*
+	 * If all workers (including this one) have finished probing the batch,
+	 * one worker is elected to Loop through the outer match status files from
+	 * all workers that were attached to this batch Combine them into one
+	 * bitmap Use the bitmap, loop through the outer batch file again, and
+	 * emit unmatched tuples All workers will detach from the batch barrier
+	 * and the last worker will clean up the hashtable. All workers except the
+	 * last worker will end their scans of the outer and inner side. The last
+	 * worker will end its scan of the inner side
+	 */
+
+	sb_combine(accessor->sba);
+	sts_reinitialize(outer_tuples);
+
+	sts_begin_parallel_scan(outer_tuples);
+
+	return true;
+}
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1372,6 +1953,9 @@ ExecReScanHashJoin(HashJoinState *node)
 	node->hj_MatchedOuter = false;
 	node->hj_FirstOuterTupleSlot = NULL;
 
+	node->hj_CurNumOuterTuples = 0;
+	node->hj_CurOuterMatchStatus = 0;
+
 	/*
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
@@ -1402,7 +1986,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	TupleTableSlot *slot;
-	uint32		hashvalue;
 	int			i;
 
 	Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1410,6 +1993,8 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
+
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1418,17 +2003,23 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 								 hjstate->hj_OuterHashKeys,
 								 true,	/* outer tuple */
 								 HJ_FILL_OUTER(hjstate),
-								 &hashvalue))
+								 &metadata.hashvalue))
 		{
 			int			batchno;
 			int			bucketno;
 			bool		shouldFree;
+			SharedTuplestoreAccessor *accessor;
+
 			MinimalTuple mintup = ExecFetchSlotMinimalTuple(slot, &shouldFree);
 
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
-			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+			accessor = hashtable->batches[batchno].outer_tuples;
+
+			/* cannot count on deterministic order of tupleids */
+			metadata.tupleid = sts_increment_ntuples(accessor);
+
+			sts_puttuple(hashtable->batches[batchno].outer_tuples, &metadata.hashvalue, mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 309378ae54..7972c89060 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3783,8 +3783,17 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_ELECT:
 			event_name = "HashBatchElect";
 			break;
-		case WAIT_EVENT_HASH_BATCH_LOAD:
-			event_name = "HashBatchLoad";
+		case WAIT_EVENT_HASH_STRIPE_ELECT:
+			event_name = "HashStripeElect";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_RESET:
+			event_name = "HashStripeReset";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_LOAD:
+			event_name = "HashStripeLoad";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_PROBE:
+			event_name = "HashStripeProbe";
 			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATE:
 			event_name = "HashBuildAllocate";
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261..f11fe85aeb 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	logtape.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedbits.c b/src/backend/utils/sort/sharedbits.c
new file mode 100644
index 0000000000..f93f900d16
--- /dev/null
+++ b/src/backend/utils/sort/sharedbits.c
@@ -0,0 +1,285 @@
+#include "postgres.h"
+#include "storage/buffile.h"
+#include "utils/sharedbits.h"
+
+/*
+ * TODO: put a comment about not currently supporting parallel scan of the SharedBits
+ * To support parallel scan, need to introduce many more mechanisms
+ */
+
+/* Per-participant shared state */
+struct SharedBitsParticipant
+{
+	bool		present;
+	bool		writing;
+};
+
+/* Shared control object */
+struct SharedBits
+{
+	int			nparticipants;	/* Number of participants that can write. */
+	int64		nbits;
+	char		name[NAMEDATALEN];	/* A name for this bitstore. */
+
+	SharedBitsParticipant participants[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* backend-local state */
+struct SharedBitsAccessor
+{
+	int			participant;
+	SharedBits *bits;
+	SharedFileSet *fileset;
+	BufFile    *write_file;
+	BufFile    *combined;
+};
+
+SharedBitsAccessor *
+sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset)
+{
+	SharedBitsAccessor *accessor = palloc0(sizeof(SharedBitsAccessor));
+
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+SharedBitsAccessor *
+sb_initialize(SharedBits *sbits,
+			  int participants,
+			  int my_participant_number,
+			  SharedFileSet *fileset,
+			  char *name)
+{
+	SharedBitsAccessor *accessor;
+
+	sbits->nparticipants = participants;
+	strcpy(sbits->name, name);
+	sbits->nbits = 0;			/* TODO: maybe delete this */
+
+	accessor = palloc0(sizeof(SharedBitsAccessor));
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+/*  TODO: is "initialize_accessor" a clear enough API for this? (making the file)? */
+void
+sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits)
+{
+	char		name[MAXPGPATH];
+	uint32		num_to_write;
+
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, accessor->participant);
+
+	accessor->write_file =
+		BufFileCreateShared(accessor->fileset, name);
+
+	accessor->bits->participants[accessor->participant].present = true;
+	/* TODO: check this math. tuplenumber will be too high? */
+	num_to_write = nbits / 8 + 1;
+
+	/*
+	 * TODO: add tests that could exercise a problem with junk being written
+	 * to bitmap
+	 */
+
+	/*
+	 * TODO: is there a better way to write the bytes to the file without
+	 * calling BufFileWrite() like this? palloc()ing an undetermined number of
+	 * bytes feels like it is against the spirit of this patch to begin with,
+	 * but the many function calls seem expensive
+	 */
+	for (int i = 0; i < num_to_write; i++)
+	{
+		unsigned char byteToWrite = 0;
+
+		BufFileWrite(accessor->write_file, &byteToWrite, 1);
+	}
+
+	if (BufFileSeek(accessor->write_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+size_t
+sb_estimate(int participants)
+{
+	return offsetof(SharedBits, participants) + participants * sizeof(SharedBitsParticipant);
+}
+
+
+void
+sb_setbit(SharedBitsAccessor *accessor, uint64 bit)
+{
+	SharedBitsParticipant *const participant =
+	&accessor->bits->participants[accessor->participant];
+
+	/* TODO: use an unsigned int instead of a byte */
+	unsigned char current_outer_byte;
+
+	Assert(accessor->write_file);
+
+	if (!participant->writing)
+	{
+		participant->writing = true;
+	}
+
+	BufFileSeek(accessor->write_file, 0, bit / 8, SEEK_SET);
+	BufFileRead(accessor->write_file, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (bit % 8);
+
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_CUR);
+	BufFileWrite(accessor->write_file, &current_outer_byte, 1);
+}
+
+bool
+sb_checkbit(SharedBitsAccessor *accessor, uint32 n)
+{
+	bool		match;
+	uint32		bytenum = n / 8;
+	unsigned char bit = n % 8;
+	unsigned char byte_to_check = 0;
+
+	Assert(accessor->combined);
+
+	/* seek to byte to check */
+	if (BufFileSeek(accessor->combined,
+					0,
+					bytenum,
+					SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not rewind shared outer temporary file: %m")));
+	/* read byte containing ntuple bit */
+	if (BufFileRead(accessor->combined, &byte_to_check, 1) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	/* if bit is set */
+	match = ((byte_to_check) >> bit) & 1;
+
+	return match;
+}
+
+BufFile *
+sb_combine(SharedBitsAccessor *accessor)
+{
+	/*
+	 * TODO: this tries to close an outer match status file for each
+	 * participant in the tuplestore. technically, only participants in the
+	 * barrier could have outer match status files, however, all but one
+	 * participant continue on and detach from the barrier so we won't have a
+	 * reliable way to close only files for those attached to the barrier
+	 */
+	BufFile   **statuses;
+	BufFile    *combined_bitmap_file;
+	int			statuses_length;
+
+	int			nbparticipants = 0;
+
+	for (int l = 0; l < accessor->bits->nparticipants; l++)
+	{
+		SharedBitsParticipant participant = accessor->bits->participants[l];
+
+		if (participant.present)
+		{
+			Assert(!participant.writing);
+			nbparticipants++;
+		}
+	}
+	statuses = palloc(sizeof(BufFile *) * nbparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	statuses_length = 0;
+
+	for (int i = 0; i < accessor->bits->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+		BufFile    *file;
+
+		/* TODO: make a function that will do this */
+		snprintf(bitmap_filename, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, i);
+
+		if (!accessor->bits->participants[i].present)
+			continue;
+		file = BufFileOpenShared(accessor->fileset, bitmap_filename);
+		/* TODO: can we be sure that this file is at beginning? */
+		Assert(file);
+
+		statuses[statuses_length++] = file;
+	}
+
+	combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)	/* make it while not EOF */
+	{
+		/*
+		 * TODO: make this use an unsigned int instead of a byte so it isn't
+		 * so slow
+		 */
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	accessor->combined = combined_bitmap_file;
+	return combined_bitmap_file;
+}
+
+void
+sb_end_write(SharedBitsAccessor *sba)
+{
+	SharedBitsParticipant
+			   *const participant = &sba->bits->participants[sba->participant];
+
+	participant->writing = false;
+
+	/*
+	 * TODO: this should not be needed if flow is correct. need to fix that
+	 * and get rid of this check
+	 */
+	if (sba->write_file)
+		BufFileClose(sba->write_file);
+	sba->write_file = NULL;
+}
+
+void
+sb_end_read(SharedBitsAccessor *accessor)
+{
+	if (accessor->combined == NULL)
+		return;
+
+	BufFileClose(accessor->combined);
+	accessor->combined = NULL;
+}
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index c3ab494a45..0e3b3de2b6 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -52,6 +52,7 @@ typedef struct SharedTuplestoreParticipant
 {
 	LWLock		lock;
 	BlockNumber read_page;		/* Page number for next read. */
+	bool		rewound;
 	BlockNumber npages;			/* Number of pages written. */
 	bool		writing;		/* Used only for assertions. */
 } SharedTuplestoreParticipant;
@@ -60,6 +61,7 @@ typedef struct SharedTuplestoreParticipant
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;	/* Number of tuples in this tuplestore. */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -85,6 +87,8 @@ struct SharedTuplestoreAccessor
 	char	   *read_buffer;	/* A buffer for loading tuples. */
 	size_t		read_buffer_size;
 	BlockNumber read_next_page; /* Lowest block we'll consider reading. */
+	BlockNumber start_page;		/* page to reset p->read_page to if back out
+								 * required */
 
 	/* State for writing. */
 	SharedTuplestoreChunk *write_chunk; /* Buffer for writing. */
@@ -137,6 +141,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -158,6 +163,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 		LWLockInitialize(&sts->participants[i].lock,
 						 LWTRANCHE_SHARED_TUPLESTORE);
 		sts->participants[i].read_page = 0;
+		sts->participants[i].rewound = false;
 		sts->participants[i].writing = false;
 	}
 
@@ -277,6 +283,45 @@ sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor)
 	accessor->read_participant = accessor->participant;
 	accessor->read_file = NULL;
 	accessor->read_next_page = 0;
+	accessor->start_page = 0;
+}
+
+void
+sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor)
+{
+	int			i PG_USED_FOR_ASSERTS_ONLY;
+	SharedTuplestoreParticipant *p;
+
+	/* End any existing scan that was in progress. */
+	sts_end_parallel_scan(accessor);
+
+	/*
+	 * Any backend that might have written into this shared tuplestore must
+	 * have called sts_end_write(), so that all buffers are flushed and the
+	 * files have stopped growing.
+	 */
+	for (i = 0; i < accessor->sts->nparticipants; ++i)
+		Assert(!accessor->sts->participants[i].writing);
+
+	/*
+	 * We will start out reading the file that THIS backend wrote.  There may
+	 * be some caching locality advantage to that.
+	 */
+
+	/*
+	 * TODO: does this still apply in the multi-stripe case? It seems like if
+	 * a participant file is exhausted for the current stripe it might be
+	 * better to remember that
+	 */
+	accessor->read_participant = accessor->participant;
+	accessor->read_file = NULL;
+	p = &accessor->sts->participants[accessor->read_participant];
+
+	/* TODO: find a better solution than this for resuming the parallel scan */
+	LWLockAcquire(&p->lock, LW_SHARED);
+	accessor->start_page = p->read_page;
+	LWLockRelease(&p->lock);
+	accessor->read_next_page = 0;
 }
 
 /*
@@ -295,6 +340,7 @@ sts_end_parallel_scan(SharedTuplestoreAccessor *accessor)
 		BufFileClose(accessor->read_file);
 		accessor->read_file = NULL;
 	}
+	accessor->start_page = 0;
 }
 
 /*
@@ -531,7 +577,13 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	for (;;)
 	{
 		/* Can we read more tuples from the current chunk? */
-		if (accessor->read_ntuples < accessor->read_ntuples_available)
+		/*
+		 * Added a check for accessor->read_file being present here, as it
+		 * became relevant for adaptive hashjoin. Not sure if this has other
+		 * consequences for correctness
+		 */
+
+		if (accessor->read_ntuples < accessor->read_ntuples_available && accessor->read_file)
 			return sts_read_tuple(accessor, meta_data);
 
 		/* Find the location of a new chunk to read. */
@@ -541,7 +593,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 		/* We can skip directly past overflow pages we know about. */
 		if (p->read_page < accessor->read_next_page)
 			p->read_page = accessor->read_next_page;
-		eof = p->read_page >= p->npages;
+		eof = p->read_page >= p->npages || p->rewound;
 		if (!eof)
 		{
 			/* Claim the next chunk. */
@@ -549,9 +601,22 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			/* Advance the read head for the next reader. */
 			p->read_page += STS_CHUNK_PAGES;
 			accessor->read_next_page = p->read_page;
+
+			/*
+			 * initialize start_page to the read_page this participant will
+			 * start reading from
+			 */
+			accessor->start_page = read_page;
 		}
 		LWLockRelease(&p->lock);
 
+		if (!eof)
+		{
+			char		name[MAXPGPATH];
+
+			sts_filename(name, accessor, accessor->read_participant);
+		}
+
 		if (!eof)
 		{
 			SharedTuplestoreChunk chunk_header;
@@ -613,6 +678,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			if (accessor->read_participant == accessor->participant)
 				break;
 			accessor->read_next_page = 0;
+			accessor->start_page = 0;
 
 			/* Go around again, so we can get a chunk from this file. */
 		}
@@ -621,6 +687,48 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+void
+sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor)
+{
+	SharedTuplestoreParticipant *p =
+	&accessor->sts->participants[accessor->read_participant];
+
+	/*
+	 * Only set the read_page back to the start of the sts_chunk this worker
+	 * was reading if some other worker has not already done so. It could be
+	 * the case that this worker saw a tuple from a future stripe and another
+	 * worker did also in its sts_chunk and it already set read_page to its
+	 * start_page If so, we want to set read_page to the lowest value to
+	 * ensure that we read all tuples from the stripe (don't miss tuples)
+	 */
+	LWLockAcquire(&p->lock, LW_EXCLUSIVE);
+	p->read_page = Min(p->read_page, accessor->start_page);
+	p->rewound = true;
+	LWLockRelease(&p->lock);
+
+	accessor->read_ntuples_available = 0;
+	accessor->read_next_page = 0;
+}
+
+void
+sts_reset_rewound(SharedTuplestoreAccessor *accessor)
+{
+	for (int i = 0; i < accessor->sts->nparticipants; ++i)
+		accessor->sts->participants[i].rewound = false;
+}
+
+uint32
+sts_increment_ntuples(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ba661d32a6..0ba9d856c8 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -46,6 +46,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		settings;		/* print modified settings */
+	bool		usage;			/* print memory usage */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
 	int			indent;			/* current indentation level */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 79b634e8ed..dc22525666 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -142,6 +143,17 @@ typedef struct HashMemoryChunkData *HashMemoryChunk;
 /* tuples exceeding HASH_CHUNK_THRESHOLD bytes are put in their own chunk */
 #define HASH_CHUNK_THRESHOLD	(HASH_CHUNK_SIZE / 4)
 
+/*
+ * HashJoinTableData->curstripe the current stripe number
+ * The phantom stripe refers to the state of the inner side hashtable (empty)
+ * during the final scan of the outer batch file for a batch being processed
+ * using the hashloop fallback algorithm.
+ * In parallel-aware hash join, curstripe is in a detached state
+ * when the worker is not attached to the stripe_barrier.
+ */
+#define PHANTOM_STRIPE -2
+#define STRIPE_DETACHED -1
+
 /*
  * For each batch of a Parallel Hash Join, we have a ParallelHashJoinBatch
  * object in shared memory to coordinate access to it.  Since they are
@@ -152,6 +164,7 @@ typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
+	Barrier		stripe_barrier; /* synchronization for stripes */
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
@@ -160,6 +173,17 @@ typedef struct ParallelHashJoinBatch
 	size_t		old_ntuples;	/* number of tuples before repartitioning */
 	bool		space_exhausted;
 
+	/* Adaptive HashJoin */
+
+	/*
+	 * after finishing build phase, hashloop_fallback cannot change, and does
+	 * not require a lock to read
+	 */
+	bool		hashloop_fallback;
+	int			maximum_stripe_number;	/* the number of stripes in the batch */
+	size_t		estimated_stripe_size;	/* size of last stripe in batch */
+	LWLock		lock;
+
 	/*
 	 * Variable-sized SharedTuplestore objects follow this struct in memory.
 	 * See the accessor macros below.
@@ -177,10 +201,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -204,9 +235,19 @@ typedef struct ParallelHashJoinBatchAccessor
 	size_t		old_ntuples;	/* how many tuples before repartitioning? */
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
-	bool		done;			/* flag to remember that a batch is done */
+	int			done;			/* flag to remember that a batch is done */
+	/* -1 for not done, 0 for tentatively done, 1 for done */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
+
+	/*
+	 * All participants except the last worker working on a batch which has
+	 * fallen back to hashloop processing save the stripe barrier phase and
+	 * detach to avoid the deadlock hazard of waiting on a barrier after
+	 * tuples have been emitted.
+	 */
+	int			last_participating_stripe_phase;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -227,6 +268,18 @@ typedef enum ParallelHashGrowth
 	PHJ_GROWTH_DISABLED
 } ParallelHashGrowth;
 
+typedef enum ParallelHashJoinBatchAccessorStatus
+{
+	/* No more useful work can be done on this batch by this worker */
+	PHJ_BATCH_ACCESSOR_DONE,
+
+	/*
+	 * The worker has not yet checked this batch to see if it can do useful
+	 * work
+	 */
+	PHJ_BATCH_ACCESSOR_NOT_DONE
+}			ParallelHashJoinBatchAccessorStatus;
+
 /*
  * The shared state used to coordinate a Parallel Hash Join.  This is stored
  * in the DSM segment.
@@ -263,9 +316,18 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
-#define PHJ_BATCH_DONE					4
+#define PHJ_BATCH_STRIPING				2
+#define PHJ_BATCH_DONE					3
+
+/* The phases for probing each stripe of each batch used with stripe barriers */
+#define PHJ_STRIPE_INVALID_PHASE        -1
+#define PHJ_STRIPE_ELECTING				0
+#define PHJ_STRIPE_RESETTING			1
+#define PHJ_STRIPE_LOADING				2
+#define PHJ_STRIPE_PROBING				3
+#define PHJ_STRIPE_DONE				    4
+#define PHJ_STRIPE_NUMBER(n)            ((n) / 5)
+#define PHJ_STRIPE_PHASE(n)             ((n) % 5)
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
@@ -313,8 +375,6 @@ typedef struct HashJoinTableData
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
 
-	bool		growEnabled;	/* flag to shut off nbatch increases */
-
 	double		totalTuples;	/* # tuples obtained from inner plan */
 	double		partialTuples;	/* # tuples obtained from inner plan by me */
 	double		skewTuples;		/* # tuples inserted into skew tuples */
@@ -329,6 +389,18 @@ typedef struct HashJoinTableData
 	BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
 	BufFile   **outerBatchFile; /* buffered virtual temp file per batch */
 
+	/*
+	 * Adaptive hashjoin variables
+	 */
+	BufFile   **hashloop_fallback;	/* outer match status files if fall back */
+	List	   *fallback_batches_stats; /* per hashjoin batch statistics */
+
+	/*
+	 * current stripe #; 0 during 1st pass, -1 (macro STRIPE_DETACHED) when
+	 * detached, -2 on phantom stripe (macro PHANTOM_STRIPE)
+	 */
+	int			curstripe;
+
 	/*
 	 * Info about the datatype-specific hash functions for the datatypes being
 	 * hashed. These are arrays of the same length as the number of hash join
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a97562e7a4..e72bd5702a 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "nodes/pg_list.h"
 
 
 typedef struct BufferUsage
@@ -39,6 +40,12 @@ typedef struct WalUsage
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
+typedef struct FallbackBatchStats
+{
+	int			batchno;
+	int			numstripes;
+} FallbackBatchStats;
+
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
 typedef enum InstrumentOption
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 64d2ce693c..f85308738b 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -31,6 +31,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+extern bool ExecHashTableDetachStripe(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5ab..0c0d87d1d3 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tts_tuplenum;	/* a tuple id for use when ctid cannot be used */
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,6 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
+	slot->tts_tuplenum = 0;		/* TODO: should this be done elsewhere? */
 
 	return slot;
 }
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98e0072b8a..015879934c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1957,6 +1957,10 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+	/* Adaptive Hashjoin variables */
+	int			hj_CurNumOuterTuples;	/* number of outer tuples in a batch */
+	unsigned int hj_CurOuterMatchStatus;
+	int			hj_EmitOuterTupleId;
 } HashJoinState;
 
 
@@ -2359,6 +2363,7 @@ typedef struct HashInstrumentation
 	int			nbatch;			/* number of batches at end of execution */
 	int			nbatch_original;	/* planned number of batches */
 	Size		space_peak;		/* peak memory usage in bytes */
+	List	   *fallback_batches_stats; /* per hashjoin batch stats */
 } HashInstrumentation;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..df8010f832 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -855,7 +855,10 @@ typedef enum
 	WAIT_EVENT_EXECUTE_GATHER,
 	WAIT_EVENT_HASH_BATCH_ALLOCATE,
 	WAIT_EVENT_HASH_BATCH_ELECT,
-	WAIT_EVENT_HASH_BATCH_LOAD,
+	WAIT_EVENT_HASH_STRIPE_ELECT,
+	WAIT_EVENT_HASH_STRIPE_RESET,
+	WAIT_EVENT_HASH_STRIPE_LOAD,
+	WAIT_EVENT_HASH_STRIPE_PROBE,
 	WAIT_EVENT_HASH_BUILD_ALLOCATE,
 	WAIT_EVENT_HASH_BUILD_ELECT,
 	WAIT_EVENT_HASH_BUILD_HASH_INNER,
diff --git a/src/include/utils/sharedbits.h b/src/include/utils/sharedbits.h
new file mode 100644
index 0000000000..de43279de8
--- /dev/null
+++ b/src/include/utils/sharedbits.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * sharedbits.h
+ *	  Simple mechanism for sharing bits between backends.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/sharedbits.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHAREDBITS_H
+#define SHAREDBITS_H
+
+#include "storage/sharedfileset.h"
+
+struct SharedBits;
+typedef struct SharedBits SharedBits;
+
+struct SharedBitsParticipant;
+typedef struct SharedBitsParticipant SharedBitsParticipant;
+
+struct SharedBitsAccessor;
+typedef struct SharedBitsAccessor SharedBitsAccessor;
+
+extern SharedBitsAccessor *sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset);
+extern SharedBitsAccessor *sb_initialize(SharedBits *sbits, int participants, int my_participant_number, SharedFileSet *fileset, char *name);
+extern void sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits);
+extern size_t sb_estimate(int participants);
+
+extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bit);
+extern bool sb_checkbit(SharedBitsAccessor *accessor, uint32 n);
+extern BufFile *sb_combine(SharedBitsAccessor *accessor);
+
+extern void sb_end_write(SharedBitsAccessor *sba);
+extern void sb_end_read(SharedBitsAccessor *accessor);
+
+#endif							/* SHAREDBITS_H */
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9754504cc5..99aead8a4a 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,17 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			stripe;		/* stripe number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -49,6 +60,8 @@ extern void sts_reinitialize(SharedTuplestoreAccessor *accessor);
 
 extern void sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor);
 
+extern void sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor);
+
 extern void sts_end_parallel_scan(SharedTuplestoreAccessor *accessor);
 
 extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
@@ -58,4 +71,10 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+extern void sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor);
+
+extern void sts_reset_rewound(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_increment_ntuples(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 3a91c144a2..463e71238a 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -1013,3 +1013,1454 @@ WHERE
 (1 row)
 
 ROLLBACK;
+-- Serial Adaptive Hash Join
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+ANALYZE probeside, hashside_wide;
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Left Join (actual rows=215 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | btrim | id | hash |                 btrim                  
+------+-------+----+------+----------------------------------------
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    3 |       |  3 |    3 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+      |       |  1 |    1 | unmatched inner tuple in first stripe
+      |       |  1 |    1 | unmatched inner tuple in last stripe
+      |       |  1 |    1 | unmatched inner tuple in middle stripe
+(214 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Right Join (actual rows=214 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash |                 btrim                  
+------+-----------------------+----+------+----------------------------------------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+      |                       |  1 |    1 | unmatched inner tuple in first stripe
+      |                       |  1 |    1 | unmatched inner tuple in last stripe
+      |                       |  1 |    1 | unmatched inner tuple in middle stripe
+(218 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Full Join (actual rows=218 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join (actual rows=12 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash | btrim 
+------+-------
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    3 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+(12 rows)
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Anti Join (actual rows=4 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash |         btrim         
+------+-----------------------
+    1 | unmatched outer tuple
+    2 | 
+    5 | 
+    6 | unmatched outer tuple
+(4 rows)
+
+-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Gather (actual rows=215 loops=1)
+   Workers Planned: 1
+   Workers Launched: 1
+   ->  Parallel Hash Left Join (actual rows=108 loops=2)
+         Hash Cond: (probeside.a = hashside_wide.a)
+         ->  Parallel Seq Scan on probeside (actual rows=16 loops=1)
+         ->  Parallel Hash (actual rows=21 loops=2)
+               Buckets: 8 (originally 8)  Batches: 128 (originally 8)
+               Batch: 1  Stripes: 3
+               Batch: 6  Stripes: 2
+               ->  Parallel Seq Scan on hashside_wide (actual rows=42 loops=1)
+(11 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+rollback to settings;
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | batchno |      btrim      | id |  ctid  | hash | btrim 
+------+---------+-----------------+----+--------+------+-------
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 | unmatched outer |    |        |      | 
+(352 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 68c1a8c7b6..ab41b4d4c3 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -538,3 +538,149 @@ WHERE
     AND hjtest_1.a <> hjtest_2.b;
 
 ROLLBACK;
+
+-- Serial Adaptive Hash Join
+
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+
+ANALYZE probeside, hashside_wide;
+
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+rollback to settings;
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+ROLLBACK;
-- 
2.20.1

#56

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 5 years ago

In reply to: Melanie Plageman (#55)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Mon, Jun 08, 2020 at 05:12:25PM -0700, Melanie Plageman wrote:

On Wed, May 27, 2020 at 7:25 PM Melanie Plageman <melanieplageman@gmail.com>
wrote:

I've attached a rebased patch which includes the "provisionally detach"
deadlock hazard fix approach

Alas, the "provisional detach" logic proved incorrect (see last point in
the list of changes included in the patch at bottom).

Also, we kept the batch 0 spilling patch David Kimura authored [1]
separate so it could be discussed separately because we still had some
questions.

The serial batch 0 spilling is in the attached patch. Parallel batch 0
spilling is still in a separate batch that David Kimura is working on.

I've attached a rebased and updated patch with a few fixes:

- semi-join fallback works now
- serial batch 0 spilling in main patch
- added instrumentation for stripes to the parallel case
- SharedBits uses same SharedFileset as SharedTuplestore
- reverted the optimization to allow workers to re-attach to a batch and
help out with stripes if they are sure they pose no deadlock risk

For the last point, I discovered a pretty glaring problem with this
optimization: I did not include the bitmap created by a worker while
working on its first participating stripe in the final combined bitmap.
I only was combining the last bitmap file each worker worked on.

I had the workers make new bitmaps for each time that they attached to
the batch and participated because having them keep an open file
tracking information for a batch they are no longer attached to on the
chance that they might return and work on that batch was a
synchronization nightmare. It was difficult to figure out when to close
the file if they never returned and hard to make sure that the combining
worker is actually combining all the files from all participants who
were ever active.

I am sure I can hack around those, but I think we need a better solution
overall. After reverting those changes, loading and probing of stripes
after stripe 0 is serial. This is not only sub-optimal, it also means
that all the synchronization variables and code complexity around
coordinating work on fallback batches is practically wasted.
So, they have to be able to collaborate on stripes after the first
stripe. This version of the patch has correct results and no deadlock
hazard, however, it lacks parallelism on stripes after stripe 0.
I am looking for ideas on how to address the deadlock hazard more
efficiently.

The next big TODOs are:
- come up with a better solution to the potential tuple emitting/barrier
waiting deadlock issue
- parallel batch 0 spilling complete

Hi Melanie,

I started looking at the patch to refresh my knowledge both of this
patch and parallel hash join, but I think it needs a rebase. The
changes in 7897e3bb90 apparently touched some of the code. I assume
you're working on a patch addressing the remaining TODOS, right?

I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear stripe I think about RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the naming and
only tweak it at the end, not to disrupt reviews unnecessarily.

Now, a couple comments / questions about the code.

nodeHash.c
----------

1) MultiExecPrivateHash says this

/*
* Not subject to skew optimization, so either insert normally
* or save to batch file if it belongs to another stripe
*/

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.

2) I find the fields hashloop_fallback rather confusing. We have one in
HashJoinTable (and it's array of BufFile items) and another one in
ParallelHashJoinBatch (this time just bool).

I think HashJoinTable should be renamed to hashloopBatchFile (similarly
to the other BufFile arrays). Although I'm not sure why we even need
this file, when we have innerBatchFile? BufFile(s) are not exactly free,
in fact it's one of the problems for hashjoins with many batches.

3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches

childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;

and also about this comment

/*
* TODO: what to do about tuples that don't go to the child
* batch or stay in the current batch? (this is why we are
* counting tuples to child and curbatch with two diff
* variables in case the tuples go to a batch that isn't the
* child)
*/
if (batchno == childbatch)
childbatch_outgoing_tuples++;

I thought each old batch is split into two new ones, and the tuples
either stay in the current one, or are moved to the new one - which I
presume is the childbatch, although I haven't tried to decode that
formula. So where else could the tuple go, as the comment tried to
suggest?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#57

Jesse Zhang

sbjesse@gmail.com

over 5 years ago

In reply to: Tomas Vondra (#56)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Hi Tomas,

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra wrote:

Now, a couple comments / questions about the code.

nodeHash.c
----------

1) MultiExecPrivateHash says this

/*
* Not subject to skew optimization, so either insert normally
* or save to batch file if it belongs to another stripe
*/

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.

I have to concur that "stripe" did inspire a RAID vibe when I heard it,
but it seemed to be a better name than what it replaces

3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches

childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;

and also about this comment

/*
* TODO: what to do about tuples that don't go to the child
* batch or stay in the current batch? (this is why we are
* counting tuples to child and curbatch with two diff
* variables in case the tuples go to a batch that isn't the
* child)
*/
if (batchno == childbatch)
childbatch_outgoing_tuples++;

I thought each old batch is split into two new ones, and the tuples
either stay in the current one, or are moved to the new one - which I
presume is the childbatch, although I haven't tried to decode that
formula. So where else could the tuple go, as the comment tried to
suggest?

True, every old batch is split into two new ones, if you only consider
tuples coming from the batch file that _still belong in there_. i.e.
there are tuples in the old batch file that belong to a future batch. As
an example, if the current nbatch = 8, and we want to expand to nbatch =
16, (old) batch 1 will split into (new) batch 1 and batch 9, but it can
already contain tuples that need to go into (current) batches 3, 5, and
7 (soon-to-be batches 11, 13, and 15).

Cheers,
Jesse

#58

Melanie Plageman

melanieplageman@gmail.com

over 5 years ago

In reply to: Tomas Vondra (#56)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

I started looking at the patch to refresh my knowledge both of this
patch and parallel hash join, but I think it needs a rebase. The
changes in 7897e3bb90 apparently touched some of the code.

Thanks so much for the review, Tomas!

I've attached a rebased patch which also contains updates discussed
below.

I assume
you're working on a patch addressing the remaining TODOS, right?

I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.
Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those. Otherwise, I
will prioritize parallel batch 0 spilling.

I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.

Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those.

Otherwise, I will prioritize parallel batch 0 spilling.
David Kimura plans to do a bit of work on on parallel hash join batch 0
spilling tomorrow. Whatever is left after that, I will pick up next
week. Parallel hash join batch 0 spilling is the last large TODO that I
had.

My plan was to then focus on the feedback (either about which TODOs are
most important or outside of the TODOs I've identified) I get from you
and anyone else who reviews this.

I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear stripe I think about RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the naming and
only tweak it at the end, not to disrupt reviews unnecessarily.

I hear you about "stripe". I still quite like it, especially as compared
to its predecessor (originally, I called them chunks -- which is
impossible given that SharedTuplestoreChunks are a thing).

For ease of review, as you mentioned, I will keep the name for now. I am
open to changing it later, though.

I've been soliciting ideas for alternatives and, so far, folks have
suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm
still on team "stripe" though, as it stands.

nodeHash.c
----------

1) MultiExecPrivateHash says this

/*
* Not subject to skew optimization, so either insert normally
* or save to batch file if it belongs to another stripe
*/

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.

I agree this was confusing.

"belongs to another stripe" meant here that if batch 0 falls back and we
are still loading it, once we've filled up work_mem, we need to start
saving those tuples to a spill file for batch 0. I've changed the
comment to this:

-        * or save to batch file if it belongs to another stripe
+       * or save to batch file if batch 0 falls back and we have
+       * already filled the hashtable up to space_allowed.

2) I find the fields hashloop_fallback rather confusing. We have one in
HashJoinTable (and it's array of BufFile items) and another one in
ParallelHashJoinBatch (this time just bool).

I think HashJoinTable should be renamed to hashloopBatchFile (similarly
to the other BufFile arrays).

I think you are right about the name. I've changed the name in
HashJoinTableData to hashloopBatchFile.

The array of BufFiles hashloop_fallback was only used by serial
hashjoin. The boolean hashloop_fallback variable is used only by
parallel hashjoin.

The reason I had them named the same thing is that I thought it would be
nice to have a variable with the same name to indicate if a batch "fell
back" for both parallel and serial hashjoin--especially since we check
it in the main hashjoin state machine used by parallel and serial
hashjoin.

In serial hashjoin, the BufFiles aren't identified by name, so I kept
them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
the status saved (in the struct).
So, both represented the fall back status of a batch.

However, I agree with you, so I've renamed the serial one to
hashloopBatchFile.

Although I'm not sure why we even need
this file, when we have innerBatchFile? BufFile(s) are not exactly free,
in fact it's one of the problems for hashjoins with many batches.

Interesting -- it didn't even occur to me to combine the bitmap with the
inner side batch file data.
It definitely seems like a good idea to save the BufFile given that so
little data will likely go in it and that it has a 1-1 relationship with
inner side batches.

How might it work? Would you reserve some space at the beginning of the
file? When would you reserve the bytes (before adding tuples you won't
know how many bytes you need, so it might be hard to make sure there is
enough space.) Would all inner side files have space reserved or just
fallback batches?

--
Melanie Plageman

Attachments:

v10-0001-Implement-Adaptive-Hashjoin.patchtext/x-patch; charset=US-ASCII; name=v10-0001-Implement-Adaptive-Hashjoin.patchDownload

From 563dd3f24fcf9725f846f7a5ad6a1c31a5c7a078 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 25 Jun 2020 14:45:38 -0700
Subject: [PATCH v10] Implement Adaptive Hashjoin

If the inner side tuples of a hashjoin will not fit in memory, the
hashjoin can be executed in multiple batches. If the statistics on the
inner side relation are accurate, planner chooses a multi-batch
strategy and sets the number of batches.
The query executor measures the real size of the hashtable and increases
the number of batches if the hashtable grows too large.

The number of batches is always a power of two, so an increase in the
number of batches doubles it.

Serial hashjoin measures batch size lazily -- waiting until it is
loading a batch to determine if it will fit in memory.

Parallel hashjoin, on the other hand, completes all changes to the
number of batches during the build phase. If it doubles the number of
batches, it dumps all the tuples out, reassigns them to batches,
measures each batch, and checks that it will fit in the space allowed.

In both cases, the executor currently makes a best effort. If a
particular batch won't fit in memory, and, upon changing the number of
batches none of the tuples move to a new batch, the executor disables
growth in the number of batches globally. After growth is disabled, all
batches that would have previously triggered an increase in the number
of batches instead exceed the space allowed.

There is no mechanism to perform a hashjoin within memory constraints if
a run of tuples hash to the same batch. Also, hashjoin will continue to
double the number of batches if *some* tuples move each time -- even if
the batch will never fit in memory -- resulting in an explosion in the
number of batches (affecting performance negatively for multiple
reasons).

Adaptive hashjoin is a mechanism to process a run of inner side tuples
with join keys which hash to the same batch in a manner that is
efficient and respects the space allowed.

When an offending batch causes the number of batches to be doubled and
some percentage of the tuples would not move to a new batch, that batch
can be marked to "fall back". This mechanism replaces serial hashjoin's
"grow_enabled" flag and replaces part of the functionality of parallel
hashjoin's "growth = PHJ_GROWTH_DISABLED" flag. However, instead of
disabling growth in the number of batches for all batches, it only
prevents this batch from causing another increase in the number of
batches.

When the inner side of this batch is loaded into memory, stripes of
arbitrary tuples totaling work_mem in size are loaded into the
hashtable. After probing this stripe, the outer side batch is rewound
and the next stripe is loaded. Each stripe of inner is probed until all
tuples have been processed.

Tuples that match are emitted (depending on the join semantics of the
particular join type) during probing of a stripe. In order to make
left outer join work, unmatched tuples cannot be emitted NULL-extended
until all stripes have been probed. To address this, a bitmap is created
with a bit for each tuple of the outer side. If a tuple on the outer
side matches a tuple from the inner, the corresponding bit is set. At
the end of probing all stripes, the executor scans the bitmap and emits
unmatched outer tuples.

Batch 0 falls back for serial hashjoin but does not yet fall back for
parallel hashjoin. David Kimura is working on a separate patch for this.

TODOs:
- Better solution to deadlock hazard with waiting on a barrier after
  emitting tuples
- Experiment with different fallback threshholds
  (currently hardcoded to 80% but parameterizable)
- Improve stripe instrumentation implementation for serial and parallel
- Assorted TODOs in the code

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
Co-authored-by: David Kimura <dkimura@pivotal.io>
---
 src/backend/commands/explain.c            |   45 +-
 src/backend/executor/nodeHash.c           |  388 +++++-
 src/backend/executor/nodeHashjoin.c       |  789 +++++++++--
 src/backend/postmaster/pgstat.c           |   13 +-
 src/backend/utils/sort/Makefile           |    1 +
 src/backend/utils/sort/sharedbits.c       |  285 ++++
 src/backend/utils/sort/sharedtuplestore.c |  112 +-
 src/include/commands/explain.h            |    1 +
 src/include/executor/hashjoin.h           |   86 +-
 src/include/executor/instrument.h         |    7 +
 src/include/executor/nodeHash.h           |    1 +
 src/include/executor/tuptable.h           |    2 +
 src/include/nodes/execnodes.h             |    5 +
 src/include/pgstat.h                      |    5 +-
 src/include/utils/sharedbits.h            |   39 +
 src/include/utils/sharedtuplestore.h      |   19 +
 src/test/regress/expected/join_hash.out   | 1451 +++++++++++++++++++++
 src/test/regress/sql/join_hash.sql        |  146 +++
 18 files changed, 3222 insertions(+), 173 deletions(-)
 create mode 100644 src/backend/utils/sort/sharedbits.c
 create mode 100644 src/include/utils/sharedbits.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a131d15ac0..9748db6cc4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -184,6 +184,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			es->wal = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "settings") == 0)
 			es->settings = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "usage") == 0)
+			es->usage = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "timing") == 0)
 		{
 			timing_set = true;
@@ -312,6 +314,7 @@ NewExplainState(void)
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
+	es->usage = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
@@ -3000,6 +3003,8 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 											  worker_hi->nbatch_original);
 			hinstrument.space_peak = Max(hinstrument.space_peak,
 										 worker_hi->space_peak);
+			if (!hinstrument.fallback_batches_stats && worker_hi->fallback_batches_stats)
+				hinstrument.fallback_batches_stats = worker_hi->fallback_batches_stats;
 		}
 	}
 
@@ -3023,22 +3028,50 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
 				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)",
 							 hinstrument.nbuckets,
 							 hinstrument.nbuckets_original,
 							 hinstrument.nbatch,
-							 hinstrument.nbatch_original,
-							 spacePeakKb);
+							 hinstrument.nbatch_original);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Batch: %d  Stripes: %d\n", fbs->batchno, fbs->numstripes);
+			}
 		}
 		else
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
-							 hinstrument.nbuckets, hinstrument.nbatch,
-							 spacePeakKb);
+							 "Buckets: %d  Batches: %d",
+							 hinstrument.nbuckets, hinstrument.nbatch);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batch: %d  Stripes: %d\n",
+								 fbs->batchno,
+								 fbs->numstripes);
+			}
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 45b342011f..903f9f6180 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -80,7 +80,6 @@ static bool ExecParallelHashTuplePrealloc(HashJoinTable hashtable,
 static void ExecParallelHashMergeCounters(HashJoinTable hashtable);
 static void ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable);
 
-
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -183,13 +182,53 @@ MultiExecPrivateHash(HashState *node)
 			}
 			else
 			{
-				/* Not subject to skew optimization, so insert normally */
-				ExecHashTableInsert(hashtable, slot, hashvalue);
+				/*
+				 * Not subject to skew optimization, so either insert normally
+				 * or save to batch file if batch 0 falls back and we have
+				 * already filled the hashtable up to space_allowed.
+				 */
+				int			bucketno;
+				int			batchno;
+				bool		shouldFree;
+				MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+										  &bucketno, &batchno);
+
+				/*
+				 * If we set batch 0 to fallback on the previous tuple Save
+				 * the tuples in this batch which will not fit in the
+				 * hashtable should I be checking that hashtable->curstripe !=
+				 * 0?
+				 */
+				if (hashtable->hashloopBatchFile && hashtable->hashloopBatchFile[0])
+					ExecHashJoinSaveTuple(tuple,
+										  hashvalue,
+										  &hashtable->innerBatchFile[batchno]);
+				else
+					ExecHashTableInsert(hashtable, slot, hashvalue);
+
+				if (shouldFree)
+					heap_free_minimal_tuple(tuple);
 			}
 			hashtable->totalTuples += 1;
 		}
 	}
 
+	/*
+	 * If batch 0 fell back, rewind the inner side file where we saved the
+	 * tuples which did not fit in memory to prepare it for loading upon
+	 * finishing probing stripe 0 of batch 0
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[0])
+	{
+		if (BufFileSeek(hashtable->innerBatchFile[0], 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
+	}
+
+
 	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
 	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
 		ExecHashIncreaseNumBuckets(hashtable);
@@ -321,6 +360,40 @@ MultiExecParallelHash(HashState *node)
 				 * skew).
 				 */
 				pstate->growth = PHJ_GROWTH_DISABLED;
+
+				/*
+				 * In the current design, batch 0 cannot fall back. That
+				 * behavior is an artifact of the existing design where batch
+				 * 0 fills the initial hash table and as an optimization it
+				 * doesn't need a batch file. But, there is no real reason
+				 * that batch 0 shouldn't be allowed to spill.
+				 *
+				 * Consider a hash table where majority of tuples with
+				 * hashvalue 0. These tuples will never relocate no matter how
+				 * many batches exist. If you cannot exceed work_mem, then you
+				 * will be stuck infinitely trying to double the number of
+				 * batches in order to accommodate the tuples that can only
+				 * ever be in batch 0. So, we allow it to be set to fall back
+				 * during the build phase to avoid excessive batch increases
+				 * but we don't check it when loading the actual tuples, so we
+				 * may exceed space_allowed. We set it back to false here so
+				 * that it isn't true during any of the checks that may happen
+				 * during probing.
+				 */
+				hashtable->batches[0].shared->hashloop_fallback = false;
+
+				for (i = 0; i < hashtable->nbatch; ++i)
+				{
+					FallbackBatchStats *fallback_batch_stats;
+					ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
+
+					if (!batch->hashloop_fallback)
+						continue;
+					fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+					fallback_batch_stats->batchno = i;
+					fallback_batch_stats->numstripes = batch->maximum_stripe_number + 1;
+					hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
+				}
 			}
 	}
 
@@ -495,12 +568,14 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->curbatch = 0;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
-	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
 	hashtable->partialTuples = 0;
 	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
+	hashtable->hashloopBatchFile = NULL;
+	hashtable->fallback_batches_stats = NULL;
+	hashtable->curstripe = STRIPE_DETACHED;
 	hashtable->spaceUsed = 0;
 	hashtable->spacePeak = 0;
 	hashtable->spaceAllowed = space_allowed;
@@ -572,6 +647,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloopBatchFile = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
@@ -866,6 +943,8 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 				BufFileClose(hashtable->innerBatchFile[i]);
 			if (hashtable->outerBatchFile[i])
 				BufFileClose(hashtable->outerBatchFile[i]);
+			if (hashtable->hashloopBatchFile[i])
+				BufFileClose(hashtable->hashloopBatchFile[i]);
 		}
 	}
 
@@ -876,6 +955,18 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/*
+ * Threshhold for tuple relocation during batch split for parallel and serial
+ * hashjoin.
+ * While growing the number of batches, for the batch which triggered the growth,
+ * if more than MAX_RELOCATION % of its tuples move to its child batch, then
+ * it likely has skewed data and so the child batch (the new home to the skewed
+ * tuples) will be marked as a "fallback" batch and processed using the hashloop
+ * join algorithm. The reverse is true as well: if more than MAX_RELOCATION
+ * remain in the parent, it too should be marked to "fallback".
+ */
+#define MAX_RELOCATION 0.8
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
@@ -886,14 +977,18 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 {
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
+	int			childbatch;
 	int			nbatch;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
 	HashMemoryChunk oldchunks;
+	int			curbatch_outgoing_tuples;
+	int			childbatch_outgoing_tuples;
+	int			target_batch;
+	FallbackBatchStats *fallback_batch_stats;
 
-	/* do nothing if we've decided to shut off growth */
-	if (!hashtable->growEnabled)
+	if (hashtable->hashloopBatchFile && hashtable->hashloopBatchFile[curbatch])
 		return;
 
 	/* safety check to avoid overflow */
@@ -917,6 +1012,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloopBatchFile = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* time to establish the temp tablespaces, too */
 		PrepareTempTablespaces();
 	}
@@ -927,10 +1024,14 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			repalloc(hashtable->innerBatchFile, nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			repalloc(hashtable->outerBatchFile, nbatch * sizeof(BufFile *));
+		hashtable->hashloopBatchFile = (BufFile **)
+			repalloc(hashtable->hashloopBatchFile, nbatch * sizeof(BufFile *));
 		MemSet(hashtable->innerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
 		MemSet(hashtable->outerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
+		MemSet(hashtable->hashloopBatchFile + oldnbatch, 0,
+			   (nbatch - oldnbatch) * sizeof(BufFile *));
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -942,6 +1043,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * no longer of the current batch.
 	 */
 	ninmemory = nfreed = 0;
+	curbatch_outgoing_tuples = childbatch_outgoing_tuples = 0;
+	childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;
 
 	/* If know we need to resize nbuckets, we can do it while rebatching. */
 	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
@@ -987,8 +1090,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			ninmemory++;
 			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
-
-			if (batchno == curbatch)
+			if (batchno == curbatch && (curbatch != 0 || hashtable->spaceUsed < hashtable->spaceAllowed))
 			{
 				/* keep tuple in memory - copy it into the new chunk */
 				HashJoinTuple copyTuple;
@@ -999,17 +1101,28 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* and add it back to the appropriate bucket */
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
+				curbatch_outgoing_tuples++;
 			}
 			else
 			{
 				/* dump it out */
-				Assert(batchno > curbatch);
+				Assert(batchno >= curbatch);
 				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
 									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
 
 				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
+
+				/*
+				 * TODO: what to do about tuples that don't go to the child
+				 * batch or stay in the current batch? (this is why we are
+				 * counting tuples to child and curbatch with two diff
+				 * variables in case the tuples go to a batch that isn't the
+				 * child)
+				 */
+				if (batchno == childbatch)
+					childbatch_outgoing_tuples++;
 			}
 
 			/* next tuple in this chunk */
@@ -1029,22 +1142,35 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		   hashtable, nfreed, ninmemory, hashtable->spaceUsed);
 #endif
 
+
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
-	 * further expansion of nbatch.  This situation implies that we have
-	 * enough tuples of identical hashvalues to overflow spaceAllowed.
-	 * Increasing nbatch will not fix it since there's no way to subdivide the
-	 * group any more finely. We have to just gut it out and hope the server
-	 * has enough RAM.
+	 * The same batch should not be marked to fall back more than once
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
-	{
-		hashtable->growEnabled = false;
 #ifdef HJDEBUG
-		printf("Hashjoin %p: disabling further increase of nbatch\n",
-			   hashtable);
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("childbatch %i targeted to fallback.", childbatch);
+	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	}
+
+	/*
+	 * If too many tuples remain in the parent or too many tuples migrate to
+	 * the child, there is likely skew and continuing to increase the number
+	 * of batches will not help. Mark the batch which contains the skewed
+	 * tuples to be processed with block nested hashloop join.
+	 */
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
+		target_batch = childbatch;
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
+		target_batch = curbatch;
+	else
+		return;
+	hashtable->hashloopBatchFile[target_batch] = BufFileCreateTemp(false);
+
+	fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = target_batch;
+	fallback_batch_stats->numstripes = 0;
+	hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
 }
 
 /*
@@ -1213,7 +1339,6 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 									 WAIT_EVENT_HASH_GROW_BATCHES_DECIDE))
 			{
 				bool		space_exhausted = false;
-				bool		extreme_skew_detected = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
@@ -1224,27 +1349,58 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				{
 					ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
 
+					/*
+					 * All batches were just created anew during
+					 * repartitioning
+					 */
+					Assert(!batch->hashloop_fallback);
+
+					/*
+					 * At the time of repartitioning, each batch updates its
+					 * estimated_size to reflect the size of the batch file on
+					 * disk. It is also updated when increasing preallocated
+					 * space in ExecParallelHashTuplePrealloc().  However,
+					 * batch 0 does not store anything on disk so it has no
+					 * estimated_size.
+					 *
+					 * We still want to allow batch 0 to trigger batch growth.
+					 * In order to do that, for batch 0 check whether the
+					 * actual size exceeds space_allowed. It is a little
+					 * backwards at this point as we would have already
+					 * exceeded inserted the allowed space.
+					 */
 					if (batch->space_exhausted ||
-						batch->estimated_size > pstate->space_allowed)
+						batch->estimated_size > pstate->space_allowed ||
+						batch->size > pstate->space_allowed)
 					{
 						int			parent;
+						float		frac_moved;
 
 						space_exhausted = true;
 
+						parent = i % pstate->old_nbatch;
+						frac_moved = batch->ntuples / (float) hashtable->batches[parent].shared->old_ntuples;
+
 						/*
-						 * Did this batch receive ALL of the tuples from its
-						 * parent batch?  That would indicate that further
-						 * repartitioning isn't going to help (the hash values
-						 * are probably all the same).
+						 * If too many tuples remain in the parent or too many
+						 * tuples migrate to the child, there is likely skew
+						 * and continuing to increase the number of batches
+						 * will not help. Mark the batch which contains the
+						 * skewed tuples to be processed with block nested
+						 * hashloop join.
 						 */
-						parent = i % pstate->old_nbatch;
-						if (batch->ntuples == hashtable->batches[parent].shared->old_ntuples)
-							extreme_skew_detected = true;
+						if (frac_moved >= MAX_RELOCATION)
+						{
+							batch->hashloop_fallback = true;
+							space_exhausted = false;
+						}
 					}
+					if (space_exhausted)
+						break;
 				}
 
-				/* Don't keep growing if it's not helping or we'd overflow. */
-				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+				/* Don't keep growing if we'd overflow. */
+				if (hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -1311,11 +1467,28 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 			{
 				size_t		tuple_size =
 				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+				tupleMetadata metadata;
 
 				/* It belongs in a later batch. */
+				ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+
+				LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+				if (batch->estimated_stripe_size + tuple_size > hashtable->parallel_state->space_allowed)
+				{
+					batch->maximum_stripe_number++;
+					batch->estimated_stripe_size = 0;
+				}
+
+				batch->estimated_stripe_size += tuple_size;
+
+				metadata.hashvalue = hashTuple->hashvalue;
+				metadata.stripe = batch->maximum_stripe_number;
+				LWLockRelease(&batch->lock);
+
 				hashtable->batches[batchno].estimated_size += tuple_size;
-				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+
+				sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 			}
 
 			/* Count this tuple. */
@@ -1363,27 +1536,41 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 	for (i = 1; i < old_nbatch; ++i)
 	{
 		MinimalTuple tuple;
-		uint32		hashvalue;
+		tupleMetadata metadata;
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &metadata.hashvalue)))
 		{
 			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
+			ParallelHashJoinBatch *batch;
 
 			/* Decide which partition it goes to in the new generation. */
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
 
 			hashtable->batches[batchno].estimated_size += tuple_size;
 			++hashtable->batches[batchno].ntuples;
 			++hashtable->batches[i].old_ntuples;
 
+			batch = hashtable->batches[batchno].shared;
+
+			/* Store the tuple its new batch. */
+			LWLockAcquire(&batch->lock, LW_EXCLUSIVE);
+
+			if (batch->estimated_stripe_size + tuple_size > pstate->space_allowed)
+			{
+				batch->maximum_stripe_number++;
+				batch->estimated_stripe_size = 0;
+			}
+			batch->estimated_stripe_size += tuple_size;
+			metadata.stripe = batch->maximum_stripe_number;
+			LWLockRelease(&batch->lock);
 			/* Store the tuple its new batch. */
-			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+			sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 
 			CHECK_FOR_INTERRUPTS();
 		}
@@ -1693,6 +1880,12 @@ retry:
 
 	if (batchno == 0)
 	{
+		/*
+		 * TODO: if spilling is enabled for batch 0 so that it can fall back,
+		 * we will need to stop loading batch 0 into the hashtable somewhere--
+		 * maybe here-- and switch to saving tuples to a file. Currently, this
+		 * will simply exceed the space allowed
+		 */
 		HashJoinTuple hashTuple;
 
 		/* Try to load it into memory. */
@@ -1715,10 +1908,17 @@ retry:
 	else
 	{
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+		ParallelHashJoinBatch *batch;
+		tupleMetadata metadata;
 
 		Assert(batchno > 0);
 
 		/* Try to preallocate space in the batch if necessary. */
+
+		/*
+		 * TODO: is it okay to only count the tuple when it doesn't fit in the
+		 * preallocated memory?
+		 */
 		if (hashtable->batches[batchno].preallocated < tuple_size)
 		{
 			if (!ExecParallelHashTuplePrealloc(hashtable, batchno, tuple_size))
@@ -1727,8 +1927,14 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
-					 tuple);
+		batch = hashtable->batches[batchno].shared;
+
+		metadata.hashvalue = hashvalue;
+		LWLockAcquire(&batch->lock, LW_SHARED);
+		metadata.stripe = batch->maximum_stripe_number;
+		LWLockRelease(&batch->lock);
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples, &metadata, tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
 
@@ -2697,6 +2903,7 @@ ExecHashAccumInstrumentation(HashInstrumentation *instrument,
 									  hashtable->nbatch_original);
 	instrument->space_peak = Max(instrument->space_peak,
 								 hashtable->spacePeak);
+	instrument->fallback_batches_stats = hashtable->fallback_batches_stats;
 }
 
 /*
@@ -2850,6 +3057,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 	/* Check if it's time to grow batches or buckets. */
 	if (pstate->growth != PHJ_GROWTH_DISABLED)
 	{
+		ParallelHashJoinBatchAccessor batch = hashtable->batches[0];
+
 		Assert(curbatch == 0);
 		Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
 
@@ -2858,8 +3067,13 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		 * very large tuples or very low work_mem setting, we'll always allow
 		 * each backend to allocate at least one chunk.
 		 */
-		if (hashtable->batches[0].at_least_one_chunk &&
-			hashtable->batches[0].shared->size +
+
+		/*
+		 * TODO: get rid of this check for batch 0 and make it so that batch 0
+		 * always has to keep trying to increase the number of batches
+		 */
+		if (!batch.shared->hashloop_fallback && batch.at_least_one_chunk &&
+			batch.shared->size +
 			chunk_size > pstate->space_allowed)
 		{
 			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -2891,6 +3105,11 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 
 	/* We are cleared to allocate a new chunk. */
 	chunk_shared = dsa_allocate(hashtable->area, chunk_size);
+
+	/*
+	 * TODO: if batch 0 will have stripes, need to account for this memory
+	 * there
+	 */
 	hashtable->batches[curbatch].shared->size += chunk_size;
 	hashtable->batches[curbatch].at_least_one_chunk = true;
 
@@ -2960,21 +3179,38 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
+
+		shared->hashloop_fallback = false;
+		/* TODO: is it okay to use the same tranche for this lock? */
+		LWLockInitialize(&shared->lock, LWTRANCHE_PARALLEL_HASH_JOIN);
+		shared->maximum_stripe_number = 0;
+		shared->estimated_stripe_size = 0;
 
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
 		 */
 		BarrierInit(&shared->batch_barrier, 0);
+		BarrierInit(&shared->stripe_barrier, 0);
+
+		/* Batch 0 doesn't need to be loaded. */
 		if (i == 0)
 		{
-			/* Batch 0 doesn't need to be loaded. */
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_STRIPING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
+
+			BarrierAttach(&shared->stripe_barrier);
+			while (BarrierPhase(&shared->stripe_barrier) < PHJ_STRIPE_PROBING)
+				BarrierArriveAndWait(&shared->stripe_barrier, 0);
+			BarrierDetach(&shared->stripe_barrier);
 		}
+		/* why isn't done initialized here ? */
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 
 		/* Initialize accessor state.  All members were zero-initialized. */
 		accessor->shared = shared;
@@ -2985,7 +3221,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2995,10 +3231,14 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		/* Use the same SharedFileset for the SharedTupleStore and SharedBits */
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->fileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3047,8 +3287,8 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	 * It's possible for a backend to start up very late so that the whole
 	 * join is finished and the shm state for tracking batches has already
 	 * been freed by ExecHashTableDetach().  In that case we'll just leave
-	 * hashtable->batches as NULL so that ExecParallelHashJoinNewBatch() gives
-	 * up early.
+	 * hashtable->batches as NULL so that ExecParallelHashJoinAdvanceBatch()
+	 * gives up early.
 	 */
 	if (!DsaPointerIsValid(pstate->batches))
 		return;
@@ -3070,10 +3310,11 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
-		accessor->done = false;
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
@@ -3083,6 +3324,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->fileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3165,6 +3407,18 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 	}
 }
 
+bool
+ExecHashTableDetachStripe(HashJoinTable hashtable)
+{
+	int			curbatch = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+
+	BarrierDetach(stripe_barrier);
+	hashtable->curstripe = STRIPE_DETACHED;
+	return false;
+}
+
 /*
  * Detach from all shared resources.  If we are last to detach, clean up.
  */
@@ -3350,13 +3604,35 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 	{
 		/*
 		 * We have determined that this batch would exceed the space budget if
-		 * loaded into memory.  Command all participants to help repartition.
+		 * loaded into memory.
 		 */
-		batch->shared->space_exhausted = true;
-		pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
-		LWLockRelease(&pstate->lock);
-
-		return false;
+		/* TODO: the nested lock is a deadlock waiting to happen. */
+		LWLockAcquire(&batch->shared->lock, LW_EXCLUSIVE);
+		if (!batch->shared->hashloop_fallback)
+		{
+			/*
+			 * This batch is not marked to fall back so command all
+			 * participants to help repartition.
+			 */
+			batch->shared->space_exhausted = true;
+			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
+			LWLockRelease(&batch->shared->lock);
+			LWLockRelease(&pstate->lock);
+			return false;
+		}
+		else if (batch->shared->estimated_stripe_size + want +
+				 HASH_CHUNK_HEADER_SIZE > pstate->space_allowed)
+		{
+			/*
+			 * This batch is marked to fall back and the current (last) stripe
+			 * does not have enough space to handle the request so we must
+			 * increment the number of stripes in the batch and reset the size
+			 * of its new last stripe.
+			 */
+			batch->shared->maximum_stripe_number++;
+			batch->shared->estimated_stripe_size = 0;
+		}
+		LWLockRelease(&batch->shared->lock);
 	}
 
 	batch->at_least_one_chunk = true;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 9bb23fef1a..33f42405db 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -92,6 +92,27 @@
  * work_mem of all participants to create a large shared hash table.  If that
  * turns out either at planning or execution time to be impossible then we
  * fall back to regular work_mem sized hash tables.
+ * If a given batch causes the number of batches to be doubled and data skew
+ * causes too few or too many tuples to be relocated to the child of this batch,
+ * the batch which is now home to the skewed tuples is marked as a "fallback"
+ * batch. This means that it will be processed using multiple loops --
+ * each loop probing an arbitrary stripe of tuples from this batch
+ * which fit in work_mem or combined work_mem.
+ * This batch is no longer permitted to cause growth in the number of batches.
+ *
+ * When the inner side of a fallback batch is loaded into memory, stripes of
+ * arbitrary tuples totaling work_mem or combined work_mem in size are loaded
+ * into the hashtable. After probing this stripe, the outer side batch is
+ * rewound and the next stripe is loaded. Each stripe of the inner batch is
+ * probed until all tuples from that batch have been processed.
+ *
+ * Tuples that match are emitted (depending on the join semantics of the
+ * particular join type) during probing of the stripe. However, in order to make
+ * left outer join work, unmatched tuples cannot be emitted NULL-extended until
+ * all stripes have been probed. To address this, a bitmap is created with a bit
+ * for each tuple of the outer side. If a tuple on the outer side matches a
+ * tuple from the inner, the corresponding bit is set. At the end of probing all
+ * stripes, the executor scans the bitmap and emits unmatched outer tuples.
  *
  * To avoid deadlocks, we never wait for any barrier unless it is known that
  * all other backends attached to it are actively executing the node or have
@@ -126,7 +147,7 @@
 #define HJ_SCAN_BUCKET			3
 #define HJ_FILL_OUTER_TUPLE		4
 #define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_NEED_NEW_STRIPE      6
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +164,91 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
+static int	ExecHashJoinLoadStripe(HashJoinState *hjstate);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+static bool ExecParallelHashJoinLoadStripe(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool checkbit(HashJoinState *hjstate);
+static void set_match_bit(HashJoinState *hjstate);
+
+static pg_attribute_always_inline bool
+			IsHashloopFallback(HashJoinTable hashtable);
+
+#define UINT_BITS (sizeof(unsigned int) * CHAR_BIT)
+
+static void
+set_match_bit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	BufFile    *statusFile = hashtable->hashloopBatchFile[hashtable->curbatch];
+	int			tupindex = hjstate->hj_CurNumOuterTuples - 1;
+	size_t		unit_size = sizeof(hjstate->hj_CurOuterMatchStatus);
+	off_t		offset = tupindex / UINT_BITS * unit_size;
+
+	int			fileno;
+	off_t		cursor;
+
+	BufFileTell(statusFile, &fileno, &cursor);
+
+	/* Extend the statusFile if this is stripe zero. */
+	if (hashtable->curstripe == 0)
+	{
+		for (; cursor < offset + unit_size; cursor += unit_size)
+		{
+			hjstate->hj_CurOuterMatchStatus = 0;
+			BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+		}
+	}
+
+	if (cursor != offset)
+		BufFileSeek(statusFile, 0, offset, SEEK_SET);
+
+	BufFileRead(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+	BufFileSeek(statusFile, 0, -unit_size, SEEK_CUR);
+
+	hjstate->hj_CurOuterMatchStatus |= 1U << tupindex % UINT_BITS;
+	BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+}
+
+/* return true if bit is set and false if not */
+static bool
+checkbit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *outer_match_statuses;
+
+	int			bitno = hjstate->hj_EmitOuterTupleId % UINT_BITS;
+
+	hjstate->hj_EmitOuterTupleId++;
+	outer_match_statuses = hjstate->hj_HashTable->hashloopBatchFile[curbatch];
+
+	/*
+	 * if current chunk of bitmap is exhausted, read next chunk of bitmap from
+	 * outer_match_status_file
+	 */
+	if (bitno == 0)
+		BufFileRead(outer_match_statuses, &hjstate->hj_CurOuterMatchStatus,
+					sizeof(hjstate->hj_CurOuterMatchStatus));
+
+	/*
+	 * check if current tuple's match bit is set in outer match status file
+	 */
+	return hjstate->hj_CurOuterMatchStatus & (1U << bitno);
+}
+
+static bool
+IsHashloopFallback(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state)
+		return hashtable->batches[hashtable->curbatch].shared->hashloop_fallback;
+
+	if (!hashtable->hashloopBatchFile)
+		return false;
 
+	return hashtable->hashloopBatchFile[hashtable->curbatch];
+}
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -290,6 +392,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				hashNode->hashtable = hashtable;
 				(void) MultiExecProcNode((PlanState *) hashNode);
 
+				/*
+				 * After building the hashtable, stripe 0 of batch 0 will have
+				 * been loaded.
+				 */
+				hashtable->curstripe = 0;
+
 				/*
 				 * If the inner relation is completely empty, and we're not
 				 * doing a left outer join, we can quit without scanning the
@@ -333,12 +441,11 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 
 					/* Each backend should now select a batch to work on. */
 					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
 
-					continue;
+					if (!ExecParallelHashJoinNewBatch(node))
+						return NULL;
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
 				/* FALL THRU */
 
@@ -365,12 +472,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
 					}
 					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
+
+				/*
+				 * Don't reset hj_MatchedOuter after the first stripe as it
+				 * would cancel out whatever we found before
+				 */
+				if (node->hj_HashTable->curstripe == 0)
+					node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -386,9 +499,15 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * The tuple might not belong to the current batch (where
 				 * "current batch" includes the skew buckets if any).
+				 *
+				 * This should only be done once per tuple per batch. If a
+				 * batch "falls back", its inner side will be split into
+				 * stripes. Any displaced outer tuples should only be
+				 * relocated while probing the first stripe of the inner side.
 				 */
 				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO &&
+					node->hj_HashTable->curstripe == 0)
 				{
 					bool		shouldFree;
 					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
@@ -410,6 +529,32 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				if (batchno == 0 && node->hj_HashTable->curstripe == 0 && IsHashloopFallback(hashtable))
+				{
+					bool		shouldFree;
+					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
+																	  &shouldFree);
+
+					/*
+					 * Need to save this outer tuple to a batch since batch 0
+					 * is fallback and we must later rewind.
+					 */
+					Assert(parallel_state == NULL);
+					ExecHashJoinSaveTuple(mintuple, hashvalue,
+										  &hashtable->outerBatchFile[batchno]);
+
+					if (shouldFree)
+						heap_free_minimal_tuple(mintuple);
+				}
+
+
+				/*
+				 * While probing the phantom stripe, don't increment
+				 * hj_CurNumOuterTuples or extend the bitmap
+				 */
+				if (!parallel && hashtable->curstripe != PHANTOM_STRIPE)
+					node->hj_CurNumOuterTuples++;
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -455,6 +600,25 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				{
 					node->hj_MatchedOuter = true;
 
+					if (HJ_FILL_OUTER(node) && IsHashloopFallback(hashtable))
+					{
+						/*
+						 * Each bit corresponds to a single tuple. Setting the
+						 * match bit keeps track of which tuples were matched
+						 * for batches which are using the block nested
+						 * hashloop fallback method. It persists this match
+						 * status across multiple stripes of tuples, each of
+						 * which is loaded into the hashtable and probed. The
+						 * outer match status file is the cumulative match
+						 * status of outer tuples for a given batch across all
+						 * stripes of that inner side batch.
+						 */
+						if (parallel)
+							sb_setbit(hashtable->batches[hashtable->curbatch].sba, econtext->ecxt_outertuple->tts_tuplenum);
+						else
+							set_match_bit(node);
+					}
+
 					if (parallel)
 					{
 						/*
@@ -488,8 +652,17 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					 * continue with next outer tuple.
 					 */
 					if (node->js.single_match)
+					{
 						node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+						/*
+						 * Only consider returning the tuple while on the
+						 * first stripe.
+						 */
+						if (node->hj_HashTable->curstripe != 0)
+							continue;
+					}
+
 					if (otherqual == NULL || ExecQual(otherqual, econtext))
 						return ExecProject(node->js.ps.ps_ProjInfo);
 					else
@@ -508,6 +681,22 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+				if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(node))
+				{
+					if (hashtable->curstripe != PHANTOM_STRIPE)
+						continue;
+
+					if (parallel)
+					{
+						ParallelHashJoinBatchAccessor *accessor =
+						&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+						node->hj_MatchedOuter = sb_checkbit(accessor->sba, econtext->ecxt_outertuple->tts_tuplenum);
+					}
+					else
+						node->hj_MatchedOuter = checkbit(node);
+				}
+
 				if (!node->hj_MatchedOuter &&
 					HJ_FILL_OUTER(node))
 				{
@@ -534,7 +723,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
@@ -550,19 +739,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					InstrCountFiltered2(node, 1);
 				break;
 
-			case HJ_NEED_NEW_BATCH:
+			case HJ_NEED_NEW_STRIPE:
 
 				/*
-				 * Try to advance to next batch.  Done if there are no more.
+				 * Try to advance to next stripe. Then try to advance to the
+				 * next batch if there are no more stripes in this batch. Done
+				 * if there are no more batches.
 				 */
 				if (parallel)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
+					if (!ExecParallelHashJoinLoadStripe(node) &&
+						!ExecParallelHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-aware join */
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
+					if (!ExecHashJoinLoadStripe(node) &&
+						!ExecHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-oblivious join */
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
@@ -751,6 +944,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_JoinState = HJ_BUILD_HASHTABLE;
 	hjstate->hj_MatchedOuter = false;
 	hjstate->hj_OuterNotEmpty = false;
+	hjstate->hj_CurNumOuterTuples = 0;
+	hjstate->hj_CurOuterMatchStatus = 0;
 
 	return hjstate;
 }
@@ -917,15 +1112,24 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	}
 	else if (curbatch < hashtable->nbatch)
 	{
+		tupleMetadata metadata;
 		MinimalTuple tuple;
 
 		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+									   &metadata);
+		*hashvalue = metadata.hashvalue;
+
 		if (tuple != NULL)
 		{
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+
+			/*
+			 * TODO: should we use tupleid instead of position in the serial
+			 * case too?
+			 */
+			hjstate->hj_OuterTupleSlot->tts_tuplenum = metadata.tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
@@ -949,24 +1153,37 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
+	BufFile    *innerFile = NULL;
+	BufFile    *outerFile = NULL;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
 
-	if (curbatch > 0)
+	/*
+	 * We no longer need the previous outer batch file; close it right away to
+	 * free disk space.
+	 */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
-		/*
-		 * We no longer need the previous outer batch file; close it right
-		 * away to free disk space.
-		 */
-		if (hashtable->outerBatchFile[curbatch])
-			BufFileClose(hashtable->outerBatchFile[curbatch]);
+		BufFileClose(hashtable->outerBatchFile[curbatch]);
 		hashtable->outerBatchFile[curbatch] = NULL;
 	}
-	else						/* we just finished the first batch */
+	if (IsHashloopFallback(hashtable))
+	{
+		BufFileClose(hashtable->hashloopBatchFile[curbatch]);
+		hashtable->hashloopBatchFile[curbatch] = NULL;
+	}
+
+	/*
+	 * We are surely done with the inner batch file now
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+	{
+		BufFileClose(hashtable->innerBatchFile[curbatch]);
+		hashtable->innerBatchFile[curbatch] = NULL;
+	}
+
+	if (curbatch == 0)			/* we just finished the first batch */
 	{
 		/*
 		 * Reset some of the skew optimization state variables, since we no
@@ -1030,55 +1247,148 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	hashtable->curstripe = STRIPE_DETACHED;
+	hjstate->hj_CurNumOuterTuples = 0;
 
-	/*
-	 * Reload the hash table with the new inner batch (which could be empty)
-	 */
-	ExecHashTableReset(hashtable);
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+		innerFile = hashtable->innerBatchFile[curbatch];
+
+	if (innerFile && BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	/* Need to rewind outer when this is the first stripe of a new batch */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
+		outerFile = hashtable->outerBatchFile[curbatch];
 
-	innerFile = hashtable->innerBatchFile[curbatch];
+	if (outerFile && BufFileSeek(outerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+					errmsg("could not rewind hash-join temporary file: %m")));
+
+	ExecHashJoinLoadStripe(hjstate);
+	return true;
+}
 
-	if (innerFile != NULL)
+static inline void
+InstrIncrBatchStripes(List *fallback_batches_stats, int curbatch)
+{
+	ListCell   *lc;
+
+	foreach(lc, fallback_batches_stats)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file")));
+		FallbackBatchStats *fallback_batch_stats = lfirst(lc);
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
-												 &hashvalue,
-												 hjstate->hj_HashTupleSlot)))
+		if (fallback_batch_stats->batchno == curbatch)
 		{
-			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
-			 */
-			ExecHashTableInsert(hashtable, slot, hashvalue);
+			fallback_batch_stats->numstripes++;
+			break;
 		}
-
-		/*
-		 * after we build the hash table, the inner batch file is no longer
-		 * needed
-		 */
-		BufFileClose(innerFile);
-		hashtable->innerBatchFile[curbatch] = NULL;
 	}
+}
+
+/*
+ * Returns false when the inner batch file is exhausted
+ */
+static int
+ExecHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+	bool		loaded_inner = false;
+
+	if (hashtable->curstripe == PHANTOM_STRIPE)
+		return false;
 
 	/*
 	 * Rewind outer batch file (if present), so that we can start reading it.
+	 * TODO: This is only necessary if this is not the first stripe of the
+	 * batch
 	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
 		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file")));
+					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
 
-	return true;
+	hashtable->curstripe++;
+
+	if (!hashtable->innerBatchFile || !hashtable->innerBatchFile[curbatch])
+		return false;
+
+	/*
+	 * Reload the hash table with the new inner stripe
+	 */
+	ExecHashTableReset(hashtable);
+
+	while ((slot = ExecHashJoinGetSavedTuple(hjstate,
+											 hashtable->innerBatchFile[curbatch],
+											 &hashvalue,
+											 hjstate->hj_HashTupleSlot)))
+	{
+		/*
+		 * NOTE: some tuples may be sent to future batches.  Also, it is
+		 * possible for hashtable->nbatch to be increased here!
+		 */
+		uint32		hashTupleSize;
+
+		/*
+		 * TODO: wouldn't it be cool if this returned the size of the tuple
+		 * inserted
+		 */
+		ExecHashTableInsert(hashtable, slot, hashvalue);
+		loaded_inner = true;
+
+		if (!IsHashloopFallback(hashtable))
+			continue;
+
+		hashTupleSize = slot->tts_ops->get_minimal_tuple(slot)->t_len + HJTUPLE_OVERHEAD;
+
+		if (hashtable->spaceUsed + hashTupleSize +
+			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
+			> hashtable->spaceAllowed)
+			break;
+	}
+
+	/*
+	 * if we didn't load anything and it is a FOJ/LOJ fallback batch, we will
+	 * transition to emit unmatched outer tuples next. we want to know how
+	 * many tuples were in the batch in that case, so don't zero it out then
+	 */
+
+	/*
+	 * if we loaded anything into the hashtable or it is the phantom stripe,
+	 * must proceed to probing
+	 */
+	if (loaded_inner)
+	{
+		hjstate->hj_CurNumOuterTuples = 0;
+		InstrIncrBatchStripes(hashtable->fallback_batches_stats, curbatch);
+		return true;
+	}
+
+	if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(hjstate))
+	{
+		/*
+		 * if we didn't load anything and it is a fallback batch, we will
+		 * prepare to emit outer tuples during the phantom stripe probing
+		 */
+		hashtable->curstripe = PHANTOM_STRIPE;
+		hjstate->hj_EmitOuterTupleId = 0;
+		hjstate->hj_CurOuterMatchStatus = 0;
+		BufFileSeek(hashtable->hashloopBatchFile[curbatch], 0, 0, SEEK_SET);
+		BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET);
+		return true;
+	}
+	return false;
 }
 
+
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
  * false if there are no more batches.
@@ -1101,11 +1411,21 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
 	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_STRIPING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_STRIPING
+	 * case). The PHJ_BATCH_STRIPING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
 	 */
+
 	if (hashtable->curbatch >= 0)
 	{
-		hashtable->batches[hashtable->curbatch].done = true;
+		ParallelHashJoinBatchAccessor *batch_accessor = &hashtable->batches[hashtable->curbatch];
+
+		if (IsHashloopFallback(hashtable))
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+		batch_accessor->done = PHJ_BATCH_ACCESSOR_DONE;
 		ExecHashTableDetachBatch(hashtable);
 	}
 
@@ -1119,13 +1439,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		hashtable->nbatch;
 	do
 	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
-		if (!hashtable->batches[batchno].done)
+		if (hashtable->batches[batchno].done != PHJ_BATCH_ACCESSOR_DONE)
 		{
-			SharedTuplestoreAccessor *inner_tuples;
 			Barrier    *batch_barrier =
 			&hashtable->batches[batchno].shared->batch_barrier;
 
@@ -1136,7 +1451,15 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
 											 WAIT_EVENT_HASH_BATCH_ELECT))
+					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
+
+						/*
+						 * one worker needs to 0 out the read_pages of all the
+						 * participants in the new batch
+						 */
+						sts_reinitialize(hashtable->batches[batchno].inner_tuples);
+					}
 					/* Fall through. */
 
 				case PHJ_BATCH_ALLOCATING:
@@ -1145,41 +1468,31 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 										 WAIT_EVENT_HASH_BATCH_ALLOCATE);
 					/* Fall through. */
 
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOAD);
-					/* Fall through. */
+				case PHJ_BATCH_STRIPING:
 
-				case PHJ_BATCH_PROBING:
+					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
+					sts_begin_parallel_scan(hashtable->batches[batchno].inner_tuples);
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_initialize_accessor(hashtable->batches[hashtable->curbatch].sba,
+											   sts_get_tuplenum(hashtable->batches[hashtable->curbatch].outer_tuples));
+					hashtable->curstripe = STRIPE_DETACHED;
+					if (ExecParallelHashJoinLoadStripe(hjstate))
+						return true;
 
 					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
+					 * ExecParallelHashJoinLoadStripe() will return false from
+					 * here when no more work can be done by this worker on
+					 * this batch. Until further optimized, this worker will
+					 * have detached from the stripe_barrier and should close
+					 * its outer match statuses bitmap and then detach from
+					 * the batch. In order to reuse the code below, fall
+					 * through, even though the phase will not have been
+					 * advanced
 					 */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
-					return true;
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_end_write(hashtable->batches[batchno].sba);
+
+					/* Fall through. */
 
 				case PHJ_BATCH_DONE:
 
@@ -1188,7 +1501,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * remain).
 					 */
 					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
+					hashtable->batches[batchno].done = PHJ_BATCH_ACCESSOR_DONE;
 					hashtable->curbatch = -1;
 					break;
 
@@ -1203,6 +1516,274 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	return false;
 }
 
+
+
+/*
+ * Returns true if ready to probe and false if the inner is exhausted
+ * (there are no more stripes)
+ */
+bool
+ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			batchno = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[batchno].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+	ParallelHashJoinBatchAccessor *accessor;
+	dsa_pointer_atomic *buckets;
+
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	if (hashtable->curstripe >= 0)
+	{
+		/*
+		 * If a worker is already attached to a stripe, wait until all
+		 * participants have finished probing and detach. The last worker,
+		 * however, can re-attach to the stripe_barrier and proceed to load
+		 * and probe the other stripes
+		 */
+		/*
+		 * After finishing with participating in a stripe, if a worker is the
+		 * only one working on a batch, it will continue working on it.
+		 * However, if a worker is not the only worker working on a batch, it
+		 * would risk deadlock if it waits on the barrier. Instead, it will
+		 * detach from the stripe, and, eventually the batch.
+		 *
+		 * This means all stripes after the first stripe will be executed
+		 * serially. TODO: allow workers to provisionally detach from the
+		 * batch and reattach later if there is still work to be done. I had a
+		 * patch that did this. Workers who were not the last worker saved the
+		 * state of the stripe barrier upon detaching and then mark the batch
+		 * as "provisionally" done (not done). Later, when the worker comes
+		 * back to the batch in the batch phase machine, if the batch is not
+		 * complete and the phase has advanced since the worker was last
+		 * participating, then the worker can join back in. This had problems.
+		 * There were synchronization issues with workers having multiple
+		 * outer match status bitmap files open at the same time, so, I had
+		 * workers close their bitmap and make a new one the next time they
+		 * joined in. This didn't work with the current code because the
+		 * original outer match status bitmap file that the worker had created
+		 * while probing stripe 1 did not get combined into the combined
+		 * bitmap This could be specifically fixed, but I think it is better
+		 * to address the lack of parallel execution for stripes after stripe
+		 * 0 more holistically.
+		 */
+		if (!BarrierArriveAndDetach(stripe_barrier))
+		{
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+			hashtable->curstripe = STRIPE_DETACHED;
+			return false;
+		}
+
+		/*
+		 * This isn't a race condition if no other workers can stay attached
+		 * to this barrier in the intervening time. Basically, if you attach
+		 * to a stripe barrier in the PHJ_STRIPE_DONE phase, detach
+		 * immediately and move on.
+		 */
+		BarrierAttach(stripe_barrier);
+	}
+	else if (hashtable->curstripe == STRIPE_DETACHED)
+	{
+		int			phase = BarrierAttach(stripe_barrier);
+
+		/*
+		 * If a worker enters this phase machine on a stripe number greater
+		 * than the batch's maximum stripe number, then: 1) The batch is done,
+		 * or 2) The batch is on the phantom stripe that's used for hashloop
+		 * fallback Either way the worker can't contribute so just detach and
+		 * move on.
+		 */
+
+		if (PHJ_STRIPE_NUMBER(phase) > batch->maximum_stripe_number ||
+			PHJ_STRIPE_PHASE(phase) == PHJ_STRIPE_DONE)
+			return ExecHashTableDetachStripe(hashtable);
+	}
+	else if (hashtable->curstripe == PHANTOM_STRIPE)
+	{
+		sts_end_parallel_scan(outer_tuples);
+
+		/*
+		 * TODO: ideally this would go somewhere in the batch phase machine
+		 * Putting it in ExecHashTableDetachBatch didn't do the trick
+		 */
+		sb_end_read(hashtable->batches[batchno].sba);
+		return ExecHashTableDetachStripe(hashtable);
+	}
+
+	hashtable->curstripe = PHJ_STRIPE_NUMBER(BarrierPhase(stripe_barrier));
+
+	/*
+	 * The outer side is exhausted and either 1) the current stripe of the
+	 * inner side is exhausted and it is time to advance the stripe 2) the
+	 * last stripe of the inner side is exhausted and it is time to advance
+	 * the batch
+	 */
+	for (;;)
+	{
+		int			phase = BarrierPhase(stripe_barrier);
+
+		switch (PHJ_STRIPE_PHASE(phase))
+		{
+			case PHJ_STRIPE_ELECTING:
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_ELECT))
+				{
+					sts_reinitialize(outer_tuples);
+
+					/*
+					 * set the rewound flag back to false to prepare for the
+					 * next stripe
+					 */
+					sts_reset_rewound(inner_tuples);
+				}
+
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_RESETTING:
+				/* TODO: not needed for phantom stripe */
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_RESET);
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_LOADING:
+				{
+					MinimalTuple tuple;
+					tupleMetadata metadata;
+
+					/*
+					 * Start (or join in) loading the next stripe of inner
+					 * tuples.
+					 */
+
+					/*
+					 * I'm afraid there potential issue if a worker joins in
+					 * this phase and doesn't do the actions and resetting of
+					 * variables in sts_resume_parallel_scan. that is, if it
+					 * doesn't reset start_page and read_next_page in between
+					 * stripes. For now, call it. However, I think it might be
+					 * able to be removed.
+					 */
+
+					/*
+					 * TODO: sts_resume_parallel_scan() is overkill for stripe
+					 * 0 of each batch
+					 */
+					sts_resume_parallel_scan(inner_tuples);
+
+					while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+					{
+						/* The tuple is from a previous stripe. Skip it */
+						if (metadata.stripe < PHJ_STRIPE_NUMBER(phase))
+							continue;
+
+						/*
+						 * tuple from future. time to back out read_page. end
+						 * of stripe
+						 */
+						if (metadata.stripe > PHJ_STRIPE_NUMBER(phase))
+						{
+							sts_parallel_scan_rewind(inner_tuples);
+							continue;
+						}
+
+						ExecForceStoreMinimalTuple(tuple, hjstate->hj_HashTupleSlot, false);
+						ExecParallelHashTableInsertCurrentBatch(
+																hashtable,
+																hjstate->hj_HashTupleSlot,
+																metadata.hashvalue);
+					}
+					BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD);
+				}
+				/* FALLTHROUGH */
+
+			case PHJ_STRIPE_PROBING:
+
+				/*
+				 * do this again here in case a worker began the scan and then
+				 * entered after loading before probing
+				 */
+				sts_end_parallel_scan(inner_tuples);
+				sts_begin_parallel_scan(outer_tuples);
+				return true;
+
+			case PHJ_STRIPE_DONE:
+
+				if (PHJ_STRIPE_NUMBER(phase) >= batch->maximum_stripe_number)
+				{
+					/*
+					 * Handle the phantom stripe case.
+					 */
+					if (batch->hashloop_fallback && HJ_FILL_OUTER(hjstate))
+						goto fallback_stripe;
+
+					/* Return if this is the last stripe */
+					return ExecHashTableDetachStripe(hashtable);
+				}
+
+				/* this, effectively, increments the stripe number */
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+				{
+					/*
+					 * reset inner's hashtable and recycle the existing bucket
+					 * array.
+					 */
+					buckets = (dsa_pointer_atomic *)
+						dsa_get_address(hashtable->area, batch->buckets);
+
+					for (size_t i = 0; i < hashtable->nbuckets; ++i)
+						dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+				}
+
+				hashtable->curstripe++;
+				continue;
+
+			default:
+				elog(ERROR, "unexpected stripe phase %d. pid %i. batch %i.", BarrierPhase(stripe_barrier), MyProcPid, batchno);
+		}
+	}
+
+fallback_stripe:
+	accessor = &hashtable->batches[hashtable->curbatch];
+	sb_end_write(accessor->sba);
+
+	/* Ensure that only a single worker is attached to the barrier */
+	if (!BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+		return ExecHashTableDetachStripe(hashtable);
+
+
+	/* No one except the last worker will run this code */
+	hashtable->curstripe = PHANTOM_STRIPE;
+
+	/*
+	 * reset inner's hashtable and recycle the existing bucket array.
+	 */
+	buckets = (dsa_pointer_atomic *)
+		dsa_get_address(hashtable->area, batch->buckets);
+
+	for (size_t i = 0; i < hashtable->nbuckets; ++i)
+		dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+
+	/*
+	 * If all workers (including this one) have finished probing the batch,
+	 * one worker is elected to Loop through the outer match status files from
+	 * all workers that were attached to this batch Combine them into one
+	 * bitmap Use the bitmap, loop through the outer batch file again, and
+	 * emit unmatched tuples All workers will detach from the batch barrier
+	 * and the last worker will clean up the hashtable. All workers except the
+	 * last worker will end their scans of the outer and inner side. The last
+	 * worker will end its scan of the inner side
+	 */
+
+	sb_combine(accessor->sba);
+	sts_reinitialize(outer_tuples);
+
+	sts_begin_parallel_scan(outer_tuples);
+
+	return true;
+}
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1364,6 +1945,9 @@ ExecReScanHashJoin(HashJoinState *node)
 	node->hj_MatchedOuter = false;
 	node->hj_FirstOuterTupleSlot = NULL;
 
+	node->hj_CurNumOuterTuples = 0;
+	node->hj_CurOuterMatchStatus = 0;
+
 	/*
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
@@ -1394,7 +1978,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	TupleTableSlot *slot;
-	uint32		hashvalue;
 	int			i;
 
 	Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1402,6 +1985,8 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
+
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1410,17 +1995,23 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 								 hjstate->hj_OuterHashKeys,
 								 true,	/* outer tuple */
 								 HJ_FILL_OUTER(hjstate),
-								 &hashvalue))
+								 &metadata.hashvalue))
 		{
 			int			batchno;
 			int			bucketno;
 			bool		shouldFree;
+			SharedTuplestoreAccessor *accessor;
+
 			MinimalTuple mintup = ExecFetchSlotMinimalTuple(slot, &shouldFree);
 
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
-			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+			accessor = hashtable->batches[batchno].outer_tuples;
+
+			/* cannot count on deterministic order of tupleids */
+			metadata.tupleid = sts_increment_ntuples(accessor);
+
+			sts_puttuple(hashtable->batches[batchno].outer_tuples, &metadata.hashvalue, mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..5d7d57deac 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3779,8 +3779,17 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_ELECT:
 			event_name = "HashBatchElect";
 			break;
-		case WAIT_EVENT_HASH_BATCH_LOAD:
-			event_name = "HashBatchLoad";
+		case WAIT_EVENT_HASH_STRIPE_ELECT:
+			event_name = "HashStripeElect";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_RESET:
+			event_name = "HashStripeReset";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_LOAD:
+			event_name = "HashStripeLoad";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_PROBE:
+			event_name = "HashStripeProbe";
 			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATE:
 			event_name = "HashBuildAllocate";
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261..f11fe85aeb 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	logtape.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedbits.c b/src/backend/utils/sort/sharedbits.c
new file mode 100644
index 0000000000..f93f900d16
--- /dev/null
+++ b/src/backend/utils/sort/sharedbits.c
@@ -0,0 +1,285 @@
+#include "postgres.h"
+#include "storage/buffile.h"
+#include "utils/sharedbits.h"
+
+/*
+ * TODO: put a comment about not currently supporting parallel scan of the SharedBits
+ * To support parallel scan, need to introduce many more mechanisms
+ */
+
+/* Per-participant shared state */
+struct SharedBitsParticipant
+{
+	bool		present;
+	bool		writing;
+};
+
+/* Shared control object */
+struct SharedBits
+{
+	int			nparticipants;	/* Number of participants that can write. */
+	int64		nbits;
+	char		name[NAMEDATALEN];	/* A name for this bitstore. */
+
+	SharedBitsParticipant participants[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* backend-local state */
+struct SharedBitsAccessor
+{
+	int			participant;
+	SharedBits *bits;
+	SharedFileSet *fileset;
+	BufFile    *write_file;
+	BufFile    *combined;
+};
+
+SharedBitsAccessor *
+sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset)
+{
+	SharedBitsAccessor *accessor = palloc0(sizeof(SharedBitsAccessor));
+
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+SharedBitsAccessor *
+sb_initialize(SharedBits *sbits,
+			  int participants,
+			  int my_participant_number,
+			  SharedFileSet *fileset,
+			  char *name)
+{
+	SharedBitsAccessor *accessor;
+
+	sbits->nparticipants = participants;
+	strcpy(sbits->name, name);
+	sbits->nbits = 0;			/* TODO: maybe delete this */
+
+	accessor = palloc0(sizeof(SharedBitsAccessor));
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+/*  TODO: is "initialize_accessor" a clear enough API for this? (making the file)? */
+void
+sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits)
+{
+	char		name[MAXPGPATH];
+	uint32		num_to_write;
+
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, accessor->participant);
+
+	accessor->write_file =
+		BufFileCreateShared(accessor->fileset, name);
+
+	accessor->bits->participants[accessor->participant].present = true;
+	/* TODO: check this math. tuplenumber will be too high? */
+	num_to_write = nbits / 8 + 1;
+
+	/*
+	 * TODO: add tests that could exercise a problem with junk being written
+	 * to bitmap
+	 */
+
+	/*
+	 * TODO: is there a better way to write the bytes to the file without
+	 * calling BufFileWrite() like this? palloc()ing an undetermined number of
+	 * bytes feels like it is against the spirit of this patch to begin with,
+	 * but the many function calls seem expensive
+	 */
+	for (int i = 0; i < num_to_write; i++)
+	{
+		unsigned char byteToWrite = 0;
+
+		BufFileWrite(accessor->write_file, &byteToWrite, 1);
+	}
+
+	if (BufFileSeek(accessor->write_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+size_t
+sb_estimate(int participants)
+{
+	return offsetof(SharedBits, participants) + participants * sizeof(SharedBitsParticipant);
+}
+
+
+void
+sb_setbit(SharedBitsAccessor *accessor, uint64 bit)
+{
+	SharedBitsParticipant *const participant =
+	&accessor->bits->participants[accessor->participant];
+
+	/* TODO: use an unsigned int instead of a byte */
+	unsigned char current_outer_byte;
+
+	Assert(accessor->write_file);
+
+	if (!participant->writing)
+	{
+		participant->writing = true;
+	}
+
+	BufFileSeek(accessor->write_file, 0, bit / 8, SEEK_SET);
+	BufFileRead(accessor->write_file, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (bit % 8);
+
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_CUR);
+	BufFileWrite(accessor->write_file, &current_outer_byte, 1);
+}
+
+bool
+sb_checkbit(SharedBitsAccessor *accessor, uint32 n)
+{
+	bool		match;
+	uint32		bytenum = n / 8;
+	unsigned char bit = n % 8;
+	unsigned char byte_to_check = 0;
+
+	Assert(accessor->combined);
+
+	/* seek to byte to check */
+	if (BufFileSeek(accessor->combined,
+					0,
+					bytenum,
+					SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not rewind shared outer temporary file: %m")));
+	/* read byte containing ntuple bit */
+	if (BufFileRead(accessor->combined, &byte_to_check, 1) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	/* if bit is set */
+	match = ((byte_to_check) >> bit) & 1;
+
+	return match;
+}
+
+BufFile *
+sb_combine(SharedBitsAccessor *accessor)
+{
+	/*
+	 * TODO: this tries to close an outer match status file for each
+	 * participant in the tuplestore. technically, only participants in the
+	 * barrier could have outer match status files, however, all but one
+	 * participant continue on and detach from the barrier so we won't have a
+	 * reliable way to close only files for those attached to the barrier
+	 */
+	BufFile   **statuses;
+	BufFile    *combined_bitmap_file;
+	int			statuses_length;
+
+	int			nbparticipants = 0;
+
+	for (int l = 0; l < accessor->bits->nparticipants; l++)
+	{
+		SharedBitsParticipant participant = accessor->bits->participants[l];
+
+		if (participant.present)
+		{
+			Assert(!participant.writing);
+			nbparticipants++;
+		}
+	}
+	statuses = palloc(sizeof(BufFile *) * nbparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	statuses_length = 0;
+
+	for (int i = 0; i < accessor->bits->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+		BufFile    *file;
+
+		/* TODO: make a function that will do this */
+		snprintf(bitmap_filename, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, i);
+
+		if (!accessor->bits->participants[i].present)
+			continue;
+		file = BufFileOpenShared(accessor->fileset, bitmap_filename);
+		/* TODO: can we be sure that this file is at beginning? */
+		Assert(file);
+
+		statuses[statuses_length++] = file;
+	}
+
+	combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)	/* make it while not EOF */
+	{
+		/*
+		 * TODO: make this use an unsigned int instead of a byte so it isn't
+		 * so slow
+		 */
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	accessor->combined = combined_bitmap_file;
+	return combined_bitmap_file;
+}
+
+void
+sb_end_write(SharedBitsAccessor *sba)
+{
+	SharedBitsParticipant
+			   *const participant = &sba->bits->participants[sba->participant];
+
+	participant->writing = false;
+
+	/*
+	 * TODO: this should not be needed if flow is correct. need to fix that
+	 * and get rid of this check
+	 */
+	if (sba->write_file)
+		BufFileClose(sba->write_file);
+	sba->write_file = NULL;
+}
+
+void
+sb_end_read(SharedBitsAccessor *accessor)
+{
+	if (accessor->combined == NULL)
+		return;
+
+	BufFileClose(accessor->combined);
+	accessor->combined = NULL;
+}
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 6537a4303b..1269e70b3b 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -52,6 +52,7 @@ typedef struct SharedTuplestoreParticipant
 {
 	LWLock		lock;
 	BlockNumber read_page;		/* Page number for next read. */
+	bool		rewound;
 	BlockNumber npages;			/* Number of pages written. */
 	bool		writing;		/* Used only for assertions. */
 } SharedTuplestoreParticipant;
@@ -60,6 +61,7 @@ typedef struct SharedTuplestoreParticipant
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;	/* Number of tuples in this tuplestore. */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -85,6 +87,8 @@ struct SharedTuplestoreAccessor
 	char	   *read_buffer;	/* A buffer for loading tuples. */
 	size_t		read_buffer_size;
 	BlockNumber read_next_page; /* Lowest block we'll consider reading. */
+	BlockNumber start_page;		/* page to reset p->read_page to if back out
+								 * required */
 
 	/* State for writing. */
 	SharedTuplestoreChunk *write_chunk; /* Buffer for writing. */
@@ -137,6 +141,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -158,6 +163,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 		LWLockInitialize(&sts->participants[i].lock,
 						 LWTRANCHE_SHARED_TUPLESTORE);
 		sts->participants[i].read_page = 0;
+		sts->participants[i].rewound = false;
 		sts->participants[i].writing = false;
 	}
 
@@ -272,6 +278,45 @@ sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor)
 	accessor->read_participant = accessor->participant;
 	accessor->read_file = NULL;
 	accessor->read_next_page = 0;
+	accessor->start_page = 0;
+}
+
+void
+sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor)
+{
+	int			i PG_USED_FOR_ASSERTS_ONLY;
+	SharedTuplestoreParticipant *p;
+
+	/* End any existing scan that was in progress. */
+	sts_end_parallel_scan(accessor);
+
+	/*
+	 * Any backend that might have written into this shared tuplestore must
+	 * have called sts_end_write(), so that all buffers are flushed and the
+	 * files have stopped growing.
+	 */
+	for (i = 0; i < accessor->sts->nparticipants; ++i)
+		Assert(!accessor->sts->participants[i].writing);
+
+	/*
+	 * We will start out reading the file that THIS backend wrote.  There may
+	 * be some caching locality advantage to that.
+	 */
+
+	/*
+	 * TODO: does this still apply in the multi-stripe case? It seems like if
+	 * a participant file is exhausted for the current stripe it might be
+	 * better to remember that
+	 */
+	accessor->read_participant = accessor->participant;
+	accessor->read_file = NULL;
+	p = &accessor->sts->participants[accessor->read_participant];
+
+	/* TODO: find a better solution than this for resuming the parallel scan */
+	LWLockAcquire(&p->lock, LW_SHARED);
+	accessor->start_page = p->read_page;
+	LWLockRelease(&p->lock);
+	accessor->read_next_page = 0;
 }
 
 /*
@@ -290,6 +335,7 @@ sts_end_parallel_scan(SharedTuplestoreAccessor *accessor)
 		BufFileClose(accessor->read_file);
 		accessor->read_file = NULL;
 	}
+	accessor->start_page = 0;
 }
 
 /*
@@ -526,7 +572,13 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	for (;;)
 	{
 		/* Can we read more tuples from the current chunk? */
-		if (accessor->read_ntuples < accessor->read_ntuples_available)
+		/*
+		 * Added a check for accessor->read_file being present here, as it
+		 * became relevant for adaptive hashjoin. Not sure if this has other
+		 * consequences for correctness
+		 */
+
+		if (accessor->read_ntuples < accessor->read_ntuples_available && accessor->read_file)
 			return sts_read_tuple(accessor, meta_data);
 
 		/* Find the location of a new chunk to read. */
@@ -536,7 +588,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 		/* We can skip directly past overflow pages we know about. */
 		if (p->read_page < accessor->read_next_page)
 			p->read_page = accessor->read_next_page;
-		eof = p->read_page >= p->npages;
+		eof = p->read_page >= p->npages || p->rewound;
 		if (!eof)
 		{
 			/* Claim the next chunk. */
@@ -544,9 +596,22 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			/* Advance the read head for the next reader. */
 			p->read_page += STS_CHUNK_PAGES;
 			accessor->read_next_page = p->read_page;
+
+			/*
+			 * initialize start_page to the read_page this participant will
+			 * start reading from
+			 */
+			accessor->start_page = read_page;
 		}
 		LWLockRelease(&p->lock);
 
+		if (!eof)
+		{
+			char		name[MAXPGPATH];
+
+			sts_filename(name, accessor, accessor->read_participant);
+		}
+
 		if (!eof)
 		{
 			SharedTuplestoreChunk chunk_header;
@@ -610,6 +675,7 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 			if (accessor->read_participant == accessor->participant)
 				break;
 			accessor->read_next_page = 0;
+			accessor->start_page = 0;
 
 			/* Go around again, so we can get a chunk from this file. */
 		}
@@ -618,6 +684,48 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+void
+sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor)
+{
+	SharedTuplestoreParticipant *p =
+	&accessor->sts->participants[accessor->read_participant];
+
+	/*
+	 * Only set the read_page back to the start of the sts_chunk this worker
+	 * was reading if some other worker has not already done so. It could be
+	 * the case that this worker saw a tuple from a future stripe and another
+	 * worker did also in its sts_chunk and it already set read_page to its
+	 * start_page If so, we want to set read_page to the lowest value to
+	 * ensure that we read all tuples from the stripe (don't miss tuples)
+	 */
+	LWLockAcquire(&p->lock, LW_EXCLUSIVE);
+	p->read_page = Min(p->read_page, accessor->start_page);
+	p->rewound = true;
+	LWLockRelease(&p->lock);
+
+	accessor->read_ntuples_available = 0;
+	accessor->read_next_page = 0;
+}
+
+void
+sts_reset_rewound(SharedTuplestoreAccessor *accessor)
+{
+	for (int i = 0; i < accessor->sts->nparticipants; ++i)
+		accessor->sts->participants[i].rewound = false;
+}
+
+uint32
+sts_increment_ntuples(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ba661d32a6..0ba9d856c8 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -46,6 +46,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		settings;		/* print modified settings */
+	bool		usage;			/* print memory usage */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
 	int			indent;			/* current indentation level */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 79b634e8ed..9ef83e7a2e 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -142,6 +143,17 @@ typedef struct HashMemoryChunkData *HashMemoryChunk;
 /* tuples exceeding HASH_CHUNK_THRESHOLD bytes are put in their own chunk */
 #define HASH_CHUNK_THRESHOLD	(HASH_CHUNK_SIZE / 4)
 
+/*
+ * HashJoinTableData->curstripe the current stripe number
+ * The phantom stripe refers to the state of the inner side hashtable (empty)
+ * during the final scan of the outer batch file for a batch being processed
+ * using the hashloop fallback algorithm.
+ * In parallel-aware hash join, curstripe is in a detached state
+ * when the worker is not attached to the stripe_barrier.
+ */
+#define PHANTOM_STRIPE -2
+#define STRIPE_DETACHED -1
+
 /*
  * For each batch of a Parallel Hash Join, we have a ParallelHashJoinBatch
  * object in shared memory to coordinate access to it.  Since they are
@@ -152,6 +164,7 @@ typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
+	Barrier		stripe_barrier; /* synchronization for stripes */
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
@@ -160,6 +173,17 @@ typedef struct ParallelHashJoinBatch
 	size_t		old_ntuples;	/* number of tuples before repartitioning */
 	bool		space_exhausted;
 
+	/* Adaptive HashJoin */
+
+	/*
+	 * after finishing build phase, hashloop_fallback cannot change, and does
+	 * not require a lock to read
+	 */
+	bool		hashloop_fallback;
+	int			maximum_stripe_number;	/* the number of stripes in the batch */
+	size_t		estimated_stripe_size;	/* size of last stripe in batch */
+	LWLock		lock;
+
 	/*
 	 * Variable-sized SharedTuplestore objects follow this struct in memory.
 	 * See the accessor macros below.
@@ -177,10 +201,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -204,9 +235,19 @@ typedef struct ParallelHashJoinBatchAccessor
 	size_t		old_ntuples;	/* how many tuples before repartitioning? */
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
-	bool		done;			/* flag to remember that a batch is done */
+	int			done;			/* flag to remember that a batch is done */
+	/* -1 for not done, 0 for tentatively done, 1 for done */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
+
+	/*
+	 * All participants except the last worker working on a batch which has
+	 * fallen back to hashloop processing save the stripe barrier phase and
+	 * detach to avoid the deadlock hazard of waiting on a barrier after
+	 * tuples have been emitted.
+	 */
+	int			last_participating_stripe_phase;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -227,6 +268,18 @@ typedef enum ParallelHashGrowth
 	PHJ_GROWTH_DISABLED
 } ParallelHashGrowth;
 
+typedef enum ParallelHashJoinBatchAccessorStatus
+{
+	/* No more useful work can be done on this batch by this worker */
+	PHJ_BATCH_ACCESSOR_DONE,
+
+	/*
+	 * The worker has not yet checked this batch to see if it can do useful
+	 * work
+	 */
+	PHJ_BATCH_ACCESSOR_NOT_DONE
+}			ParallelHashJoinBatchAccessorStatus;
+
 /*
  * The shared state used to coordinate a Parallel Hash Join.  This is stored
  * in the DSM segment.
@@ -263,9 +316,18 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
-#define PHJ_BATCH_DONE					4
+#define PHJ_BATCH_STRIPING				2
+#define PHJ_BATCH_DONE					3
+
+/* The phases for probing each stripe of each batch used with stripe barriers */
+#define PHJ_STRIPE_INVALID_PHASE        -1
+#define PHJ_STRIPE_ELECTING				0
+#define PHJ_STRIPE_RESETTING			1
+#define PHJ_STRIPE_LOADING				2
+#define PHJ_STRIPE_PROBING				3
+#define PHJ_STRIPE_DONE				    4
+#define PHJ_STRIPE_NUMBER(n)            ((n) / 5)
+#define PHJ_STRIPE_PHASE(n)             ((n) % 5)
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
@@ -313,8 +375,6 @@ typedef struct HashJoinTableData
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
 
-	bool		growEnabled;	/* flag to shut off nbatch increases */
-
 	double		totalTuples;	/* # tuples obtained from inner plan */
 	double		partialTuples;	/* # tuples obtained from inner plan by me */
 	double		skewTuples;		/* # tuples inserted into skew tuples */
@@ -329,6 +389,18 @@ typedef struct HashJoinTableData
 	BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
 	BufFile   **outerBatchFile; /* buffered virtual temp file per batch */
 
+	/*
+	 * Adaptive hashjoin variables
+	 */
+	BufFile   **hashloopBatchFile;	/* outer match status files if fall back */
+	List	   *fallback_batches_stats; /* per hashjoin batch statistics */
+
+	/*
+	 * current stripe #; 0 during 1st pass, -1 (macro STRIPE_DETACHED) when
+	 * detached, -2 on phantom stripe (macro PHANTOM_STRIPE)
+	 */
+	int			curstripe;
+
 	/*
 	 * Info about the datatype-specific hash functions for the datatypes being
 	 * hashed. These are arrays of the same length as the number of hash join
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a97562e7a4..e72bd5702a 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "nodes/pg_list.h"
 
 
 typedef struct BufferUsage
@@ -39,6 +40,12 @@ typedef struct WalUsage
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
+typedef struct FallbackBatchStats
+{
+	int			batchno;
+	int			numstripes;
+} FallbackBatchStats;
+
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
 typedef enum InstrumentOption
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 64d2ce693c..f85308738b 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -31,6 +31,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+extern bool ExecHashTableDetachStripe(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5ab..0c0d87d1d3 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tts_tuplenum;	/* a tuple id for use when ctid cannot be used */
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,6 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
+	slot->tts_tuplenum = 0;		/* TODO: should this be done elsewhere? */
 
 	return slot;
 }
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f5dfa32d55..0f19e24929 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1957,6 +1957,10 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+	/* Adaptive Hashjoin variables */
+	int			hj_CurNumOuterTuples;	/* number of outer tuples in a batch */
+	unsigned int hj_CurOuterMatchStatus;
+	int			hj_EmitOuterTupleId;
 } HashJoinState;
 
 
@@ -2381,6 +2385,7 @@ typedef struct HashInstrumentation
 	int			nbatch;			/* number of batches at end of execution */
 	int			nbatch_original;	/* planned number of batches */
 	Size		space_peak;		/* peak memory usage in bytes */
+	List	   *fallback_batches_stats; /* per hashjoin batch stats */
 } HashInstrumentation;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..1195ee8c7a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -855,7 +855,10 @@ typedef enum
 	WAIT_EVENT_EXECUTE_GATHER,
 	WAIT_EVENT_HASH_BATCH_ALLOCATE,
 	WAIT_EVENT_HASH_BATCH_ELECT,
-	WAIT_EVENT_HASH_BATCH_LOAD,
+	WAIT_EVENT_HASH_STRIPE_ELECT,
+	WAIT_EVENT_HASH_STRIPE_RESET,
+	WAIT_EVENT_HASH_STRIPE_LOAD,
+	WAIT_EVENT_HASH_STRIPE_PROBE,
 	WAIT_EVENT_HASH_BUILD_ALLOCATE,
 	WAIT_EVENT_HASH_BUILD_ELECT,
 	WAIT_EVENT_HASH_BUILD_HASH_INNER,
diff --git a/src/include/utils/sharedbits.h b/src/include/utils/sharedbits.h
new file mode 100644
index 0000000000..de43279de8
--- /dev/null
+++ b/src/include/utils/sharedbits.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * sharedbits.h
+ *	  Simple mechanism for sharing bits between backends.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/sharedbits.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHAREDBITS_H
+#define SHAREDBITS_H
+
+#include "storage/sharedfileset.h"
+
+struct SharedBits;
+typedef struct SharedBits SharedBits;
+
+struct SharedBitsParticipant;
+typedef struct SharedBitsParticipant SharedBitsParticipant;
+
+struct SharedBitsAccessor;
+typedef struct SharedBitsAccessor SharedBitsAccessor;
+
+extern SharedBitsAccessor *sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset);
+extern SharedBitsAccessor *sb_initialize(SharedBits *sbits, int participants, int my_participant_number, SharedFileSet *fileset, char *name);
+extern void sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits);
+extern size_t sb_estimate(int participants);
+
+extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bit);
+extern bool sb_checkbit(SharedBitsAccessor *accessor, uint32 n);
+extern BufFile *sb_combine(SharedBitsAccessor *accessor);
+
+extern void sb_end_write(SharedBitsAccessor *sba);
+extern void sb_end_read(SharedBitsAccessor *accessor);
+
+#endif							/* SHAREDBITS_H */
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9754504cc5..99aead8a4a 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,17 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			stripe;		/* stripe number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -49,6 +60,8 @@ extern void sts_reinitialize(SharedTuplestoreAccessor *accessor);
 
 extern void sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor);
 
+extern void sts_resume_parallel_scan(SharedTuplestoreAccessor *accessor);
+
 extern void sts_end_parallel_scan(SharedTuplestoreAccessor *accessor);
 
 extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
@@ -58,4 +71,10 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+extern void sts_parallel_scan_rewind(SharedTuplestoreAccessor *accessor);
+
+extern void sts_reset_rewound(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_increment_ntuples(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 3a91c144a2..463e71238a 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -1013,3 +1013,1454 @@ WHERE
 (1 row)
 
 ROLLBACK;
+-- Serial Adaptive Hash Join
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+ANALYZE probeside, hashside_wide;
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Left Join (actual rows=215 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | btrim | id | hash |                 btrim                  
+------+-------+----+------+----------------------------------------
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    3 |       |  3 |    3 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+      |       |  1 |    1 | unmatched inner tuple in first stripe
+      |       |  1 |    1 | unmatched inner tuple in last stripe
+      |       |  1 |    1 | unmatched inner tuple in middle stripe
+(214 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Right Join (actual rows=214 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash |                 btrim                  
+------+-----------------------+----+------+----------------------------------------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+      |                       |  1 |    1 | unmatched inner tuple in first stripe
+      |                       |  1 |    1 | unmatched inner tuple in last stripe
+      |                       |  1 |    1 | unmatched inner tuple in middle stripe
+(218 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Full Join (actual rows=218 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join (actual rows=12 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash | btrim 
+------+-------
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    3 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+(12 rows)
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Anti Join (actual rows=4 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash |         btrim         
+------+-----------------------
+    1 | unmatched outer tuple
+    2 | 
+    5 | 
+    6 | unmatched outer tuple
+(4 rows)
+
+-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Gather (actual rows=215 loops=1)
+   Workers Planned: 1
+   Workers Launched: 1
+   ->  Parallel Hash Left Join (actual rows=108 loops=2)
+         Hash Cond: (probeside.a = hashside_wide.a)
+         ->  Parallel Seq Scan on probeside (actual rows=16 loops=1)
+         ->  Parallel Hash (actual rows=21 loops=2)
+               Buckets: 8 (originally 8)  Batches: 128 (originally 8)
+               Batch: 1  Stripes: 3
+               Batch: 6  Stripes: 2
+               ->  Parallel Seq Scan on hashside_wide (actual rows=42 loops=1)
+(11 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+rollback to settings;
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | batchno |      btrim      | id |  ctid  | hash | btrim 
+------+---------+-----------------+----+--------+------+-------
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (0,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (1,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (2,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (3,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (4,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (5,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (6,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (7,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (8,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (9,1)  |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (10,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (11,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (12,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (13,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (14,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (15,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (16,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (17,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (18,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (19,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (20,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (21,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (22,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (23,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (24,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (25,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 |                 |  1 | (26,1) |    0 | 
+    0 |       0 | unmatched outer |    |        |      | 
+(352 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 68c1a8c7b6..ab41b4d4c3 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -538,3 +538,149 @@ WHERE
     AND hjtest_1.a <> hjtest_2.b;
 
 ROLLBACK;
+
+-- Serial Adaptive Hash Join
+
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8098));
+
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+
+ANALYZE probeside, hashside_wide;
+
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+rollback to settings;
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0 SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0 SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide_batch0(a stub, id int);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0 SELECT '(0, "")', 1 FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+
+SELECT (probeside_batch0.a).hash, ((((probeside_batch0.a).hash << 7) >> 3) & 31) AS batchno, TRIM((probeside_batch0.a).value), hashside_wide_batch0.id, hashside_wide_batch0.ctid, (hashside_wide_batch0.a).hash, TRIM((hashside_wide_batch0.a).value)
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+ROLLBACK;
-- 
2.20.1

#59

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 5 years ago

In reply to: Melanie Plageman (#58)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Thu, Jun 25, 2020 at 03:09:44PM -0700, Melanie Plageman wrote:

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

I started looking at the patch to refresh my knowledge both of this
patch and parallel hash join, but I think it needs a rebase. The
changes in 7897e3bb90 apparently touched some of the code.

Thanks so much for the review, Tomas!

I've attached a rebased patch which also contains updates discussed
below.

Thanks.

I assume
you're working on a patch addressing the remaining TODOS, right?

I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.

Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those.

Otherwise, I will prioritize parallel batch 0 spilling.

Feel free to work on the batch 0 spilling, please. I still need to get
familiar with various parts of the parallel hash join etc. so I don't
have any immediate feedback which TODOs to work on first.

David Kimura plans to do a bit of work on on parallel hash join batch 0
spilling tomorrow. Whatever is left after that, I will pick up next
week. Parallel hash join batch 0 spilling is the last large TODO that I
had.

My plan was to then focus on the feedback (either about which TODOs are
most important or outside of the TODOs I've identified) I get from you
and anyone else who reviews this.

OK.

I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear stripe I think about RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the naming and
only tweak it at the end, not to disrupt reviews unnecessarily.

I hear you about "stripe". I still quite like it, especially as compared
to its predecessor (originally, I called them chunks -- which is
impossible given that SharedTuplestoreChunks are a thing).

I don't think using chunks in one place means we can't use it elsewhere
in a different context. I'm sure we have "chunks" in other places. But
let's not bikeshed on this too much.

For ease of review, as you mentioned, I will keep the name for now. I am
open to changing it later, though.

I've been soliciting ideas for alternatives and, so far, folks have
suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm
still on team "stripe" though, as it stands.

;-)

nodeHash.c
----------

1) MultiExecPrivateHash says this

/*
* Not subject to skew optimization, so either insert normally
* or save to batch file if it belongs to another stripe
*/

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.

I agree this was confusing.

"belongs to another stripe" meant here that if batch 0 falls back and we
are still loading it, once we've filled up work_mem, we need to start
saving those tuples to a spill file for batch 0. I've changed the
comment to this:
-        * or save to batch file if it belongs to another stripe
+       * or save to batch file if batch 0 falls back and we have
+       * already filled the hashtable up to space_allowed.

OK. Silly question - what does "batch 0 falls back" mean? Does it mean
that we realized the hash table for batch 0 would not fit into work_mem,
so we switched to the "hashloop" strategy?

2) I find the fields hashloop_fallback rather confusing. We have one in
HashJoinTable (and it's array of BufFile items) and another one in
ParallelHashJoinBatch (this time just bool).

I think HashJoinTable should be renamed to hashloopBatchFile (similarly
to the other BufFile arrays).

I think you are right about the name. I've changed the name in
HashJoinTableData to hashloopBatchFile.

The array of BufFiles hashloop_fallback was only used by serial
hashjoin. The boolean hashloop_fallback variable is used only by
parallel hashjoin.

The reason I had them named the same thing is that I thought it would be
nice to have a variable with the same name to indicate if a batch "fell
back" for both parallel and serial hashjoin--especially since we check
it in the main hashjoin state machine used by parallel and serial
hashjoin.

In serial hashjoin, the BufFiles aren't identified by name, so I kept
them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
the status saved (in the struct).
So, both represented the fall back status of a batch.

However, I agree with you, so I've renamed the serial one to
hashloopBatchFile.

Although I'm not sure why we even need
this file, when we have innerBatchFile? BufFile(s) are not exactly free,
in fact it's one of the problems for hashjoins with many batches.

Interesting -- it didn't even occur to me to combine the bitmap with the
inner side batch file data.
It definitely seems like a good idea to save the BufFile given that so
little data will likely go in it and that it has a 1-1 relationship with
inner side batches.

How might it work? Would you reserve some space at the beginning of the
file? When would you reserve the bytes (before adding tuples you won't
know how many bytes you need, so it might be hard to make sure there is
enough space.) Would all inner side files have space reserved or just
fallback batches?

Oh! So the hashloopBatchFile is only used for the bitmap? I haven't
realized that. In that case it probably makes sense to keep it separate
from the files with spilled tuples, interleaving that somehow would be
way too complex, I think.

However, do we need an array of those files? I thought we only need the
bitmap until we process all rows from each "stripe" and then we can
throw it away, right? Which would also mean we don't need to worry about
the memory usage too much, because the 8kB buffer will go away after
calling BufFileClose.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#60

Melanie Plageman

melanieplageman@gmail.com

over 5 years ago

In reply to: Tomas Vondra (#59)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Attached is the current version of adaptive hash join with two
significant changes as compared to v10:

1) Implements spilling of batch 0 for parallel-aware parallel hash join.
2) Moves "striping" of fallback batches from "build" to "load" stage
It includes several smaller changes as well.

Batch 0 spilling is necessary when the hash table for batch 0 cannot fit
in memory and allows us to use the "hashloop" strategy for batch 0.

Spilling of batch 0 necessitated the addition of a few new pieces of
code. The most noticeable one is probably the hash table eviction phase
machine. If batch 0 was marked as a "fallback" batch in
ExecParallelHashIncreaseNumBatches() PHJ_GROW_BATCHES_DECIDING phase,
any future attempt to insert a tuple that would exceed the space_allowed
triggers eviction of the hash table.
ExecParallelHashTableEvictBatch0() will evict all batch 0 tuples in
memory into spill files in a batch 0 inner SharedTuplestore.

This means that when repartitioning batch 0 in the future, both the
batch 0 spill file and the hash table need to be drained and relocated
into the new generation of batches and the hash table. If enough memory
is freed up from batch 0 tuples relocating to other batches, then it is
possible that tuples from the batch 0 spill files will go back into the
hash table.
After batch 0 is evicted, the build stage proceeds as normal.

The main alternative to this design that we considered was to "close" the
hash table after it is full. That is, if batch 0 has been marked to fall
back, once it is full, all subsequent tuples pulled from the outer child
would bypass the hash table altogether and go directly into a spill
file.

We chose the hash table eviction route because I thought it might be
better to write chunks of the hashtable into a file together rather than
sporadically write new batch 0 tuples to spill files as they are
pulled out of the child node. However, since the same sts_puttuple() API
is used in both cases, it is highly possible this won't actually matter
and we will do the same amount of I/O.
Both designs involved changing the flow of the code for inserting and
repartitioning tuples, so I figured that I would choose one, do some
testing, and try the other one later after more discussion and review.

This patch also introduces a significant change to how tuples are split
into stripes. Previously, during the build stage, tuples were written to
spill files in the SharedTuplestore with a stripe number in the metadata
section of the MinimalTuple.
For a batch that had been designated a "fallback" batch,
once the space_allowed had been exhausted, the shared stripe number
would be incremented and the new stripe number was written in the tuple
metadata to the files. Then, during loading, tuples were only loaded
into the hashtable if their stripe number matched the current stripe number.

This had several downsides. It introduced a couple new shared variables --
the current stripe number for the batch and its size.
In master, during the normal mode of the "build" stage, shared variables
for the size or estimated_size of the batch are checked on each
allocation of a STS Chunk or HashMemoryChunk, however, during
repartitioning, because bailing out early was not an option, workers
could use backend-local variables to keep track of size and merge them
at the end of repartitioning. This wasn't possible if we needed accurate
stripe numbers written into the tuples. This meant that we had to add
new shared variable accesses to repartitioning.

To avoid this, Deep and I worked on moving the "striping" logic from the
"build" stage to the "load" stage for batches. Serial hash join already
did striping in this way. This patch now pauses loading once the
space_allowed has been exhausted for parallel hash join as well. The
tricky part was keeping track of multiple read_pages for a given file.

When tuples had explicit stripe numbers, we simply rewound the read_page
in the SharedTuplestoreParticipant to the earliest SharedTuplestoreChunk
that anyone had read and relied on the stripe numbers to avoid loading
tuples more than once. Now, each worker participating in reading from
the SharedTuplestore could have received a read_page "assignment" (four
blocks, currently) and then failed to allocate a HashMemoryChunk. We
cannot risk rewinding the read_page because there could be
SharedTuplestoreChunks that have already been loaded in between ones
that have not.

The design we went with was to "overflow" the tuples from this
SharedTuplestoreChunk onto the end of the write_file which this worker
wrote--if it participated in writing this STS--or by making a new
write_file if it did not participate in writing. This entailed keeping
track of who participated in the write phase. SharedTuplestore
participation now has three "modes"-- reading, writing, and appending.
During appending, workers can write to their own file and read from any
file.

One of the alternative designs I considered was to store the offset and
length of leftover blocks that still needed to be loaded into the hash
table in the SharedTuplestoreParticipant data structure. Then, workers
would pick up these "assignments". It is basically a
SharedTuplestoreParticipant work queue.
The main stumbling block I faced here was allocating a variable number of
things in shared memory. You don't know how many read participants will
read from the file and how many stripes there will be (until you've
loaded the file). In the worst case, you would need space for
nparticipants * nstripes - 1 offset/length combos.
Since I don't know how many stripes I have until I've loaded the file, I
can't allocate shared memory for this up front.

The downside of the "append overflow" design is that, currently, all
workers participating in loading a fallback batch write an overflow
chunk for every fallback stripe.
It seems like something could be done to check if there is space in the
hashtable before accepting an assignment of blocks to read from the
SharedTuplestore and moving the shared variable read_page. It might
reduce instances in which workers have to overflow. However, I tried
this and it is very intrusive on the SharedTuplestore API (it would have
to know about the hash table). Also, oversized tuples would not be
addressed by this pre-assignment check since memory is allocated a
HashMemoryChunk at a time. So, even if this was solved, you would need
overflow functionality

One note is that I had to comment out a test in join_hash.sql which
inserts tuples larger than work_mem in size (each), because it no longer
successfully executes.
Also, the stripe number is not deterministic, so sometimes the tests that
compare fallback batches' number of stripes fail (also in join_hash.sql).

Major outstanding TODOs:
--
- Potential redesign of stripe loading pausing and resumption
- The instrumentation for parallel fallback batches has some problems
- Deadlock hazard avoidance design of the stripe barrier still needs work
- Assorted smaller TODOs in the code

On Thu, Jun 25, 2020 at 5:22 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Thu, Jun 25, 2020 at 03:09:44PM -0700, Melanie Plageman wrote:

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra <

tomas.vondra@2ndquadrant.com>

wrote:

I assume
you're working on a patch addressing the remaining TODOS, right?

I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.

Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those.

Otherwise, I will prioritize parallel batch 0 spilling.

Feel free to work on the batch 0 spilling, please. I still need to get
familiar with various parts of the parallel hash join etc. so I don't
have any immediate feedback which TODOs to work on first.

David Kimura plans to do a bit of work on on parallel hash join batch 0
spilling tomorrow. Whatever is left after that, I will pick up next
week. Parallel hash join batch 0 spilling is the last large TODO that I
had.

My plan was to then focus on the feedback (either about which TODOs are
most important or outside of the TODOs I've identified) I get from you
and anyone else who reviews this.

OK.

See list of patch contents above.

Tomas, I wasn't sure if you would want a patchset which included a
commit with just the differences between this version and v10 since you
had already started reviewing it.
This commit [1]https://github.com/melanieplageman/postgres/commit/c6843ef9e0767f80d928d87bdb1078c9d20346e3 is on a branch off of my fork that has just the delta
between v10 and v11.
As a warning, I have added a few updates to comments and such after
squashing the two in my current branch (which is what is in this patch).
I didn't intend to maintain the commits separately as I felt it would be
more confusing for other reviewers.

nodeHash.c
----------

1) MultiExecPrivateHash says this

/*
* Not subject to skew optimization, so either insert normally
* or save to batch file if it belongs to another stripe
*/

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.

I agree this was confusing.

"belongs to another stripe" meant here that if batch 0 falls back and we
are still loading it, once we've filled up work_mem, we need to start
saving those tuples to a spill file for batch 0. I've changed the
comment to this:
-        * or save to batch file if it belongs to another stripe
+       * or save to batch file if batch 0 falls back and we have
+       * already filled the hashtable up to space_allowed.
OK. Silly question - what does "batch 0 falls back" mean? Does it mean
that we realized the hash table for batch 0 would not fit into work_mem,
so we switched to the "hashloop" strategy?

Exactly.

2) I find the fields hashloop_fallback rather confusing. We have one in
HashJoinTable (and it's array of BufFile items) and another one in
ParallelHashJoinBatch (this time just bool).

I think HashJoinTable should be renamed to hashloopBatchFile (similarly
to the other BufFile arrays).

I think you are right about the name. I've changed the name in
HashJoinTableData to hashloopBatchFile.

The array of BufFiles hashloop_fallback was only used by serial
hashjoin. The boolean hashloop_fallback variable is used only by
parallel hashjoin.

The reason I had them named the same thing is that I thought it would be
nice to have a variable with the same name to indicate if a batch "fell
back" for both parallel and serial hashjoin--especially since we check
it in the main hashjoin state machine used by parallel and serial
hashjoin.

In serial hashjoin, the BufFiles aren't identified by name, so I kept
them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
the status saved (in the struct).
So, both represented the fall back status of a batch.

However, I agree with you, so I've renamed the serial one to
hashloopBatchFile.

OK

Although I'm not sure why we even need
this file, when we have innerBatchFile? BufFile(s) are not exactly free,
in fact it's one of the problems for hashjoins with many batches.

Interesting -- it didn't even occur to me to combine the bitmap with the
inner side batch file data.
It definitely seems like a good idea to save the BufFile given that so
little data will likely go in it and that it has a 1-1 relationship with
inner side batches.

How might it work? Would you reserve some space at the beginning of the
file? When would you reserve the bytes (before adding tuples you won't
know how many bytes you need, so it might be hard to make sure there is
enough space.) Would all inner side files have space reserved or just
fallback batches?

Oh! So the hashloopBatchFile is only used for the bitmap? I haven't
realized that. In that case it probably makes sense to keep it separate
from the files with spilled tuples, interleaving that somehow would be
way too complex, I think.

However, do we need an array of those files? I thought we only need the
bitmap until we process all rows from each "stripe" and then we can
throw it away, right? Which would also mean we don't need to worry about
the memory usage too much, because the 8kB buffer will go away after
calling BufFileClose.

Good point! I will try this change.

Regards,
Melanie (VMWare)

[1]: https://github.com/melanieplageman/postgres/commit/c6843ef9e0767f80d928d87bdb1078c9d20346e3
https://github.com/melanieplageman/postgres/commit/c6843ef9e0767f80d928d87bdb1078c9d20346e3

Attachments:

v11-0001-Implement-Adaptive-Hashjoin.patchtext/x-patch; charset=US-ASCII; name=v11-0001-Implement-Adaptive-Hashjoin.patchDownload

From 051185fcbb8acfdfd44af0cafbb7953bed363363 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 31 Aug 2020 11:53:32 -0700
Subject: [PATCH v11] Implement Adaptive Hashjoin

If the inner side tuples of a hashjoin will not fit in memory, the
hashjoin can be executed in multiple batches. If the statistics on the
inner side relation are accurate, planner chooses a multi-batch
strategy and sets the number of batches.
The query executor measures the real size of the hashtable and increases
the number of batches if the hashtable grows too large.

The number of batches is always a power of two, so an increase in the
number of batches doubles it.

Serial hashjoin measures batch size lazily -- waiting until it is
loading a batch to determine if it will fit in memory.

Parallel hashjoin, on the other hand, completes all changes to the
number of batches during the build phase. If it doubles the number of
batches, it dumps all the tuples out, reassigns them to batches,
measures each batch, and checks that it will fit in the space allowed.

In both cases, the executor currently makes a best effort. If a
particular batch won't fit in memory, and, upon changing the number of
batches none of the tuples move to a new batch, the executor disables
growth in the number of batches globally. After growth is disabled, all
batches that would have previously triggered an increase in the number
of batches instead exceed the space allowed.

There is no mechanism to perform a hashjoin within memory constraints if
a run of tuples hash to the same batch. Also, hashjoin will continue to
double the number of batches if *some* tuples move each time -- even if
the batch will never fit in memory -- resulting in an explosion in the
number of batches (affecting performance negatively for multiple
reasons).

Adaptive hashjoin is a mechanism to process a run of inner side tuples
with join keys which hash to the same batch in a manner that is
efficient and respects the space allowed.

When an offending batch causes the number of batches to be doubled and
some percentage of the tuples would not move to a new batch, that batch
can be marked to "fall back". This mechanism replaces serial hashjoin's
"grow_enabled" flag and replaces part of the functionality of parallel
hashjoin's "growth = PHJ_GROWTH_DISABLED" flag. However, instead of
disabling growth in the number of batches for all batches, it only
prevents this batch from causing another increase in the number of
batches.

When the inner side of this batch is loaded into memory, stripes of
arbitrary tuples totaling work_mem in size are loaded into the
hashtable. After probing this stripe, the outer side batch is rewound
and the next stripe is loaded. Each stripe of inner is probed until all
tuples have been processed.

Tuples that match are emitted (depending on the join semantics of the
particular join type) during probing of a stripe. In order to make
left outer join work, unmatched tuples cannot be emitted NULL-extended
until all stripes have been probed. To address this, a bitmap is created
with a bit for each tuple of the outer side. If a tuple on the outer
side matches a tuple from the inner, the corresponding bit is set. At
the end of probing all stripes, the executor scans the bitmap and emits
unmatched outer tuples.

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
Co-authored-by: David Kimura <dkimura@pivotal.io>
Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
---
 src/backend/commands/explain.c            |   43 +-
 src/backend/executor/nodeHash.c           |  749 ++++++--
 src/backend/executor/nodeHashjoin.c       |  794 ++++++--
 src/backend/postmaster/pgstat.c           |   31 +-
 src/backend/utils/sort/Makefile           |    1 +
 src/backend/utils/sort/sharedbits.c       |  288 +++
 src/backend/utils/sort/sharedtuplestore.c |   96 +-
 src/include/commands/explain.h            |    1 +
 src/include/executor/hashjoin.h           |  132 +-
 src/include/executor/instrument.h         |    7 +
 src/include/executor/nodeHash.h           |    9 +-
 src/include/executor/tuptable.h           |    2 +
 src/include/nodes/execnodes.h             |    5 +
 src/include/pgstat.h                      |   11 +-
 src/include/utils/sharedbits.h            |   39 +
 src/include/utils/sharedtuplestore.h      |   21 +
 src/test/regress/expected/join_hash.out   | 2024 ++++++++++++++++++++-
 src/test/regress/sql/join_hash.sql        |  214 ++-
 18 files changed, 4173 insertions(+), 294 deletions(-)
 create mode 100644 src/backend/utils/sort/sharedbits.c
 create mode 100644 src/include/utils/sharedbits.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c98c9b5547..1ce37dc4e2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -185,6 +185,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			es->wal = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "settings") == 0)
 			es->settings = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "usage") == 0)
+			es->usage = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "timing") == 0)
 		{
 			timing_set = true;
@@ -308,6 +310,7 @@ NewExplainState(void)
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
+	es->usage = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
@@ -3011,22 +3014,50 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
 				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)",
 							 hinstrument.nbuckets,
 							 hinstrument.nbuckets_original,
 							 hinstrument.nbatch,
-							 hinstrument.nbatch_original,
-							 spacePeakKb);
+							 hinstrument.nbatch_original);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Batch: %d  Stripes: %d\n", fbs->batchno, fbs->numstripes);
+			}
 		}
 		else
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
-							 hinstrument.nbuckets, hinstrument.nbatch,
-							 spacePeakKb);
+							 "Buckets: %d  Batches: %d",
+							 hinstrument.nbuckets, hinstrument.nbatch);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batch: %d  Stripes: %d\n",
+								 fbs->batchno,
+								 fbs->numstripes);
+			}
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index ea69eeb2a1..8a62c0d2dd 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -60,6 +60,7 @@ static void *dense_alloc(HashJoinTable hashtable, Size size);
 static HashJoinTuple ExecParallelHashTupleAlloc(HashJoinTable hashtable,
 												size_t size,
 												dsa_pointer *shared);
+static void ExecParallelHashTableEvictBatch0(HashJoinTable hashtable);
 static void MultiExecPrivateHash(HashState *node);
 static void MultiExecParallelHash(HashState *node);
 static inline HashJoinTuple ExecParallelHashFirstTuple(HashJoinTable table,
@@ -72,6 +73,9 @@ static inline void ExecParallelHashPushTuple(dsa_pointer_atomic *head,
 static void ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch);
 static void ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable);
 static void ExecParallelHashRepartitionFirst(HashJoinTable hashtable);
+static void ExecParallelHashRepartitionBatch0Tuple(HashJoinTable hashtable,
+												   MinimalTuple tuple,
+												   uint32 hashvalue);
 static void ExecParallelHashRepartitionRest(HashJoinTable hashtable);
 static HashMemoryChunk ExecParallelHashPopChunkQueue(HashJoinTable table,
 													 dsa_pointer *shared);
@@ -184,13 +188,53 @@ MultiExecPrivateHash(HashState *node)
 			}
 			else
 			{
-				/* Not subject to skew optimization, so insert normally */
-				ExecHashTableInsert(hashtable, slot, hashvalue);
+				/*
+				 * Not subject to skew optimization, so either insert normally
+				 * or save to batch file if batch 0 falls back and we have
+				 * already filled the hashtable up to space_allowed.
+				 */
+				int			bucketno;
+				int			batchno;
+				bool		shouldFree;
+				MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+										  &bucketno, &batchno);
+
+				/*
+				 * If we set batch 0 to fallback on the previous tuple Save
+				 * the tuples in this batch which will not fit in the
+				 * hashtable should I be checking that hashtable->curstripe !=
+				 * 0?
+				 */
+				if (hashtable->hashloopBatchFile && hashtable->hashloopBatchFile[0])
+					ExecHashJoinSaveTuple(tuple,
+										  hashvalue,
+										  &hashtable->innerBatchFile[batchno]);
+				else
+					ExecHashTableInsert(hashtable, slot, hashvalue);
+
+				if (shouldFree)
+					heap_free_minimal_tuple(tuple);
 			}
 			hashtable->totalTuples += 1;
 		}
 	}
 
+	/*
+	 * If batch 0 fell back, rewind the inner side file where we saved the
+	 * tuples which did not fit in memory to prepare it for loading upon
+	 * finishing probing stripe 0 of batch 0
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[0])
+	{
+		if (BufFileSeek(hashtable->innerBatchFile[0], 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
+	}
+
+
 	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
 	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
 		ExecHashIncreaseNumBuckets(hashtable);
@@ -319,9 +363,9 @@ MultiExecParallelHash(HashState *node)
 				 * are now fixed.  While building them we made sure they'd fit
 				 * in our memory budget when we load them back in later (or we
 				 * tried to do that and gave up because we detected extreme
-				 * skew).
+				 * skew and thus marked them to fall back).
 				 */
-				pstate->growth = PHJ_GROWTH_DISABLED;
+				pstate->growth = PHJ_GROWTH_LOADING;
 			}
 	}
 
@@ -496,12 +540,14 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	hashtable->curbatch = 0;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
-	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
 	hashtable->partialTuples = 0;
 	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
+	hashtable->hashloopBatchFile = NULL;
+	hashtable->fallback_batches_stats = NULL;
+	hashtable->curstripe = STRIPE_DETACHED;
 	hashtable->spaceUsed = 0;
 	hashtable->spacePeak = 0;
 	hashtable->spaceAllowed = space_allowed;
@@ -573,6 +619,8 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloopBatchFile = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
@@ -856,18 +904,19 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	int			i;
 
 	/*
-	 * Make sure all the temp files are closed.  We skip batch 0, since it
-	 * can't have any temp files (and the arrays might not even exist if
-	 * nbatch is only 1).  Parallel hash joins don't use these files.
+	 * Make sure all the temp files are closed.  Parallel hash joins don't use
+	 * these files.
 	 */
 	if (hashtable->innerBatchFile != NULL)
 	{
-		for (i = 1; i < hashtable->nbatch; i++)
+		for (i = 0; i < hashtable->nbatch; i++)
 		{
 			if (hashtable->innerBatchFile[i])
 				BufFileClose(hashtable->innerBatchFile[i]);
 			if (hashtable->outerBatchFile[i])
 				BufFileClose(hashtable->outerBatchFile[i]);
+			if (hashtable->hashloopBatchFile[i])
+				BufFileClose(hashtable->hashloopBatchFile[i]);
 		}
 	}
 
@@ -878,6 +927,18 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/*
+ * Threshhold for tuple relocation during batch split for parallel and serial
+ * hashjoin.
+ * While growing the number of batches, for the batch which triggered the growth,
+ * if more than MAX_RELOCATION % of its tuples move to its child batch, then
+ * it likely has skewed data and so the child batch (the new home to the skewed
+ * tuples) will be marked as a "fallback" batch and processed using the hashloop
+ * join algorithm. The reverse is true as well: if more than MAX_RELOCATION
+ * remain in the parent, it too should be marked to "fallback".
+ */
+#define MAX_RELOCATION 0.8
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
@@ -888,14 +949,19 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 {
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
+	int			childbatch;
 	int			nbatch;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
 	HashMemoryChunk oldchunks;
+	int			curbatch_outgoing_tuples;
+	int			childbatch_outgoing_tuples;
+	int			target_batch;
+	FallbackBatchStats *fallback_batch_stats;
+	size_t		batchSize = 0;
 
-	/* do nothing if we've decided to shut off growth */
-	if (!hashtable->growEnabled)
+	if (hashtable->hashloopBatchFile && hashtable->hashloopBatchFile[curbatch])
 		return;
 
 	/* safety check to avoid overflow */
@@ -919,6 +985,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
+		hashtable->hashloopBatchFile = (BufFile **)
+			palloc0(nbatch * sizeof(BufFile *));
 		/* time to establish the temp tablespaces, too */
 		PrepareTempTablespaces();
 	}
@@ -929,10 +997,14 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			repalloc(hashtable->innerBatchFile, nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			repalloc(hashtable->outerBatchFile, nbatch * sizeof(BufFile *));
+		hashtable->hashloopBatchFile = (BufFile **)
+			repalloc(hashtable->hashloopBatchFile, nbatch * sizeof(BufFile *));
 		MemSet(hashtable->innerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
 		MemSet(hashtable->outerBatchFile + oldnbatch, 0,
 			   (nbatch - oldnbatch) * sizeof(BufFile *));
+		MemSet(hashtable->hashloopBatchFile + oldnbatch, 0,
+			   (nbatch - oldnbatch) * sizeof(BufFile *));
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -944,6 +1016,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * no longer of the current batch.
 	 */
 	ninmemory = nfreed = 0;
+	curbatch_outgoing_tuples = childbatch_outgoing_tuples = 0;
+	childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;
 
 	/* If know we need to resize nbuckets, we can do it while rebatching. */
 	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
@@ -990,7 +1064,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
 
-			if (batchno == curbatch)
+			if (batchno == curbatch && (curbatch != 0 || batchSize + hashTupleSize < hashtable->spaceAllowed))
 			{
 				/* keep tuple in memory - copy it into the new chunk */
 				HashJoinTuple copyTuple;
@@ -1001,17 +1075,29 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* and add it back to the appropriate bucket */
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
+				curbatch_outgoing_tuples++;
+				batchSize += hashTupleSize;
 			}
 			else
 			{
 				/* dump it out */
-				Assert(batchno > curbatch);
+				Assert(batchno > curbatch || batchSize + hashTupleSize >= hashtable->spaceAllowed);
 				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
 									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
 
 				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
+
+				/*
+				 * TODO: what to do about tuples that don't go to the child
+				 * batch or stay in the current batch? (this is why we are
+				 * counting tuples to child and curbatch with two diff
+				 * variables in case the tuples go to a batch that isn't the
+				 * child)
+				 */
+				if (batchno == childbatch)
+					childbatch_outgoing_tuples++;
 			}
 
 			/* next tuple in this chunk */
@@ -1032,21 +1118,33 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 #endif
 
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
-	 * further expansion of nbatch.  This situation implies that we have
-	 * enough tuples of identical hashvalues to overflow spaceAllowed.
-	 * Increasing nbatch will not fix it since there's no way to subdivide the
-	 * group any more finely. We have to just gut it out and hope the server
-	 * has enough RAM.
+	 * The same batch should not be marked to fall back more than once
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
-	{
-		hashtable->growEnabled = false;
 #ifdef HJDEBUG
-		printf("Hashjoin %p: disabling further increase of nbatch\n",
-			   hashtable);
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("childbatch %i targeted to fallback.", childbatch);
+	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	}
+
+	/*
+	 * If too many tuples remain in the parent or too many tuples migrate to
+	 * the child, there is likely skew and continuing to increase the number
+	 * of batches will not help. Mark the batch which contains the skewed
+	 * tuples to be processed with block nested hashloop join.
+	 */
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
+		target_batch = childbatch;
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
+		target_batch = curbatch;
+	else
+		return;
+	hashtable->hashloopBatchFile[target_batch] = BufFileCreateTemp(false);
+
+	fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = target_batch;
+	fallback_batch_stats->numstripes = 0;
+	hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
 }
 
 /*
@@ -1199,6 +1297,11 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 			ExecParallelHashTableSetCurrentBatch(hashtable, 0);
 			/* Then partition, flush counters. */
 			ExecParallelHashRepartitionFirst(hashtable);
+
+			/*
+			 * TODO: add a debugging check that confirms that all the tuples
+			 * from the old generation are present in the new generation
+			 */
 			ExecParallelHashRepartitionRest(hashtable);
 			ExecParallelHashMergeCounters(hashtable);
 			/* Wait for the above to be finished. */
@@ -1217,7 +1320,6 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 									 WAIT_EVENT_HASH_GROW_BATCHES_DECIDE))
 			{
 				bool		space_exhausted = false;
-				bool		extreme_skew_detected = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
@@ -1228,27 +1330,83 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 				{
 					ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
 
+					/*
+					 * All batches were just created anew during
+					 * repartitioning
+					 */
+					Assert(!hashtable->batches[i].shared->hashloop_fallback);
+
+					/*
+					 * At the time of repartitioning, each batch updates its
+					 * estimated_size to reflect the size of the batch file on
+					 * disk. It is also updated when increasing preallocated
+					 * space in ExecParallelHashTuplePrealloc().
+					 *
+					 * Batch 0 is inserted into memory during the build stage,
+					 * it can spill to a file, so the size member, which
+					 * reflects the part of batch 0 in memory should never
+					 * exceed the space_allowed.
+					 */
+					Assert(batch->size <= pstate->space_allowed);
+
 					if (batch->space_exhausted ||
 						batch->estimated_size > pstate->space_allowed)
 					{
 						int			parent;
+						float		frac_moved;
 
 						space_exhausted = true;
 
+						parent = i % pstate->old_nbatch;
+						frac_moved = batch->ntuples / (float) hashtable->batches[parent].shared->old_ntuples;
+
 						/*
-						 * Did this batch receive ALL of the tuples from its
-						 * parent batch?  That would indicate that further
-						 * repartitioning isn't going to help (the hash values
-						 * are probably all the same).
+						 * If too many tuples remain in the parent or too many
+						 * tuples migrate to the child, there is likely skew
+						 * and continuing to increase the number of batches
+						 * will not help. Mark the batch which contains the
+						 * skewed tuples to be processed with block nested
+						 * hashloop join.
 						 */
-						parent = i % pstate->old_nbatch;
-						if (batch->ntuples == hashtable->batches[parent].shared->old_ntuples)
-							extreme_skew_detected = true;
+						if (frac_moved >= MAX_RELOCATION)
+						{
+							batch->hashloop_fallback = true;
+							space_exhausted = false;
+						}
 					}
+
+					/*
+					 * If all of the tuples in the hashtable were put back in
+					 * the hashtable during repartitioning, mark this batch as
+					 * a fallback batch so that we will evict the tuples to a
+					 * spill file were we to run out of space again This has
+					 * the problem of wasting a lot of time during the probe
+					 * phase if it turns out that we never try and allocate
+					 * any more memory in the hashtable.
+					 *
+					 * TODO: It might be worth doing something to indicate
+					 * that if all of the tuples went back into a batch but it
+					 * only exactly used the space_allowed, that the batch is
+					 * not a fallback batch yet but that the current stripe is
+					 * full, so if you need to allocate more, it would mark it
+					 * as a fallback batch. Otherwise, a batch 0 with no
+					 * tuples in spill files will still be treated as a
+					 * fallback batch during probing
+					 */
+					if (i == 0 && hashtable->batches[0].shared->size == pstate->space_allowed)
+					{
+						if (hashtable->batches[0].shared->ntuples == hashtable->batches[0].shared->old_ntuples)
+						{
+							hashtable->batches[0].shared->hashloop_fallback = true;
+							space_exhausted = false;
+						}
+					}
+					if (space_exhausted)
+						break;
 				}
 
-				/* Don't keep growing if it's not helping or we'd overflow. */
-				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+				/* Don't keep growing if we'd overflow. */
+				if (hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -1276,65 +1434,153 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 static void
 ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 {
+	ParallelHashJoinState *pstate;
+
+	ParallelHashJoinBatch *old_shared;
+	SharedTuplestoreAccessor *old_inner_batch0_sts;
+
 	dsa_pointer chunk_shared;
 	HashMemoryChunk chunk;
 
-	Assert(hashtable->nbatch == hashtable->parallel_state->nbatch);
+	ParallelHashJoinBatch *old_batches = (ParallelHashJoinBatch *) dsa_get_address(hashtable->area, hashtable->parallel_state->old_batches);
+
+	Assert(old_batches);
+	old_shared = NthParallelHashJoinBatch(old_batches, 0);
+	old_inner_batch0_sts = sts_attach(ParallelHashJoinBatchInner(old_shared), ParallelWorkerNumber + 1, &hashtable->parallel_state->fileset);
+
+	pstate = hashtable->parallel_state;
 
-	while ((chunk = ExecParallelHashPopChunkQueue(hashtable, &chunk_shared)))
+	Assert(hashtable->nbatch == hashtable->parallel_state->nbatch);
+	BarrierAttach(&pstate->repartition_barrier);
+	switch (PHJ_REPARTITION_BATCH0_PHASE(BarrierPhase(&pstate->repartition_barrier)))
 	{
-		size_t		idx = 0;
+		case PHJ_REPARTITION_BATCH0_DRAIN_QUEUE:
+			while ((chunk = ExecParallelHashPopChunkQueue(hashtable, &chunk_shared)))
+			{
+				MinimalTuple tuple;
+				size_t		idx = 0;
 
-		/* Repartition all tuples in this chunk. */
-		while (idx < chunk->used)
-		{
-			HashJoinTuple hashTuple = (HashJoinTuple) (HASH_CHUNK_DATA(chunk) + idx);
-			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
-			HashJoinTuple copyTuple;
-			dsa_pointer shared;
-			int			bucketno;
-			int			batchno;
+				/*
+				 * Repartition all tuples in this chunk. These tuples may be
+				 * relocated to a batch file or may be inserted back into
+				 * memory.
+				 */
+				while (idx < chunk->used)
+				{
+					HashJoinTuple hashTuple = (HashJoinTuple) (HASH_CHUNK_DATA(chunk) + idx);
 
-			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
-									  &bucketno, &batchno);
+					tuple = HJTUPLE_MINTUPLE(hashTuple);
 
-			Assert(batchno < hashtable->nbatch);
-			if (batchno == 0)
-			{
-				/* It still belongs in batch 0.  Copy to a new chunk. */
-				copyTuple =
-					ExecParallelHashTupleAlloc(hashtable,
-											   HJTUPLE_OVERHEAD + tuple->t_len,
-											   &shared);
-				copyTuple->hashvalue = hashTuple->hashvalue;
-				memcpy(HJTUPLE_MINTUPLE(copyTuple), tuple, tuple->t_len);
-				ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
-										  copyTuple, shared);
+					ExecParallelHashRepartitionBatch0Tuple(hashtable,
+														   tuple,
+														   hashTuple->hashvalue);
+
+					idx += MAXALIGN(HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(hashTuple)->t_len);
+				}
+
+				dsa_free(hashtable->area, chunk_shared);
+				CHECK_FOR_INTERRUPTS();
 			}
-			else
+			BarrierArriveAndWait(&pstate->repartition_barrier, WAIT_EVENT_HASH_REPARTITION_BATCH0_DRAIN_QUEUE);
+			/* FALLTHROUGH */
+		case PHJ_REPARTITION_BATCH0_DRAIN_SPILL_FILE:
 			{
-				size_t		tuple_size =
-				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+				MinimalTuple tuple;
+				tupleMetadata metadata;
 
-				/* It belongs in a later batch. */
-				hashtable->batches[batchno].estimated_size += tuple_size;
-				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+				/*
+				 * Repartition all of the tuples in this spill file. These
+				 * tuples may go back into the hashtable if space was freed up
+				 * or they may go into another batch or they may go into the
+				 * batch 0 spill file.
+				 */
+				sts_begin_parallel_scan(old_inner_batch0_sts);
+				while ((tuple = sts_parallel_scan_next(old_inner_batch0_sts,
+													   &metadata.hashvalue)))
+				{
+
+					ExecParallelHashRepartitionBatch0Tuple(hashtable,
+														   tuple,
+														   metadata.hashvalue);
+				}
+				sts_end_parallel_scan(old_inner_batch0_sts);
 			}
+	}
+	BarrierArriveAndDetach(&pstate->repartition_barrier);
+}
 
-			/* Count this tuple. */
-			++hashtable->batches[0].old_ntuples;
-			++hashtable->batches[batchno].ntuples;
+static void
+ExecParallelHashRepartitionBatch0Tuple(HashJoinTable hashtable,
+									   MinimalTuple tuple,
+									   uint32 hashvalue)
+{
+	int			batchno;
+	int			bucketno;
+	dsa_pointer shared;
+	HashJoinTuple copyTuple;
+	ParallelHashJoinState *pstate = hashtable->parallel_state;
+	bool		spill = true;
+	bool		hashtable_full = hashtable->batches[0].shared->size >= pstate->space_allowed;
+	size_t		tuple_size =
+	MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
-			idx += MAXALIGN(HJTUPLE_OVERHEAD +
-							HJTUPLE_MINTUPLE(hashTuple)->t_len);
+	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
+
+	/*
+	 * We don't take a lock to read pstate->space_allowed because it should
+	 * not change during execution of the hash join
+	 */
+
+	Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
+	if (batchno == 0 && !hashtable_full)
+	{
+		copyTuple = ExecParallelHashTupleAlloc(hashtable,
+											   HJTUPLE_OVERHEAD + tuple->t_len,
+											   &shared);
+
+		/*
+		 * TODO: do we need to check if growth was set to
+		 * PHJ_GROWTH_SPILL_BATCH0?
+		 */
+		if (copyTuple)
+		{
+			/* Store the hash value in the HashJoinTuple header. */
+			copyTuple->hashvalue = hashvalue;
+			memcpy(HJTUPLE_MINTUPLE(copyTuple), tuple, tuple->t_len);
+
+			/* Push it onto the front of the bucket's list */
+			ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
+									  copyTuple, shared);
+			pg_atomic_add_fetch_u64(&hashtable->batches[0].shared->ntuples_in_memory, 1);
+
+			spill = false;
 		}
+	}
 
-		/* Free this chunk. */
-		dsa_free(hashtable->area, chunk_shared);
+	if (spill)
+	{
 
-		CHECK_FOR_INTERRUPTS();
+		tupleMetadata metadata;
+
+		ParallelHashJoinBatchAccessor *batch_accessor = &(hashtable->batches[batchno]);
+
+		/*
+		 * It is okay to use backend local here because force spill tuple is
+		 * only done during repartitioning when we can't grow batches so won't
+		 * make decision based on it and will merge counters during deciding
+		 * and during evictbatch0 which can ony be done on a batch that is
+		 * already fallback so we won't make decision on it and will merge
+		 * counters after the build phase
+		 */
+		batch_accessor->estimated_size += tuple_size;
+		metadata.hashvalue = hashvalue;
+
+		sts_puttuple(batch_accessor->inner_tuples,
+					 &metadata,
+					 tuple);
 	}
+	++hashtable->batches[batchno].ntuples;
+	++hashtable->batches[0].old_ntuples;
 }
 
 /*
@@ -1371,24 +1617,41 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i],
+											   &hashvalue)))
 		{
-			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
+			size_t		tuple_size;
+			tupleMetadata metadata;
+			ParallelHashJoinBatchAccessor *batch_accessor;
+
 
 			/* Decide which partition it goes to in the new generation. */
 			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
 									  &batchno);
 
-			hashtable->batches[batchno].estimated_size += tuple_size;
-			++hashtable->batches[batchno].ntuples;
-			++hashtable->batches[i].old_ntuples;
+			tuple_size =
+				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
-			/* Store the tuple its new batch. */
-			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+			batch_accessor = &(hashtable->batches[batchno]);
 
+			/*
+			 * It is okay to use backend local here because force spill tuple
+			 * is only done during repartitioning when we can't grow batches
+			 * so won't make decision based on it and will merge counters
+			 * during deciding and during evictbatch0 which can ony be done on
+			 * a batch that is already fallback so we won't make decision on
+			 * it and will merge counters after the build phase
+			 */
+			batch_accessor->estimated_size += tuple_size;
+			metadata.hashvalue = hashvalue;
+
+			sts_puttuple(batch_accessor->inner_tuples,
+						 &metadata,
+						 tuple);
+			++hashtable->batches[batchno].ntuples;
+			++hashtable->batches[i].old_ntuples;
 			CHECK_FOR_INTERRUPTS();
 		}
 		sts_end_parallel_scan(old_inner_tuples[i]);
@@ -1705,7 +1968,7 @@ retry:
 		hashTuple = ExecParallelHashTupleAlloc(hashtable,
 											   HJTUPLE_OVERHEAD + tuple->t_len,
 											   &shared);
-		if (hashTuple == NULL)
+		if (!hashTuple)
 			goto retry;
 
 		/* Store the hash value in the HashJoinTuple header. */
@@ -1715,10 +1978,13 @@ retry:
 		/* Push it onto the front of the bucket's list */
 		ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
 								  hashTuple, shared);
+		pg_atomic_add_fetch_u64(&hashtable->batches[0].shared->ntuples_in_memory, 1);
+
 	}
 	else
 	{
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+		tupleMetadata metadata;
 
 		Assert(batchno > 0);
 
@@ -1731,7 +1997,11 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
+
+		metadata.hashvalue = hashvalue;
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples,
+					 &metadata,
 					 tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
@@ -1746,10 +2016,11 @@ retry:
  * to other batches or to run out of memory, and should only be called with
  * tuples that belong in the current batch once growth has been disabled.
  */
-void
+MinimalTuple
 ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
 										TupleTableSlot *slot,
-										uint32 hashvalue)
+										uint32 hashvalue,
+										int read_participant)
 {
 	bool		shouldFree;
 	MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
@@ -1758,19 +2029,26 @@ ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
 	int			batchno;
 	int			bucketno;
 
+
 	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
 	Assert(batchno == hashtable->curbatch);
+
 	hashTuple = ExecParallelHashTupleAlloc(hashtable,
 										   HJTUPLE_OVERHEAD + tuple->t_len,
 										   &shared);
+	if (!hashTuple)
+		return NULL;
+
 	hashTuple->hashvalue = hashvalue;
 	memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 	HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
 	ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
 							  hashTuple, shared);
+	pg_atomic_add_fetch_u64(&hashtable->batches[hashtable->curbatch].shared->ntuples_in_memory, 1);
 
 	if (shouldFree)
 		heap_free_minimal_tuple(tuple);
+	return tuple;
 }
 
 /*
@@ -2602,6 +2880,12 @@ ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt)
 		pcxt->nworkers * sizeof(HashInstrumentation);
 	node->shared_info = (SharedHashInfo *) shm_toc_allocate(pcxt->toc, size);
 
+	/*
+	 * TODO: the linked list which is being used for fallback stats needs
+	 * space allocated for it in shared memory as well. For now, it seems to
+	 * be coincidentally working
+	 */
+
 	/* Each per-worker area must start out as zeroes. */
 	memset(node->shared_info, 0, size);
 
@@ -2701,6 +2985,11 @@ ExecHashAccumInstrumentation(HashInstrumentation *instrument,
 									  hashtable->nbatch_original);
 	instrument->space_peak = Max(instrument->space_peak,
 								 hashtable->spacePeak);
+
+	/*
+	 * TODO: this doesn't work right now in case of rescan (doesn't get max)
+	 */
+	instrument->fallback_batches_stats = hashtable->fallback_batches_stats;
 }
 
 /*
@@ -2775,6 +3064,146 @@ dense_alloc(HashJoinTable hashtable, Size size)
 	return ptr;
 }
 
+/*
+ * Assume caller has a lock or is behind a barrier and has the right
+ * to change these values
+ */
+inline void
+ExecParallelHashTableRecycle(HashJoinTable hashtable)
+{
+	ParallelHashJoinBatchAccessor *batch_accessor = &(hashtable->batches[hashtable->curbatch]);
+	ParallelHashJoinBatch *batch = batch_accessor->shared;
+
+	dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
+	dsa_get_address(hashtable->area, batch->buckets);
+
+	for (size_t i = 0; i < hashtable->nbuckets; ++i)
+		dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+	batch->size = 0;
+	batch->space_exhausted = false;
+
+	/*
+	 * TODO: I'm not sure that we want to reset this when this function is
+	 * called to recycle the hashtable during the build stage as part of
+	 * evicting batch 0. It seems like it would be okay since a worker does
+	 * not have the right to over-allocate now. So, for a fallback batch,
+	 * at_least_one_chunk doesn't matter It seems like it may not matter at
+	 * all anymore...
+	 */
+	batch_accessor->at_least_one_chunk = false;
+	pg_atomic_exchange_u64(&batch->ntuples_in_memory, 0);
+}
+
+/*
+ * The eviction phase machine is responsible for evicting tuples from the
+ * hashtable during the Build stage of executing a parallel-aware parallel
+ * hash join. After increasing the number of batches in
+ * ExecParallelHashIncreaseNumBatches(), in the PHJ_GROW_BATCHES_DECIDING
+ * phase, if the batch 0 hashtable meets the criteria for falling back
+ * and is marked a fallback batch, the next time an inserted tuple would
+ * exceed the space_allowed, instead, trigger an eviction. Evict all
+ * batch 0 tuples to spill files in batch 0 inner side SharedTuplestore.
+ */
+static void
+ExecParallelHashTableEvictBatch0(HashJoinTable hashtable)
+{
+
+	ParallelHashJoinState *pstate = hashtable->parallel_state;
+	ParallelHashJoinBatchAccessor *batch0_accessor = &(hashtable->batches[0]);
+
+	/*
+	 * No other workers must be inserting tuples into the hashtable once
+	 * growth has been set to PHJ_EVICT. Otherwise, the below will not work
+	 * correctly. This should be okay since the same assumptions are made in
+	 * the increase batch machine.
+	 */
+	BarrierAttach(&pstate->eviction_barrier);
+	switch (PHJ_EVICT_PHASE(BarrierPhase(&pstate->eviction_barrier)))
+	{
+		case PHJ_EVICT_ELECTING:
+			if (BarrierArriveAndWait(&pstate->eviction_barrier, WAIT_EVENT_HASH_EVICT_ELECT))
+			{
+				pstate->chunk_work_queue = batch0_accessor->shared->chunks;
+				batch0_accessor->shared->chunks = InvalidDsaPointer;
+				ExecParallelHashTableRecycle(hashtable);
+			}
+			/* FALLTHROUGH */
+		case PHJ_EVICT_RESETTING:
+			BarrierArriveAndWait(&pstate->eviction_barrier, WAIT_EVENT_HASH_EVICT_RESET);
+			/* FALLTHROUGH */
+		case PHJ_EVICT_SPILLING:
+			{
+				dsa_pointer chunk_shared;
+				HashMemoryChunk chunk;
+
+				/*
+				 * TODO: Do I need to do this here? am I guaranteed to have
+				 * the correct shared memory reference to the batches array
+				 * already?
+				 */
+				ParallelHashJoinBatch *batches;
+				ParallelHashJoinBatch *batch0;
+
+				batches = (ParallelHashJoinBatch *)
+					dsa_get_address(hashtable->area, pstate->batches);
+				batch0 = NthParallelHashJoinBatch(batches, 0);
+				Assert(batch0 == hashtable->batches[0].shared);
+
+				ExecParallelHashTableSetCurrentBatch(hashtable, 0);
+
+				while ((chunk = ExecParallelHashPopChunkQueue(hashtable, &chunk_shared)))
+				{
+					size_t		idx = 0;
+
+					while (idx < chunk->used)
+					{
+						tupleMetadata metadata;
+
+						size_t		tuple_size;
+						MinimalTuple minTuple;
+						HashJoinTuple hashTuple = (HashJoinTuple) (HASH_CHUNK_DATA(chunk) + idx);
+
+						minTuple = HJTUPLE_MINTUPLE(hashTuple);
+
+						tuple_size =
+							MAXALIGN(HJTUPLE_OVERHEAD + minTuple->t_len);
+
+						/*
+						 * It is okay to use backend local here because can
+						 * ony be done on a batch that is already fallback so
+						 * we won't make decision on it and will merge
+						 * counters after the build phase
+						 */
+						batch0_accessor->estimated_size += tuple_size;
+						metadata.hashvalue = hashTuple->hashvalue;
+
+						sts_puttuple(batch0_accessor->inner_tuples,
+									 &metadata,
+									 minTuple);
+
+						idx += MAXALIGN(HJTUPLE_OVERHEAD +
+										HJTUPLE_MINTUPLE(hashTuple)->t_len);
+					}
+					dsa_free(hashtable->area, chunk_shared);
+
+					CHECK_FOR_INTERRUPTS();
+				}
+				BarrierArriveAndWait(&pstate->eviction_barrier, WAIT_EVENT_HASH_EVICT_SPILL);
+			}
+			/* FALLTHROUGH */
+		case PHJ_EVICT_FINISHING:
+
+			/*
+			 * TODO: Is this phase needed?
+			 */
+			if (BarrierArriveAndWait(&pstate->eviction_barrier, WAIT_EVENT_HASH_EVICT_FINISH))
+				pstate->growth = PHJ_GROWTH_OK;
+			/* FALLTHROUGH */
+		case PHJ_EVICT_DONE:
+			BarrierArriveAndDetach(&pstate->eviction_barrier);
+	}
+}
+
 /*
  * Allocate space for a tuple in shared dense storage.  This is equivalent to
  * dense_alloc but for Parallel Hash using shared memory.
@@ -2787,7 +3216,8 @@ dense_alloc(HashJoinTable hashtable, Size size)
  * possibility that the tuple no longer belongs in the same batch).
  */
 static HashJoinTuple
-ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
+ExecParallelHashTupleAlloc(HashJoinTable hashtable,
+						   size_t size,
 						   dsa_pointer *shared)
 {
 	ParallelHashJoinState *pstate = hashtable->parallel_state;
@@ -2828,7 +3258,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 	 * Check if we need to help increase the number of buckets or batches.
 	 */
 	if (pstate->growth == PHJ_GROWTH_NEED_MORE_BATCHES ||
-		pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
+		pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS ||
+		pstate->growth == PHJ_GROWTH_SPILL_BATCH0)
 	{
 		ParallelHashGrowth growth = pstate->growth;
 
@@ -2840,6 +3271,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 			ExecParallelHashIncreaseNumBatches(hashtable);
 		else if (growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
 			ExecParallelHashIncreaseNumBuckets(hashtable);
+		else if (growth == PHJ_GROWTH_SPILL_BATCH0)
+			ExecParallelHashTableEvictBatch0(hashtable);
 
 		/* The caller must retry. */
 		return NULL;
@@ -2852,7 +3285,7 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		chunk_size = HASH_CHUNK_SIZE;
 
 	/* Check if it's time to grow batches or buckets. */
-	if (pstate->growth != PHJ_GROWTH_DISABLED)
+	if (pstate->growth != PHJ_GROWTH_DISABLED && pstate->growth != PHJ_GROWTH_LOADING)
 	{
 		Assert(curbatch == 0);
 		Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASHING_INNER);
@@ -2861,16 +3294,26 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		 * Check if our space limit would be exceeded.  To avoid choking on
 		 * very large tuples or very low hash_mem setting, we'll always allow
 		 * each backend to allocate at least one chunk.
+		 *
+		 * If the batch has already been marked to fall back, then we don't
+		 * need to worry about having allocated one chunk -- we should start
+		 * evicting tuples.
 		 */
-		if (hashtable->batches[0].at_least_one_chunk &&
-			hashtable->batches[0].shared->size +
+		LWLockAcquire(&hashtable->batches[0].shared->lock, LW_EXCLUSIVE);
+		if (hashtable->batches[0].shared->size +
 			chunk_size > pstate->space_allowed)
 		{
-			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
-			hashtable->batches[0].shared->space_exhausted = true;
-			LWLockRelease(&pstate->lock);
-
-			return NULL;
+			if (hashtable->batches[0].shared->hashloop_fallback || hashtable->batches[0].at_least_one_chunk)
+			{
+				if (hashtable->batches[0].shared->hashloop_fallback)
+					pstate->growth = PHJ_GROWTH_SPILL_BATCH0;
+				else if (hashtable->batches[0].at_least_one_chunk)
+					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
+				hashtable->batches[0].shared->space_exhausted = true;
+				LWLockRelease(&pstate->lock);
+				LWLockRelease(&hashtable->batches[0].shared->lock);
+				return NULL;
+			}
 		}
 
 		/* Check if our load factor limit would be exceeded. */
@@ -2887,14 +3330,60 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 			{
 				pstate->growth = PHJ_GROWTH_NEED_MORE_BUCKETS;
 				LWLockRelease(&pstate->lock);
+				LWLockRelease(&hashtable->batches[0].shared->lock);
 
 				return NULL;
 			}
 		}
+		LWLockRelease(&hashtable->batches[0].shared->lock);
 	}
 
+	/*
+	 * TODO: should I care about hashtable->batches[b].at_least_one_chunk
+	 * here?
+	 */
+	if (pstate->growth == PHJ_GROWTH_LOADING)
+	{
+		int			b = hashtable->curbatch;
+
+		LWLockAcquire(&hashtable->batches[b].shared->lock, LW_EXCLUSIVE);
+		if (hashtable->batches[b].shared->hashloop_fallback &&
+			(hashtable->batches[b].shared->space_exhausted ||
+			 hashtable->batches[b].shared->size + chunk_size > pstate->space_allowed))
+		{
+			bool		space_exhausted = hashtable->batches[b].shared->space_exhausted;
+
+			if (!space_exhausted)
+				hashtable->batches[b].shared->space_exhausted = true;
+			LWLockRelease(&pstate->lock);
+			LWLockRelease(&hashtable->batches[b].shared->lock);
+			return NULL;
+		}
+		LWLockRelease(&hashtable->batches[b].shared->lock);
+	}
+
+	/*
+	 * If not even one chunk would fit in the space_allowed, there isn't
+	 * anything we can do to avoid exceeding space_allowed. Also, if we keep
+	 * the rule that a backend should be allowed to allocate at least one
+	 * chunk, then we will end up tripping this assert some of the time unless
+	 * we make that exception (should we make that exception?) TODO: should
+	 * memory settings < chunk_size even be allowed. Should it error out?
+	 * should we be able to make this assertion?
+	 * Assert(hashtable->batches[hashtable->curbatch].shared->size +
+	 * chunk_size <= pstate->space_allowed);
+	 */
+
 	/* We are cleared to allocate a new chunk. */
 	chunk_shared = dsa_allocate(hashtable->area, chunk_size);
+
+	/*
+	 * The chunk is accounted for in the hashtable size only. Even though
+	 * batch 0 can spill, we don't need to track this allocated chunk in the
+	 * estimated_stripe_size member because we check the size member when
+	 * determining if the hashtable is too big, and, we will only ever number
+	 * stripes (starting with 1 instead of 0 for batch 0) in the spill file.
+	 */
 	hashtable->batches[curbatch].shared->size += chunk_size;
 	hashtable->batches[curbatch].at_least_one_chunk = true;
 
@@ -2964,21 +3453,40 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
+
+		shared->hashloop_fallback = false;
+		pg_atomic_init_flag(&shared->overflow_required);
+		pg_atomic_init_u64(&shared->ntuples_in_memory, 0);
+		/* TODO: is it okay to use the same tranche for this lock? */
+		LWLockInitialize(&shared->lock, LWTRANCHE_PARALLEL_HASH_JOIN);
+		shared->nstripes = 0;
 
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
 		 */
 		BarrierInit(&shared->batch_barrier, 0);
+		BarrierInit(&shared->stripe_barrier, 0);
+
+		/* Batch 0 doesn't need to be loaded. */
 		if (i == 0)
 		{
-			/* Batch 0 doesn't need to be loaded. */
+			shared->nstripes = 1;
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBING)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_STRIPING)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
+
+			BarrierAttach(&shared->stripe_barrier);
+			while (BarrierPhase(&shared->stripe_barrier) < PHJ_STRIPE_PROBING)
+				BarrierArriveAndWait(&shared->stripe_barrier, 0);
+			BarrierDetach(&shared->stripe_barrier);
 		}
+		/* why isn't done initialized here ? */
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 
 		/* Initialize accessor state.  All members were zero-initialized. */
 		accessor->shared = shared;
@@ -2989,7 +3497,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -2999,10 +3507,14 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		/* Use the same SharedFileset for the SharedTupleStore and SharedBits */
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->fileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3051,8 +3563,8 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	 * It's possible for a backend to start up very late so that the whole
 	 * join is finished and the shm state for tracking batches has already
 	 * been freed by ExecHashTableDetach().  In that case we'll just leave
-	 * hashtable->batches as NULL so that ExecParallelHashJoinNewBatch() gives
-	 * up early.
+	 * hashtable->batches as NULL so that ExecParallelHashJoinAdvanceBatch()
+	 * gives up early.
 	 */
 	if (!DsaPointerIsValid(pstate->batches))
 		return;
@@ -3074,10 +3586,11 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
-		accessor->done = false;
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
@@ -3087,6 +3600,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->fileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3169,6 +3683,18 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 	}
 }
 
+bool
+ExecHashTableDetachStripe(HashJoinTable hashtable)
+{
+	int			curbatch = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+
+	BarrierDetach(stripe_barrier);
+	hashtable->curstripe = STRIPE_DETACHED;
+	return false;
+}
+
 /*
  * Detach from all shared resources.  If we are last to detach, clean up.
  */
@@ -3326,7 +3852,6 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 	ParallelHashJoinBatchAccessor *batch = &hashtable->batches[batchno];
 	size_t		want = Max(size, HASH_CHUNK_SIZE - HASH_CHUNK_HEADER_SIZE);
 
-	Assert(batchno > 0);
 	Assert(batchno < hashtable->nbatch);
 	Assert(size == MAXALIGN(size));
 
@@ -3334,7 +3859,8 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 
 	/* Has another participant commanded us to help grow? */
 	if (pstate->growth == PHJ_GROWTH_NEED_MORE_BATCHES ||
-		pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
+		pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS ||
+		pstate->growth == PHJ_GROWTH_SPILL_BATCH0)
 	{
 		ParallelHashGrowth growth = pstate->growth;
 
@@ -3343,18 +3869,21 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 			ExecParallelHashIncreaseNumBatches(hashtable);
 		else if (growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
 			ExecParallelHashIncreaseNumBuckets(hashtable);
+		else if (growth == PHJ_GROWTH_SPILL_BATCH0)
+			ExecParallelHashTableEvictBatch0(hashtable);
 
 		return false;
 	}
 
 	if (pstate->growth != PHJ_GROWTH_DISABLED &&
 		batch->at_least_one_chunk &&
-		(batch->shared->estimated_size + want + HASH_CHUNK_HEADER_SIZE
-		 > pstate->space_allowed))
+		(batch->shared->estimated_size + want + HASH_CHUNK_HEADER_SIZE > pstate->space_allowed) &&
+		!batch->shared->hashloop_fallback)
 	{
 		/*
 		 * We have determined that this batch would exceed the space budget if
-		 * loaded into memory.  Command all participants to help repartition.
+		 * loaded into memory.  It is also not yet marked as a fallback batch.
+		 * Command all participants to help repartition.
 		 */
 		batch->shared->space_exhausted = true;
 		pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 5532b91a71..eb67aceebb 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -92,6 +92,27 @@
  * hash_mem of all participants to create a large shared hash table.  If that
  * turns out either at planning or execution time to be impossible then we
  * fall back to regular hash_mem sized hash tables.
+ * If a given batch causes the number of batches to be doubled and data skew
+ * causes too few or too many tuples to be relocated to the child of this batch,
+ * the batch which is now home to the skewed tuples is marked as a "fallback"
+ * batch. This means that it will be processed using multiple loops --
+ * each loop probing an arbitrary stripe of tuples from this batch
+ * which fit in hash_mem or combined hash_mem.
+ * This batch is no longer permitted to cause growth in the number of batches.
+ *
+ * When the inner side of a fallback batch is loaded into memory, stripes of
+ * arbitrary tuples totaling hash_mem or combined hash_mem in size are loaded
+ * into the hashtable. After probing this stripe, the outer side batch is
+ * rewound and the next stripe is loaded. Each stripe of the inner batch is
+ * probed until all tuples from that batch have been processed.
+ *
+ * Tuples that match are emitted (depending on the join semantics of the
+ * particular join type) during probing of the stripe. However, in order to make
+ * left outer join work, unmatched tuples cannot be emitted NULL-extended until
+ * all stripes have been probed. To address this, a bitmap is created with a bit
+ * for each tuple of the outer side. If a tuple on the outer side matches a
+ * tuple from the inner, the corresponding bit is set. At the end of probing all
+ * stripes, the executor scans the bitmap and emits unmatched outer tuples.
  *
  * To avoid deadlocks, we never wait for any barrier unless it is known that
  * all other backends attached to it are actively executing the node or have
@@ -126,7 +147,7 @@
 #define HJ_SCAN_BUCKET			3
 #define HJ_FILL_OUTER_TUPLE		4
 #define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_NEED_NEW_STRIPE      6
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -143,10 +164,91 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
+static int	ExecHashJoinLoadStripe(HashJoinState *hjstate);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
+static bool ExecParallelHashJoinLoadStripe(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool checkbit(HashJoinState *hjstate);
+static void set_match_bit(HashJoinState *hjstate);
+
+static pg_attribute_always_inline bool
+			IsHashloopFallback(HashJoinTable hashtable);
+
+#define UINT_BITS (sizeof(unsigned int) * CHAR_BIT)
+
+static void
+set_match_bit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	BufFile    *statusFile = hashtable->hashloopBatchFile[hashtable->curbatch];
+	int			tupindex = hjstate->hj_CurNumOuterTuples - 1;
+	size_t		unit_size = sizeof(hjstate->hj_CurOuterMatchStatus);
+	off_t		offset = tupindex / UINT_BITS * unit_size;
+
+	int			fileno;
+	off_t		cursor;
+
+	BufFileTell(statusFile, &fileno, &cursor);
+
+	/* Extend the statusFile if this is stripe zero. */
+	if (hashtable->curstripe == 0)
+	{
+		for (; cursor < offset + unit_size; cursor += unit_size)
+		{
+			hjstate->hj_CurOuterMatchStatus = 0;
+			BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+		}
+	}
+
+	if (cursor != offset)
+		BufFileSeek(statusFile, 0, offset, SEEK_SET);
+
+	BufFileRead(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+	BufFileSeek(statusFile, 0, -unit_size, SEEK_CUR);
+
+	hjstate->hj_CurOuterMatchStatus |= 1U << tupindex % UINT_BITS;
+	BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+}
 
+/* return true if bit is set and false if not */
+static bool
+checkbit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *outer_match_statuses;
+
+	int			bitno = hjstate->hj_EmitOuterTupleId % UINT_BITS;
+
+	hjstate->hj_EmitOuterTupleId++;
+	outer_match_statuses = hjstate->hj_HashTable->hashloopBatchFile[curbatch];
+
+	/*
+	 * if current chunk of bitmap is exhausted, read next chunk of bitmap from
+	 * outer_match_status_file
+	 */
+	if (bitno == 0)
+		BufFileRead(outer_match_statuses, &hjstate->hj_CurOuterMatchStatus,
+					sizeof(hjstate->hj_CurOuterMatchStatus));
+
+	/*
+	 * check if current tuple's match bit is set in outer match status file
+	 */
+	return hjstate->hj_CurOuterMatchStatus & (1U << bitno);
+}
+
+static bool
+IsHashloopFallback(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state)
+		return hashtable->batches[hashtable->curbatch].shared->hashloop_fallback;
+
+	if (!hashtable->hashloopBatchFile)
+		return false;
+
+	return hashtable->hashloopBatchFile[hashtable->curbatch];
+}
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -290,6 +392,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				hashNode->hashtable = hashtable;
 				(void) MultiExecProcNode((PlanState *) hashNode);
 
+				/*
+				 * After building the hashtable, stripe 0 of batch 0 will have
+				 * been loaded.
+				 */
+				hashtable->curstripe = 0;
+
 				/*
 				 * If the inner relation is completely empty, and we're not
 				 * doing a left outer join, we can quit without scanning the
@@ -324,21 +432,21 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						 * If multi-batch, we need to hash the outer relation
 						 * up front.
 						 */
-						if (hashtable->nbatch > 1)
+						if (hashtable->nbatch > 1 || (hashtable->nbatch == 1 && hashtable->batches[0].shared->hashloop_fallback))
 							ExecParallelHashJoinPartitionOuter(node);
 						BarrierArriveAndWait(build_barrier,
 											 WAIT_EVENT_HASH_BUILD_HASH_OUTER);
+
 					}
 					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
 
 					/* Each backend should now select a batch to work on. */
 					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
 
-					continue;
+					if (!ExecParallelHashJoinNewBatch(node))
+						return NULL;
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
 				/* FALL THRU */
 
@@ -365,12 +473,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
 					}
 					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
+
+				/*
+				 * Don't reset hj_MatchedOuter after the first stripe as it
+				 * would cancel out whatever we found before
+				 */
+				if (node->hj_HashTable->curstripe == 0)
+					node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -386,9 +500,15 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * The tuple might not belong to the current batch (where
 				 * "current batch" includes the skew buckets if any).
+				 *
+				 * This should only be done once per tuple per batch. If a
+				 * batch "falls back", its inner side will be split into
+				 * stripes. Any displaced outer tuples should only be
+				 * relocated while probing the first stripe of the inner side.
 				 */
 				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO &&
+					node->hj_HashTable->curstripe == 0)
 				{
 					bool		shouldFree;
 					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
@@ -410,6 +530,13 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				/*
+				 * While probing the phantom stripe, don't increment
+				 * hj_CurNumOuterTuples or extend the bitmap
+				 */
+				if (!parallel && hashtable->curstripe != PHANTOM_STRIPE)
+					node->hj_CurNumOuterTuples++;
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -455,6 +582,25 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				{
 					node->hj_MatchedOuter = true;
 
+					if (HJ_FILL_OUTER(node) && IsHashloopFallback(hashtable))
+					{
+						/*
+						 * Each bit corresponds to a single tuple. Setting the
+						 * match bit keeps track of which tuples were matched
+						 * for batches which are using the block nested
+						 * hashloop fallback method. It persists this match
+						 * status across multiple stripes of tuples, each of
+						 * which is loaded into the hashtable and probed. The
+						 * outer match status file is the cumulative match
+						 * status of outer tuples for a given batch across all
+						 * stripes of that inner side batch.
+						 */
+						if (parallel)
+							sb_setbit(hashtable->batches[hashtable->curbatch].sba, econtext->ecxt_outertuple->tts_tuplenum);
+						else
+							set_match_bit(node);
+					}
+
 					if (parallel)
 					{
 						/*
@@ -488,8 +634,17 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					 * continue with next outer tuple.
 					 */
 					if (node->js.single_match)
+					{
 						node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+						/*
+						 * Only consider returning the tuple while on the
+						 * first stripe.
+						 */
+						if (node->hj_HashTable->curstripe != 0)
+							continue;
+					}
+
 					if (otherqual == NULL || ExecQual(otherqual, econtext))
 						return ExecProject(node->js.ps.ps_ProjInfo);
 					else
@@ -508,6 +663,22 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+				if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(node))
+				{
+					if (hashtable->curstripe != PHANTOM_STRIPE)
+						continue;
+
+					if (parallel)
+					{
+						ParallelHashJoinBatchAccessor *accessor =
+						&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+						node->hj_MatchedOuter = sb_checkbit(accessor->sba, econtext->ecxt_outertuple->tts_tuplenum);
+					}
+					else
+						node->hj_MatchedOuter = checkbit(node);
+				}
+
 				if (!node->hj_MatchedOuter &&
 					HJ_FILL_OUTER(node))
 				{
@@ -534,7 +705,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (!ExecScanHashTableForUnmatched(node, econtext))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
@@ -550,19 +721,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					InstrCountFiltered2(node, 1);
 				break;
 
-			case HJ_NEED_NEW_BATCH:
+			case HJ_NEED_NEW_STRIPE:
 
 				/*
-				 * Try to advance to next batch.  Done if there are no more.
+				 * Try to advance to next stripe. Then try to advance to the
+				 * next batch if there are no more stripes in this batch. Done
+				 * if there are no more batches.
 				 */
 				if (parallel)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
+					if (!ExecParallelHashJoinLoadStripe(node) &&
+						!ExecParallelHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-aware join */
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
+					if (!ExecHashJoinLoadStripe(node) &&
+						!ExecHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-oblivious join */
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
@@ -751,6 +926,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_JoinState = HJ_BUILD_HASHTABLE;
 	hjstate->hj_MatchedOuter = false;
 	hjstate->hj_OuterNotEmpty = false;
+	hjstate->hj_CurNumOuterTuples = 0;
+	hjstate->hj_CurOuterMatchStatus = 0;
 
 	return hjstate;
 }
@@ -890,10 +1067,16 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	/*
 	 * In the Parallel Hash case we only run the outer plan directly for
 	 * single-batch hash joins.  Otherwise we have to go to batch files, even
-	 * for batch 0.
+	 * for batch 0. For a single-batch hash join which, due to data skew, has
+	 * multiple stripes and is a "fallback" batch, we must still save the
+	 * outer tuples into batch files.
 	 */
-	if (curbatch == 0 && hashtable->nbatch == 1)
+	LWLockAcquire(&hashtable->batches[0].shared->lock, LW_SHARED);
+
+	if (curbatch == 0 && hashtable->nbatch == 1 && !hashtable->batches[0].shared->hashloop_fallback)
 	{
+		LWLockRelease(&hashtable->batches[0].shared->lock);
+
 		slot = ExecProcNode(outerNode);
 
 		while (!TupIsNull(slot))
@@ -917,21 +1100,36 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	}
 	else if (curbatch < hashtable->nbatch)
 	{
+
+		tupleMetadata metadata;
 		MinimalTuple tuple;
 
-		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+		LWLockRelease(&hashtable->batches[0].shared->lock);
+
+		tuple =
+			sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
+								   &metadata);
+		*hashvalue = metadata.hashvalue;
+
 		if (tuple != NULL)
 		{
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+
+			/*
+			 * TODO: should we use tupleid instead of position in the serial
+			 * case too?
+			 */
+			hjstate->hj_OuterTupleSlot->tts_tuplenum = metadata.tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
 		else
 			ExecClearTuple(hjstate->hj_OuterTupleSlot);
 	}
+	else
+		LWLockRelease(&hashtable->batches[0].shared->lock);
 
 	/* End of this batch */
 	return NULL;
@@ -949,24 +1147,37 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
+	BufFile    *innerFile = NULL;
+	BufFile    *outerFile = NULL;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
 
-	if (curbatch > 0)
+	/*
+	 * We no longer need the previous outer batch file; close it right away to
+	 * free disk space.
+	 */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
-		/*
-		 * We no longer need the previous outer batch file; close it right
-		 * away to free disk space.
-		 */
-		if (hashtable->outerBatchFile[curbatch])
-			BufFileClose(hashtable->outerBatchFile[curbatch]);
+		BufFileClose(hashtable->outerBatchFile[curbatch]);
 		hashtable->outerBatchFile[curbatch] = NULL;
 	}
-	else						/* we just finished the first batch */
+	if (IsHashloopFallback(hashtable))
+	{
+		BufFileClose(hashtable->hashloopBatchFile[curbatch]);
+		hashtable->hashloopBatchFile[curbatch] = NULL;
+	}
+
+	/*
+	 * We are surely done with the inner batch file now
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+	{
+		BufFileClose(hashtable->innerBatchFile[curbatch]);
+		hashtable->innerBatchFile[curbatch] = NULL;
+	}
+
+	if (curbatch == 0)			/* we just finished the first batch */
 	{
 		/*
 		 * Reset some of the skew optimization state variables, since we no
@@ -1030,55 +1241,168 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	hashtable->curstripe = STRIPE_DETACHED;
+	hjstate->hj_CurNumOuterTuples = 0;
 
-	/*
-	 * Reload the hash table with the new inner batch (which could be empty)
-	 */
-	ExecHashTableReset(hashtable);
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+		innerFile = hashtable->innerBatchFile[curbatch];
+
+	if (innerFile && BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	/* Need to rewind outer when this is the first stripe of a new batch */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
+		outerFile = hashtable->outerBatchFile[curbatch];
+
+	if (outerFile && BufFileSeek(outerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	ExecHashJoinLoadStripe(hjstate);
+	return true;
+}
 
-	innerFile = hashtable->innerBatchFile[curbatch];
+static inline void
+InstrIncrBatchStripes(List *fallback_batches_stats, int curbatch)
+{
+	ListCell   *lc;
 
-	if (innerFile != NULL)
+	foreach(lc, fallback_batches_stats)
 	{
-		if (BufFileSeek(innerFile, 0, 0L, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file")));
+		FallbackBatchStats *fallback_batch_stats = lfirst(lc);
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
-												 &hashvalue,
-												 hjstate->hj_HashTupleSlot)))
+		if (fallback_batch_stats->batchno == curbatch)
 		{
-			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
-			 */
-			ExecHashTableInsert(hashtable, slot, hashvalue);
+			fallback_batch_stats->numstripes++;
+			break;
 		}
-
-		/*
-		 * after we build the hash table, the inner batch file is no longer
-		 * needed
-		 */
-		BufFileClose(innerFile);
-		hashtable->innerBatchFile[curbatch] = NULL;
 	}
+}
+
+static inline void
+InstrAppendParallelBatchStripes(List **fallback_batches_stats, int curbatch, int nstripes)
+{
+	FallbackBatchStats *fallback_batch_stats;
+
+	fallback_batch_stats = palloc(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = curbatch;
+	/* Display the total number of stripes as a 1-indexed number */
+	fallback_batch_stats->numstripes = nstripes + 1;
+	*fallback_batches_stats = lappend(*fallback_batches_stats, fallback_batch_stats);
+}
+
+/*
+ * Returns false when the inner batch file is exhausted
+ */
+static int
+ExecHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+	bool		loaded_inner = false;
+
+	if (hashtable->curstripe == PHANTOM_STRIPE)
+		return false;
 
 	/*
 	 * Rewind outer batch file (if present), so that we can start reading it.
+	 * TODO: This is only necessary if this is not the first stripe of the
+	 * batch
 	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
 		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file")));
+					 errmsg("could not rewind hash-join temporary file: %m")));
+	}
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch] && hashtable->curbatch == 0 && hashtable->curstripe == 0)
+	{
+		if (BufFileSeek(hashtable->innerBatchFile[curbatch], 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
 
-	return true;
+	hashtable->curstripe++;
+
+	if (!hashtable->innerBatchFile || !hashtable->innerBatchFile[curbatch])
+		return false;
+
+	/*
+	 * Reload the hash table with the new inner stripe
+	 */
+	ExecHashTableReset(hashtable);
+
+	while ((slot = ExecHashJoinGetSavedTuple(hjstate,
+											 hashtable->innerBatchFile[curbatch],
+											 &hashvalue,
+											 hjstate->hj_HashTupleSlot)))
+	{
+		/*
+		 * NOTE: some tuples may be sent to future batches.  Also, it is
+		 * possible for hashtable->nbatch to be increased here!
+		 */
+		uint32		hashTupleSize;
+
+		/*
+		 * TODO: wouldn't it be cool if this returned the size of the tuple
+		 * inserted
+		 */
+		ExecHashTableInsert(hashtable, slot, hashvalue);
+		loaded_inner = true;
+
+		if (!IsHashloopFallback(hashtable))
+			continue;
+
+		hashTupleSize = slot->tts_ops->get_minimal_tuple(slot)->t_len + HJTUPLE_OVERHEAD;
+
+		if (hashtable->spaceUsed + hashTupleSize +
+			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
+			> hashtable->spaceAllowed)
+			break;
+	}
+
+	/*
+	 * if we didn't load anything and it is a FOJ/LOJ fallback batch, we will
+	 * transition to emit unmatched outer tuples next. we want to know how
+	 * many tuples were in the batch in that case, so don't zero it out then
+	 */
+
+	/*
+	 * if we loaded anything into the hashtable or it is the phantom stripe,
+	 * must proceed to probing
+	 */
+	if (loaded_inner)
+	{
+		hjstate->hj_CurNumOuterTuples = 0;
+		InstrIncrBatchStripes(hashtable->fallback_batches_stats, curbatch);
+		return true;
+	}
+
+	if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(hjstate))
+	{
+		/*
+		 * if we didn't load anything and it is a fallback batch, we will
+		 * prepare to emit outer tuples during the phantom stripe probing
+		 */
+		hashtable->curstripe = PHANTOM_STRIPE;
+		hjstate->hj_EmitOuterTupleId = 0;
+		hjstate->hj_CurOuterMatchStatus = 0;
+		BufFileSeek(hashtable->hashloopBatchFile[curbatch], 0, 0, SEEK_SET);
+		if (hashtable->outerBatchFile[curbatch])
+			BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET);
+		return true;
+	}
+	return false;
 }
 
+
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
  * false if there are no more batches.
@@ -1101,11 +1425,24 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
 	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_STRIPING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_STRIPING
+	 * case). The PHJ_BATCH_STRIPING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
 	 */
+
 	if (hashtable->curbatch >= 0)
 	{
-		hashtable->batches[hashtable->curbatch].done = true;
+		ParallelHashJoinBatchAccessor *batch_accessor = &hashtable->batches[hashtable->curbatch];
+
+		if (IsHashloopFallback(hashtable))
+		{
+			InstrAppendParallelBatchStripes(&hashtable->fallback_batches_stats, hashtable->curbatch, batch_accessor->shared->nstripes);
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+		}
+		batch_accessor->done = PHJ_BATCH_ACCESSOR_DONE;
 		ExecHashTableDetachBatch(hashtable);
 	}
 
@@ -1119,13 +1456,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		hashtable->nbatch;
 	do
 	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
-		if (!hashtable->batches[batchno].done)
+		if (hashtable->batches[batchno].done != PHJ_BATCH_ACCESSOR_DONE)
 		{
-			SharedTuplestoreAccessor *inner_tuples;
 			Barrier    *batch_barrier =
 			&hashtable->batches[batchno].shared->batch_barrier;
 
@@ -1136,7 +1468,15 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
 											 WAIT_EVENT_HASH_BATCH_ELECT))
+					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
+
+						/*
+						 * one worker needs to 0 out the read_pages of all the
+						 * participants in the new batch
+						 */
+						sts_reinitialize(hashtable->batches[batchno].inner_tuples);
+					}
 					/* Fall through. */
 
 				case PHJ_BATCH_ALLOCATING:
@@ -1145,41 +1485,31 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 										 WAIT_EVENT_HASH_BATCH_ALLOCATE);
 					/* Fall through. */
 
-				case PHJ_BATCH_LOADING:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOAD);
-					/* Fall through. */
+				case PHJ_BATCH_STRIPING:
 
-				case PHJ_BATCH_PROBING:
+					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
+					sts_begin_parallel_scan(hashtable->batches[batchno].inner_tuples);
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_initialize_accessor(hashtable->batches[hashtable->curbatch].sba,
+											   sts_get_tuplenum(hashtable->batches[hashtable->curbatch].outer_tuples));
+					hashtable->curstripe = STRIPE_DETACHED;
+					if (ExecParallelHashJoinLoadStripe(hjstate))
+						return true;
 
 					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually call
-					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
+					 * ExecParallelHashJoinLoadStripe() will return false from
+					 * here when no more work can be done by this worker on
+					 * this batch. Until further optimized, this worker will
+					 * have detached from the stripe_barrier and should close
+					 * its outer match statuses bitmap and then detach from
+					 * the batch. In order to reuse the code below, fall
+					 * through, even though the phase will not have been
+					 * advanced
 					 */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
-					return true;
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_end_write(hashtable->batches[batchno].sba);
+
+					/* Fall through. */
 
 				case PHJ_BATCH_DONE:
 
@@ -1187,8 +1517,16 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * Already done.  Detach and go around again (if any
 					 * remain).
 					 */
+
+					/*
+					 * In case the leader joins late, we have to make sure
+					 * that all workers have the final number of stripes.
+					 */
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						InstrAppendParallelBatchStripes(&hashtable->fallback_batches_stats, batchno, hashtable->batches[batchno].shared->nstripes);
 					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
+					hashtable->batches[batchno].done = PHJ_BATCH_ACCESSOR_DONE;
+
 					hashtable->curbatch = -1;
 					break;
 
@@ -1203,6 +1541,244 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	return false;
 }
 
+
+
+/*
+ * Returns true if ready to probe and false if the inner is exhausted
+ * (there are no more stripes)
+ */
+bool
+ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			batchno = hashtable->curbatch;
+	ParallelHashJoinBatchAccessor *batch_accessor = &(hashtable->batches[batchno]);
+	ParallelHashJoinBatch *batch = batch_accessor->shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	if (hashtable->curstripe >= 0)
+	{
+		/*
+		 * If a worker is already attached to a stripe, wait until all
+		 * participants have finished probing and detach. The last worker,
+		 * however, can re-attach to the stripe_barrier and proceed to load
+		 * and probe the other stripes
+		 *
+		 * After finishing with participating in a stripe, if a worker is the
+		 * only one working on a batch, it will continue working on it.
+		 * However, if a worker is not the only worker working on a batch, it
+		 * would risk deadlock if it waits on the barrier. Instead, it will
+		 * detach from the stripe, and, eventually the batch.
+		 *
+		 * This means all stripes after the first stripe will be executed
+		 * serially. TODO: allow workers to provisionally detach from the
+		 * batch and reattach later if there is still work to be done. I had a
+		 * patch that did this. Workers who were not the last worker saved the
+		 * state of the stripe barrier upon detaching and then mark the batch
+		 * as "provisionally" done (not done). Later, when the worker comes
+		 * back to the batch in the batch phase machine, if the batch is not
+		 * complete and the phase has advanced since the worker was last
+		 * participating, then the worker can join back in. This had problems.
+		 * There were synchronization issues with workers having multiple
+		 * outer match status bitmap files open at the same time, so, I had
+		 * workers close their bitmap and make a new one the next time they
+		 * joined in. This didn't work with the current code because the
+		 * original outer match status bitmap file that the worker had created
+		 * while probing stripe 1 did not get combined into the combined
+		 * bitmap This could be specifically fixed, but I think it is better
+		 * to address the lack of parallel execution for stripes after stripe
+		 * 0 more holistically.
+		 */
+		if (!BarrierArriveAndDetach(stripe_barrier))
+		{
+			sb_end_write(batch_accessor->sba);
+			hashtable->curstripe = STRIPE_DETACHED;
+			return false;
+		}
+
+		/*
+		 * This isn't a race condition if no other workers can stay attached
+		 * to this barrier in the intervening time. Basically, if you attach
+		 * to a stripe barrier in the PHJ_STRIPE_DONE phase, detach
+		 * immediately and move on.
+		 */
+		BarrierAttach(stripe_barrier);
+	}
+	else if (hashtable->curstripe == STRIPE_DETACHED)
+	{
+		int			phase = BarrierAttach(stripe_barrier);
+
+		/*
+		 * If a worker enters this phase machine for the first time for this
+		 * batch on a stripe number greater than the batch's maximum stripe
+		 * number, then: 1) The batch is done, or 2) The batch is on the
+		 * phantom stripe that's used for hashloop fallback. Either way the
+		 * worker can't contribute, so it will just detach and move on.
+		 */
+		if (PHJ_STRIPE_NUMBER(phase) > batch->nstripes ||
+			PHJ_STRIPE_PHASE(phase) == PHJ_STRIPE_DONE)
+			return ExecHashTableDetachStripe(hashtable);
+	}
+	else if (hashtable->curstripe == PHANTOM_STRIPE)
+	{
+		/* Only the last worker will execute this code. */
+		sts_end_parallel_scan(outer_tuples);
+
+		/*
+		 * TODO: ideally this would go somewhere in the batch phase machine
+		 * Putting it in ExecHashTableDetachBatch didn't do the trick
+		 */
+		sb_end_read(batch_accessor->sba);
+		return ExecHashTableDetachStripe(hashtable);
+	}
+
+	hashtable->curstripe = PHJ_STRIPE_NUMBER(BarrierPhase(stripe_barrier));
+
+	/*
+	 * The outer side is exhausted and either 1) the current stripe of the
+	 * inner side is exhausted and it is time to advance the stripe 2) the
+	 * last stripe of the inner side is exhausted and it is time to advance
+	 * the batch
+	 */
+	for (;;)
+	{
+		MinimalTuple tuple;
+		tupleMetadata metadata;
+
+		bool		overflow_required = false;
+		int			phase = BarrierPhase(stripe_barrier);
+
+		switch (PHJ_STRIPE_PHASE(phase))
+		{
+			case PHJ_STRIPE_ELECTING:
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_ELECT))
+					sts_reinitialize(outer_tuples);
+				/* FALLTHROUGH */
+			case PHJ_STRIPE_RESETTING:
+
+				/*
+				 * This barrier allows the elected worker to finish resetting
+				 * the read_page for the outer side as well as allowing the
+				 * worker which was elected to clear out the hashtable from
+				 * the last stripe to finish.
+				 */
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_RESET);
+				/* FALLTHROUGH */
+			case PHJ_STRIPE_LOADING:
+
+				/*
+				 * Start (or join in) loading the next stripe of inner tuples.
+				 */
+				sts_begin_parallel_scan(inner_tuples);
+
+				/*
+				 * TODO: add functionality to pre-alloc some memory before
+				 * calling sts_parallel_scan_next() because that will reserve
+				 * an additional STS_CHUNK for every stripe for each worker
+				 * that won't fit, so we should first see if the chunk would
+				 * fit before getting the assignment
+				 */
+				while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+				{
+					ExecForceStoreMinimalTuple(tuple, hjstate->hj_HashTupleSlot, false);
+					if (!ExecParallelHashTableInsertCurrentBatch(hashtable, hjstate->hj_HashTupleSlot, metadata.hashvalue, sta_get_read_participant(inner_tuples)))
+					{
+						overflow_required = true;
+						pg_atomic_test_set_flag(&batch->overflow_required);
+						break;
+					}
+				}
+
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+				{
+					if (!pg_atomic_unlocked_test_flag(&batch->overflow_required))
+						batch->nstripes++;
+				}
+				/* FALLTHROUGH */
+			case PHJ_STRIPE_OVERFLOWING:
+				if (overflow_required)
+				{
+					Assert(tuple);
+					sts_spill_leftover_tuples(inner_tuples, tuple, metadata.hashvalue);
+				}
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_OVERFLOW);
+
+				/* FALLTHROUGH */
+			case PHJ_STRIPE_PROBING:
+				{
+					/*
+					 * do this again here in case a worker began the scan and
+					 * then entered after loading before probing
+					 */
+					sts_end_parallel_scan(inner_tuples);
+					sts_begin_parallel_scan(outer_tuples);
+					return true;
+				}
+
+			case PHJ_STRIPE_DONE:
+				if (PHJ_STRIPE_NUMBER(phase) >= batch->nstripes)
+				{
+					/*
+					 * Handle the phantom stripe case.
+					 */
+					if (batch->hashloop_fallback && HJ_FILL_OUTER(hjstate))
+						goto fallback_stripe;
+
+					/* Return if this is the last stripe */
+					return ExecHashTableDetachStripe(hashtable);
+				}
+
+				/* this, effectively, increments the stripe number */
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+				{
+					ExecParallelHashTableRecycle(hashtable);
+					pg_atomic_clear_flag(&batch->overflow_required);
+				}
+
+				hashtable->curstripe++;
+				continue;
+
+			default:
+				elog(ERROR, "unexpected stripe phase %d. pid %i. batch %i.", BarrierPhase(stripe_barrier), MyProcPid, batchno);
+		}
+	}
+
+fallback_stripe:
+	sb_end_write(batch_accessor->sba);
+
+	/* Ensure that only a single worker is attached to the barrier */
+	if (!BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+		return ExecHashTableDetachStripe(hashtable);
+
+	/* No one except the last worker will run this code */
+	hashtable->curstripe = PHANTOM_STRIPE;
+
+	ExecParallelHashTableRecycle(hashtable);
+	pg_atomic_clear_flag(&batch->overflow_required);
+
+	/*
+	 * If all workers (including this one) have finished probing the batch,
+	 * one worker is elected to Loop through the outer match status files from
+	 * all workers that were attached to this batch Combine them into one
+	 * bitmap Use the bitmap, loop through the outer batch file again, and
+	 * emit unmatched tuples All workers will detach from the batch barrier
+	 * and the last worker will clean up the hashtable. All workers except the
+	 * last worker will end their scans of the outer and inner side. The last
+	 * worker will end its scan of the inner side
+	 */
+	sb_combine(batch_accessor->sba);
+	sts_reinitialize(outer_tuples);
+
+	sts_begin_parallel_scan(outer_tuples);
+
+	return true;
+}
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1364,6 +1940,9 @@ ExecReScanHashJoin(HashJoinState *node)
 	node->hj_MatchedOuter = false;
 	node->hj_FirstOuterTupleSlot = NULL;
 
+	node->hj_CurNumOuterTuples = 0;
+	node->hj_CurOuterMatchStatus = 0;
+
 	/*
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
@@ -1394,7 +1973,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	TupleTableSlot *slot;
-	uint32		hashvalue;
 	int			i;
 
 	Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1402,6 +1980,8 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	/* Execute outer plan, writing all tuples to shared tuplestores. */
 	for (;;)
 	{
+		tupleMetadata metadata;
+
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
 			break;
@@ -1410,17 +1990,25 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 								 hjstate->hj_OuterHashKeys,
 								 true,	/* outer tuple */
 								 HJ_FILL_OUTER(hjstate),
-								 &hashvalue))
+								 &metadata.hashvalue))
 		{
 			int			batchno;
 			int			bucketno;
 			bool		shouldFree;
+			SharedTuplestoreAccessor *accessor;
+
 			MinimalTuple mintup = ExecFetchSlotMinimalTuple(slot, &shouldFree);
 
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
+			accessor = hashtable->batches[batchno].outer_tuples;
+
+			/* cannot count on deterministic order of tupleids */
+			metadata.tupleid = sts_increment_ntuples(accessor);
+
 			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+						 &metadata.hashvalue,
+						 mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
@@ -1481,6 +2069,8 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	LWLockInitialize(&pstate->lock,
 					 LWTRANCHE_PARALLEL_HASH_JOIN);
 	BarrierInit(&pstate->build_barrier, 0);
+	BarrierInit(&pstate->eviction_barrier, 0);
+	BarrierInit(&pstate->repartition_barrier, 0);
 	BarrierInit(&pstate->grow_batches_barrier, 0);
 	BarrierInit(&pstate->grow_buckets_barrier, 0);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8116b23614..e6643ad66c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3779,8 +3779,20 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BATCH_ELECT:
 			event_name = "HashBatchElect";
 			break;
-		case WAIT_EVENT_HASH_BATCH_LOAD:
-			event_name = "HashBatchLoad";
+		case WAIT_EVENT_HASH_STRIPE_ELECT:
+			event_name = "HashStripeElect";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_RESET:
+			event_name = "HashStripeReset";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_LOAD:
+			event_name = "HashStripeLoad";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_OVERFLOW:
+			event_name = "HashStripeOverflow";
+			break;
+		case WAIT_EVENT_HASH_STRIPE_PROBE:
+			event_name = "HashStripeProbe";
 			break;
 		case WAIT_EVENT_HASH_BUILD_ALLOCATE:
 			event_name = "HashBuildAllocate";
@@ -3794,6 +3806,21 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_HASH_BUILD_HASH_OUTER:
 			event_name = "HashBuildHashOuter";
 			break;
+		case WAIT_EVENT_HASH_EVICT_ELECT:
+			event_name = "HashEvictElect";
+			break;
+		case WAIT_EVENT_HASH_EVICT_RESET:
+			event_name = "HashEvictReset";
+			break;
+		case WAIT_EVENT_HASH_EVICT_SPILL:
+			event_name = "HashEvictSpill";
+			break;
+		case WAIT_EVENT_HASH_EVICT_FINISH:
+			event_name = "HashEvictFinish";
+			break;
+		case WAIT_EVENT_HASH_REPARTITION_BATCH0_DRAIN_QUEUE:
+			event_name = "HashRepartitionBatch0DrainQueue";
+			break;
 		case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATE:
 			event_name = "HashGrowBatchesAllocate";
 			break;
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261..f11fe85aeb 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	logtape.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedbits.c b/src/backend/utils/sort/sharedbits.c
new file mode 100644
index 0000000000..be7000b08c
--- /dev/null
+++ b/src/backend/utils/sort/sharedbits.c
@@ -0,0 +1,288 @@
+#include "postgres.h"
+
+#include <fcntl.h>
+
+#include "storage/buffile.h"
+#include "utils/sharedbits.h"
+
+/*
+ * TODO: put a comment about not currently supporting parallel scan of the SharedBits
+ * To support parallel scan, need to introduce many more mechanisms
+ */
+
+/* Per-participant shared state */
+struct SharedBitsParticipant
+{
+	bool		present;
+	bool		writing;
+};
+
+/* Shared control object */
+struct SharedBits
+{
+	int			nparticipants;	/* Number of participants that can write. */
+	int64		nbits;
+	char		name[NAMEDATALEN];	/* A name for this bitstore. */
+
+	SharedBitsParticipant participants[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* backend-local state */
+struct SharedBitsAccessor
+{
+	int			participant;
+	SharedBits *bits;
+	SharedFileSet *fileset;
+	BufFile    *write_file;
+	BufFile    *combined;
+};
+
+SharedBitsAccessor *
+sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset)
+{
+	SharedBitsAccessor *accessor = palloc0(sizeof(SharedBitsAccessor));
+
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+SharedBitsAccessor *
+sb_initialize(SharedBits *sbits,
+			  int participants,
+			  int my_participant_number,
+			  SharedFileSet *fileset,
+			  char *name)
+{
+	SharedBitsAccessor *accessor;
+
+	sbits->nparticipants = participants;
+	strcpy(sbits->name, name);
+	sbits->nbits = 0;			/* TODO: maybe delete this */
+
+	accessor = palloc0(sizeof(SharedBitsAccessor));
+	accessor->participant = my_participant_number;
+	accessor->bits = sbits;
+	accessor->fileset = fileset;
+	accessor->write_file = NULL;
+	accessor->combined = NULL;
+	return accessor;
+}
+
+/*  TODO: is "initialize_accessor" a clear enough API for this? (making the file)? */
+void
+sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits)
+{
+	char		name[MAXPGPATH];
+	uint32		num_to_write;
+
+	snprintf(name, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, accessor->participant);
+
+	accessor->write_file =
+		BufFileCreateShared(accessor->fileset, name);
+
+	accessor->bits->participants[accessor->participant].present = true;
+	/* TODO: check this math. tuplenumber will be too high? */
+	num_to_write = nbits / 8 + 1;
+
+	/*
+	 * TODO: add tests that could exercise a problem with junk being written
+	 * to bitmap
+	 */
+
+	/*
+	 * TODO: is there a better way to write the bytes to the file without
+	 * calling BufFileWrite() like this? palloc()ing an undetermined number of
+	 * bytes feels like it is against the spirit of this patch to begin with,
+	 * but the many function calls seem expensive
+	 */
+	for (int i = 0; i < num_to_write; i++)
+	{
+		unsigned char byteToWrite = 0;
+
+		BufFileWrite(accessor->write_file, &byteToWrite, 1);
+	}
+
+	if (BufFileSeek(accessor->write_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+}
+
+size_t
+sb_estimate(int participants)
+{
+	return offsetof(SharedBits, participants) + participants * sizeof(SharedBitsParticipant);
+}
+
+
+void
+sb_setbit(SharedBitsAccessor *accessor, uint64 bit)
+{
+	SharedBitsParticipant *const participant =
+	&accessor->bits->participants[accessor->participant];
+
+	/* TODO: use an unsigned int instead of a byte */
+	unsigned char current_outer_byte;
+
+	Assert(accessor->write_file);
+
+	if (!participant->writing)
+	{
+		participant->writing = true;
+	}
+
+	BufFileSeek(accessor->write_file, 0, bit / 8, SEEK_SET);
+	BufFileRead(accessor->write_file, &current_outer_byte, 1);
+
+	current_outer_byte |= 1U << (bit % 8);
+
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_CUR);
+	BufFileWrite(accessor->write_file, &current_outer_byte, 1);
+}
+
+bool
+sb_checkbit(SharedBitsAccessor *accessor, uint32 n)
+{
+	bool		match;
+	uint32		bytenum = n / 8;
+	unsigned char bit = n % 8;
+	unsigned char byte_to_check = 0;
+
+	Assert(accessor->combined);
+
+	/* seek to byte to check */
+	if (BufFileSeek(accessor->combined,
+					0,
+					bytenum,
+					SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not rewind shared outer temporary file: %m")));
+	/* read byte containing ntuple bit */
+	if (BufFileRead(accessor->combined, &byte_to_check, 1) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	/* if bit is set */
+	match = ((byte_to_check) >> bit) & 1;
+
+	return match;
+}
+
+BufFile *
+sb_combine(SharedBitsAccessor *accessor)
+{
+	/*
+	 * TODO: this tries to close an outer match status file for each
+	 * participant in the tuplestore. technically, only participants in the
+	 * barrier could have outer match status files, however, all but one
+	 * participant continue on and detach from the barrier so we won't have a
+	 * reliable way to close only files for those attached to the barrier
+	 */
+	BufFile   **statuses;
+	BufFile    *combined_bitmap_file;
+	int			statuses_length;
+
+	int			nbparticipants = 0;
+
+	for (int l = 0; l < accessor->bits->nparticipants; l++)
+	{
+		SharedBitsParticipant participant = accessor->bits->participants[l];
+
+		if (participant.present)
+		{
+			Assert(!participant.writing);
+			nbparticipants++;
+		}
+	}
+	statuses = palloc(sizeof(BufFile *) * nbparticipants);
+
+	/*
+	 * Open the bitmap shared BufFile from each participant. TODO: explain why
+	 * file can be NULLs
+	 */
+	statuses_length = 0;
+
+	for (int i = 0; i < accessor->bits->nparticipants; i++)
+	{
+		char		bitmap_filename[MAXPGPATH];
+		BufFile    *file;
+
+		/* TODO: make a function that will do this */
+		snprintf(bitmap_filename, MAXPGPATH, "%s.p%d.bitmap", accessor->bits->name, i);
+
+		if (!accessor->bits->participants[i].present)
+			continue;
+		file = BufFileOpenShared(accessor->fileset, bitmap_filename, O_RDWR);
+		/* TODO: can we be sure that this file is at beginning? */
+		Assert(file);
+
+		statuses[statuses_length++] = file;
+	}
+
+	combined_bitmap_file = BufFileCreateTemp(false);
+
+	for (int64 cur = 0; cur < BufFileSize(statuses[0]); cur++)	/* make it while not EOF */
+	{
+		/*
+		 * TODO: make this use an unsigned int instead of a byte so it isn't
+		 * so slow
+		 */
+		unsigned char combined_byte = 0;
+
+		for (int i = 0; i < statuses_length; i++)
+		{
+			unsigned char read_byte;
+
+			BufFileRead(statuses[i], &read_byte, 1);
+			combined_byte |= read_byte;
+		}
+
+		BufFileWrite(combined_bitmap_file, &combined_byte, 1);
+	}
+
+	if (BufFileSeek(combined_bitmap_file, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	for (int i = 0; i < statuses_length; i++)
+		BufFileClose(statuses[i]);
+	pfree(statuses);
+
+	accessor->combined = combined_bitmap_file;
+	return combined_bitmap_file;
+}
+
+void
+sb_end_write(SharedBitsAccessor *sba)
+{
+	SharedBitsParticipant
+			   *const participant = &sba->bits->participants[sba->participant];
+
+	participant->writing = false;
+
+	/*
+	 * TODO: this should not be needed if flow is correct. need to fix that
+	 * and get rid of this check
+	 */
+	if (sba->write_file)
+		BufFileClose(sba->write_file);
+	sba->write_file = NULL;
+}
+
+void
+sb_end_read(SharedBitsAccessor *accessor)
+{
+	if (accessor->combined == NULL)
+		return;
+
+	BufFileClose(accessor->combined);
+	accessor->combined = NULL;
+}
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index b83fb50dac..cb5d950676 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -47,19 +47,28 @@ typedef struct SharedTuplestoreChunk
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } SharedTuplestoreChunk;
 
+typedef enum SharedTuplestoreMode
+{
+	WRITING = 0,
+	READING = 1,
+	APPENDING = 2
+} SharedTuplestoreMode;
+
 /* Per-participant shared state. */
 typedef struct SharedTuplestoreParticipant
 {
 	LWLock		lock;
 	BlockNumber read_page;		/* Page number for next read. */
+	bool		rewound;
 	BlockNumber npages;			/* Number of pages written. */
-	bool		writing;		/* Used only for assertions. */
+	SharedTuplestoreMode mode;	/* Used only for assertions. */
 } SharedTuplestoreParticipant;
 
 /* The control object that lives in shared memory. */
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;	/* Number of tuples in this tuplestore. */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -92,6 +101,8 @@ struct SharedTuplestoreAccessor
 	BlockNumber write_page;		/* The next page to write to. */
 	char	   *write_pointer;	/* Current write pointer within chunk. */
 	char	   *write_end;		/* One past the end of the current chunk. */
+	bool		participated;	/* Did the worker participate in writing this
+								 * STS at any point */
 };
 
 static void sts_filename(char *name, SharedTuplestoreAccessor *accessor,
@@ -137,6 +148,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -158,7 +170,8 @@ sts_initialize(SharedTuplestore *sts, int participants,
 		LWLockInitialize(&sts->participants[i].lock,
 						 LWTRANCHE_SHARED_TUPLESTORE);
 		sts->participants[i].read_page = 0;
-		sts->participants[i].writing = false;
+		sts->participants[i].rewound = false;
+		sts->participants[i].mode = READING;
 	}
 
 	accessor = palloc0(sizeof(SharedTuplestoreAccessor));
@@ -188,6 +201,7 @@ sts_attach(SharedTuplestore *sts,
 	accessor->sts = sts;
 	accessor->fileset = fileset;
 	accessor->context = CurrentMemoryContext;
+	accessor->participated = false;
 
 	return accessor;
 }
@@ -219,7 +233,9 @@ sts_end_write(SharedTuplestoreAccessor *accessor)
 		pfree(accessor->write_chunk);
 		accessor->write_chunk = NULL;
 		accessor->write_file = NULL;
-		accessor->sts->participants[accessor->participant].writing = false;
+		accessor->write_pointer = NULL;
+		accessor->write_end = NULL;
+		accessor->sts->participants[accessor->participant].mode = READING;
 	}
 }
 
@@ -263,7 +279,7 @@ sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor)
 	 * files have stopped growing.
 	 */
 	for (i = 0; i < accessor->sts->nparticipants; ++i)
-		Assert(!accessor->sts->participants[i].writing);
+		Assert((accessor->sts->participants[i].mode == READING) || (accessor->sts->participants[i].mode == APPENDING));
 
 	/*
 	 * We will start out reading the file that THIS backend wrote.  There may
@@ -311,10 +327,11 @@ sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
 		/* Create one.  Only this backend will write into it. */
 		sts_filename(name, accessor, accessor->participant);
 		accessor->write_file = BufFileCreateShared(accessor->fileset, name);
+		accessor->participated = true;
 
 		/* Set up the shared state for this backend's file. */
 		participant = &accessor->sts->participants[accessor->participant];
-		participant->writing = true;	/* for assertions only */
+		participant->mode = WRITING;	/* for assertions only */
 	}
 
 	/* Do we have space? */
@@ -513,6 +530,17 @@ sts_read_tuple(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return tuple;
 }
 
+MinimalTuple
+sts_parallel_scan_chunk(SharedTuplestoreAccessor *accessor,
+						void *meta_data,
+						bool inner)
+{
+	Assert(accessor->read_file);
+	if (accessor->read_ntuples < accessor->read_ntuples_available)
+		return sts_read_tuple(accessor, meta_data);
+	return NULL;
+}
+
 /*
  * Get the next tuple in the current parallel scan.
  */
@@ -526,7 +554,13 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	for (;;)
 	{
 		/* Can we read more tuples from the current chunk? */
-		if (accessor->read_ntuples < accessor->read_ntuples_available)
+		/*
+		 * Added a check for accessor->read_file being present here, as it
+		 * became relevant for adaptive hashjoin. TODO: Not sure if this has
+		 * other consequences for correctness
+		 */
+
+		if (accessor->read_ntuples < accessor->read_ntuples_available && accessor->read_file)
 			return sts_read_tuple(accessor, meta_data);
 
 		/* Find the location of a new chunk to read. */
@@ -618,6 +652,56 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+uint32
+sts_increment_ntuples(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
+}
+
+int
+sta_get_read_participant(SharedTuplestoreAccessor *accessor)
+{
+	return accessor->read_participant;
+}
+
+void
+sts_spill_leftover_tuples(SharedTuplestoreAccessor *accessor, MinimalTuple tuple, uint32 hashvalue)
+{
+	tupleMetadata metadata;
+	SharedTuplestoreParticipant *participant;
+	char		name[MAXPGPATH];
+
+	metadata.hashvalue = hashvalue;
+	participant = &accessor->sts->participants[accessor->participant];
+	participant->mode = APPENDING;	/* for assertions only */
+
+	sts_filename(name, accessor, accessor->participant);
+	if (!accessor->participated)
+	{
+		accessor->write_file = BufFileCreateShared(accessor->fileset, name);
+		accessor->participated = true;
+	}
+
+	else
+		accessor->write_file = BufFileOpenShared(accessor->fileset, name, O_WRONLY);
+
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_END);
+	do
+	{
+		sts_puttuple(accessor, &metadata, tuple);
+	} while ((tuple = sts_parallel_scan_chunk(accessor, &metadata, true)));
+
+	accessor->read_ntuples = 0;
+	accessor->read_ntuples_available = 0;
+	sts_end_write(accessor);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ba661d32a6..0ba9d856c8 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -46,6 +46,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		settings;		/* print modified settings */
+	bool		usage;			/* print memory usage */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
 	int			indent;			/* current indentation level */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index eb5daba36b..e9354cc6e0 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -142,6 +143,17 @@ typedef struct HashMemoryChunkData *HashMemoryChunk;
 /* tuples exceeding HASH_CHUNK_THRESHOLD bytes are put in their own chunk */
 #define HASH_CHUNK_THRESHOLD	(HASH_CHUNK_SIZE / 4)
 
+/*
+ * HashJoinTableData->curstripe the current stripe number
+ * The phantom stripe refers to the state of the inner side hashtable (empty)
+ * during the final scan of the outer batch file for a batch being processed
+ * using the hashloop fallback algorithm.
+ * In parallel-aware hash join, curstripe is in a detached state
+ * when the worker is not attached to the stripe_barrier.
+ */
+#define PHANTOM_STRIPE -2
+#define STRIPE_DETACHED -1
+
 /*
  * For each batch of a Parallel Hash Join, we have a ParallelHashJoinBatch
  * object in shared memory to coordinate access to it.  Since they are
@@ -152,14 +164,34 @@ typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
+	Barrier		stripe_barrier; /* synchronization for stripes */
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
 	size_t		estimated_size; /* size of buckets + chunks while writing */
-	size_t		ntuples;		/* number of tuples loaded */
+	 /* total number of tuples loaded into batch (in memory and spill files) */
+	size_t		ntuples;
 	size_t		old_ntuples;	/* number of tuples before repartitioning */
 	bool		space_exhausted;
 
+	/* Adaptive HashJoin */
+
+	/*
+	 * after finishing build phase, hashloop_fallback cannot change, and does
+	 * not require a lock to read
+	 */
+	pg_atomic_flag overflow_required;
+	bool		hashloop_fallback;
+	int			nstripes;		/* the number of stripes in the batch */
+	/* number of tuples loaded into the hashtable */
+	pg_atomic_uint64 ntuples_in_memory;
+
+	/*
+	 * Note that ntuples will reflect the total number of tuples in the batch
+	 * while ntuples_in_memory will reflect how many tuples are in memory
+	 */
+	LWLock		lock;
+
 	/*
 	 * Variable-sized SharedTuplestore objects follow this struct in memory.
 	 * See the accessor macros below.
@@ -177,10 +209,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -204,9 +243,19 @@ typedef struct ParallelHashJoinBatchAccessor
 	size_t		old_ntuples;	/* how many tuples before repartitioning? */
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 
-	bool		done;			/* flag to remember that a batch is done */
+	int			done;			/* flag to remember that a batch is done */
+	/* -1 for not done, 0 for tentatively done, 1 for done */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
+
+	/*
+	 * All participants except the last worker working on a batch which has
+	 * fallen back to hashloop processing save the stripe barrier phase and
+	 * detach to avoid the deadlock hazard of waiting on a barrier after
+	 * tuples have been emitted.
+	 */
+	int			last_participating_stripe_phase;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -223,10 +272,28 @@ typedef enum ParallelHashGrowth
 	PHJ_GROWTH_NEED_MORE_BUCKETS,
 	/* The memory budget would be exhausted, so we need to repartition. */
 	PHJ_GROWTH_NEED_MORE_BATCHES,
-	/* Repartitioning didn't help last time, so don't try to do that again. */
-	PHJ_GROWTH_DISABLED
+
+	/*
+	 * While repartitioning or, if nbatches would overflow int, disable growth
+	 * in the number of batches
+	 */
+	PHJ_GROWTH_DISABLED,
+	PHJ_GROWTH_SPILL_BATCH0,
+	PHJ_GROWTH_LOADING
 } ParallelHashGrowth;
 
+typedef enum ParallelHashJoinBatchAccessorStatus
+{
+	/* No more useful work can be done on this batch by this worker */
+	PHJ_BATCH_ACCESSOR_DONE,
+
+	/*
+	 * The worker has not yet checked this batch to see if it can do useful
+	 * work
+	 */
+	PHJ_BATCH_ACCESSOR_NOT_DONE
+}			ParallelHashJoinBatchAccessorStatus;
+
 /*
  * The shared state used to coordinate a Parallel Hash Join.  This is stored
  * in the DSM segment.
@@ -246,6 +313,8 @@ typedef struct ParallelHashJoinState
 	LWLock		lock;			/* lock protecting the above */
 
 	Barrier		build_barrier;	/* synchronization for the build phases */
+	Barrier		eviction_barrier;
+	Barrier		repartition_barrier;
 	Barrier		grow_batches_barrier;
 	Barrier		grow_buckets_barrier;
 	pg_atomic_uint32 distributor;	/* counter for load balancing */
@@ -263,9 +332,42 @@ typedef struct ParallelHashJoinState
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECTING				0
 #define PHJ_BATCH_ALLOCATING			1
-#define PHJ_BATCH_LOADING				2
-#define PHJ_BATCH_PROBING				3
-#define PHJ_BATCH_DONE					4
+#define PHJ_BATCH_STRIPING				2
+#define PHJ_BATCH_DONE					3
+
+/* The phases for probing each stripe of each batch used with stripe barriers */
+#define PHJ_STRIPE_INVALID_PHASE        -1
+#define PHJ_STRIPE_ELECTING				0
+#define PHJ_STRIPE_RESETTING			1
+#define PHJ_STRIPE_LOADING				2
+#define PHJ_STRIPE_OVERFLOWING          3
+#define PHJ_STRIPE_PROBING				4
+#define PHJ_STRIPE_DONE				    5
+#define PHJ_STRIPE_NUMBER(n)            ((n) / 6)
+#define PHJ_STRIPE_PHASE(n)             ((n) % 6)
+
+#define PHJ_EVICT_ELECTING 0
+#define PHJ_EVICT_RESETTING 1
+#define PHJ_EVICT_SPILLING 2
+#define PHJ_EVICT_FINISHING 3
+#define PHJ_EVICT_DONE 4
+#define PHJ_EVICT_PHASE(n)          ((n) % 5)
+
+/*
+ * These phases are now required for repartitioning batch 0 since it can
+ * spill. First all tuples which were resident in the hashtable need to
+ * be relocated either back to the hashtable or to a spill file, if they
+ * would relocate to a batch 1+ given the new number of batches. After
+ * draining the chunk_work_queue, we must drain the batch 0 spill file,
+ * if it exists. Some tuples may have been relocated from the hashtable
+ * to other batches, in which case, space may have been freed up which
+ * the tuples from the batch 0 spill file can occupy. The tuples from the
+ * batch 0 spill file may go to 1) the hashtable, 2) back to the batch 0
+ * spill file in the new generation of batches, 3) to a batch file 1+
+ */
+#define PHJ_REPARTITION_BATCH0_DRAIN_QUEUE 0
+#define PHJ_REPARTITION_BATCH0_DRAIN_SPILL_FILE 1
+#define PHJ_REPARTITION_BATCH0_PHASE(n)  ((n) % 2)
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECTING		0
@@ -313,8 +415,6 @@ typedef struct HashJoinTableData
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
 
-	bool		growEnabled;	/* flag to shut off nbatch increases */
-
 	double		totalTuples;	/* # tuples obtained from inner plan */
 	double		partialTuples;	/* # tuples obtained from inner plan by me */
 	double		skewTuples;		/* # tuples inserted into skew tuples */
@@ -329,6 +429,18 @@ typedef struct HashJoinTableData
 	BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
 	BufFile   **outerBatchFile; /* buffered virtual temp file per batch */
 
+	/*
+	 * Adaptive hashjoin variables
+	 */
+	BufFile   **hashloopBatchFile;	/* outer match status files if fall back */
+	List	   *fallback_batches_stats; /* per hashjoin batch statistics */
+
+	/*
+	 * current stripe #; 0 during 1st pass, -1 (macro STRIPE_DETACHED) when
+	 * detached, -2 on phantom stripe (macro PHANTOM_STRIPE)
+	 */
+	int			curstripe;
+
 	/*
 	 * Info about the datatype-specific hash functions for the datatypes being
 	 * hashed. These are arrays of the same length as the number of hash join
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9dc3ecb07d..839086005c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "nodes/pg_list.h"
 
 
 typedef struct BufferUsage
@@ -39,6 +40,12 @@ typedef struct WalUsage
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
+typedef struct FallbackBatchStats
+{
+	int			batchno;
+	int			numstripes;
+} FallbackBatchStats;
+
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
 typedef enum InstrumentOption
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 2db4e2f672..6d094e1a43 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -31,6 +31,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+extern bool ExecHashTableDetachStripe(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
@@ -40,9 +41,11 @@ extern void ExecHashTableInsert(HashJoinTable hashtable,
 extern void ExecParallelHashTableInsert(HashJoinTable hashtable,
 										TupleTableSlot *slot,
 										uint32 hashvalue);
-extern void ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
+extern MinimalTuple
+			ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
 													TupleTableSlot *slot,
-													uint32 hashvalue);
+													uint32 hashvalue,
+													int read_participant);
 extern bool ExecHashGetHashValue(HashJoinTable hashtable,
 								 ExprContext *econtext,
 								 List *hashkeys,
@@ -59,6 +62,8 @@ extern void ExecPrepHashTableForUnmatched(HashJoinState *hjstate);
 extern bool ExecScanHashTableForUnmatched(HashJoinState *hjstate,
 										  ExprContext *econtext);
 extern void ExecHashTableReset(HashJoinTable hashtable);
+extern void
+			ExecParallelHashTableRecycle(HashJoinTable hashtable);
 extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5ab..0c0d87d1d3 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -129,6 +129,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tts_tuplenum;	/* a tuple id for use when ctid cannot be used */
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,6 +426,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
+	slot->tts_tuplenum = 0;		/* TODO: should this be done elsewhere? */
 
 	return slot;
 }
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0b42dd6f94..cb30e3bea1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1959,6 +1959,10 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+	/* Adaptive Hashjoin variables */
+	int			hj_CurNumOuterTuples;	/* number of outer tuples in a batch */
+	unsigned int hj_CurOuterMatchStatus;
+	int			hj_EmitOuterTupleId;
 } HashJoinState;
 
 
@@ -2387,6 +2391,7 @@ typedef struct HashInstrumentation
 	int			nbatch;			/* number of batches at end of execution */
 	int			nbatch_original;	/* planned number of batches */
 	Size		space_peak;		/* peak memory usage in bytes */
+	List	   *fallback_batches_stats; /* per hashjoin batch stats */
 } HashInstrumentation;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 807a9c1edf..399c442171 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -855,11 +855,20 @@ typedef enum
 	WAIT_EVENT_EXECUTE_GATHER,
 	WAIT_EVENT_HASH_BATCH_ALLOCATE,
 	WAIT_EVENT_HASH_BATCH_ELECT,
-	WAIT_EVENT_HASH_BATCH_LOAD,
+	WAIT_EVENT_HASH_STRIPE_ELECT,
+	WAIT_EVENT_HASH_STRIPE_RESET,
+	WAIT_EVENT_HASH_STRIPE_LOAD,
+	WAIT_EVENT_HASH_STRIPE_OVERFLOW,
+	WAIT_EVENT_HASH_STRIPE_PROBE,
 	WAIT_EVENT_HASH_BUILD_ALLOCATE,
 	WAIT_EVENT_HASH_BUILD_ELECT,
 	WAIT_EVENT_HASH_BUILD_HASH_INNER,
 	WAIT_EVENT_HASH_BUILD_HASH_OUTER,
+	WAIT_EVENT_HASH_EVICT_ELECT,
+	WAIT_EVENT_HASH_EVICT_RESET,
+	WAIT_EVENT_HASH_EVICT_SPILL,
+	WAIT_EVENT_HASH_EVICT_FINISH,
+	WAIT_EVENT_HASH_REPARTITION_BATCH0_DRAIN_QUEUE,
 	WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATE,
 	WAIT_EVENT_HASH_GROW_BATCHES_DECIDE,
 	WAIT_EVENT_HASH_GROW_BATCHES_ELECT,
diff --git a/src/include/utils/sharedbits.h b/src/include/utils/sharedbits.h
new file mode 100644
index 0000000000..de43279de8
--- /dev/null
+++ b/src/include/utils/sharedbits.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * sharedbits.h
+ *	  Simple mechanism for sharing bits between backends.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/sharedbits.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHAREDBITS_H
+#define SHAREDBITS_H
+
+#include "storage/sharedfileset.h"
+
+struct SharedBits;
+typedef struct SharedBits SharedBits;
+
+struct SharedBitsParticipant;
+typedef struct SharedBitsParticipant SharedBitsParticipant;
+
+struct SharedBitsAccessor;
+typedef struct SharedBitsAccessor SharedBitsAccessor;
+
+extern SharedBitsAccessor *sb_attach(SharedBits *sbits, int my_participant_number, SharedFileSet *fileset);
+extern SharedBitsAccessor *sb_initialize(SharedBits *sbits, int participants, int my_participant_number, SharedFileSet *fileset, char *name);
+extern void sb_initialize_accessor(SharedBitsAccessor *accessor, uint32 nbits);
+extern size_t sb_estimate(int participants);
+
+extern void sb_setbit(SharedBitsAccessor *accessor, uint64 bit);
+extern bool sb_checkbit(SharedBitsAccessor *accessor, uint32 n);
+extern BufFile *sb_combine(SharedBitsAccessor *accessor);
+
+extern void sb_end_write(SharedBitsAccessor *sba);
+extern void sb_end_read(SharedBitsAccessor *accessor);
+
+#endif							/* SHAREDBITS_H */
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index 9754504cc5..5f8d95cb1a 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,17 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			stripe;		/* stripe number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -58,4 +69,14 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+extern uint32 sts_increment_ntuples(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
+extern int	sta_get_read_participant(SharedTuplestoreAccessor *accessor);
+extern void sts_spill_leftover_tuples(SharedTuplestoreAccessor *accessor, MinimalTuple tuple, uint32 hashvalue);
+
+extern MinimalTuple sts_parallel_scan_chunk(SharedTuplestoreAccessor *accessor,
+											void *meta_data,
+											bool inner);
+
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 3a91c144a2..aa7477a299 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -839,45 +839,26 @@ rollback to settings;
 -- the hash table)
 -- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
 -- sts_puttuple oversized tuple cases because it's multi-batch)
-savepoint settings;
-set max_parallel_workers_per_gather = 2;
-set enable_parallel_hash = on;
-set work_mem = '128kB';
-explain (costs off)
-  select length(max(s.t))
-  from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-                           QUERY PLAN                           
-----------------------------------------------------------------
- Finalize Aggregate
-   ->  Gather
-         Workers Planned: 2
-         ->  Partial Aggregate
-               ->  Parallel Hash Left Join
-                     Hash Cond: (wide.id = wide_1.id)
-                     ->  Parallel Seq Scan on wide
-                     ->  Parallel Hash
-                           ->  Parallel Seq Scan on wide wide_1
-(9 rows)
-
-select length(max(s.t))
-from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
- length 
---------
- 320000
-(1 row)
-
-select final > 1 as multibatch
-  from hash_join_batches(
-$$
-  select length(max(s.t))
-  from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-$$);
- multibatch 
-------------
- t
-(1 row)
-
-rollback to settings;
+-- savepoint settings;
+-- set max_parallel_workers_per_gather = 2;
+-- set enable_parallel_hash = on;
+-- TODO: throw an error when this happens: cannot set work_mem lower than the side of a single tuple
+-- TODO: ensure that oversize tuple code is still exercised (should be with some of the stub stuff below)
+-- TODO: commented this out since it would crash otherwise
+-- this test is no longer multi-batch, so, perhaps, it should be removed
+-- set work_mem = '128kB';
+-- explain (costs off)
+--   select length(max(s.t))
+--   from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- select length(max(s.t))
+-- from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- select final > 1 as multibatch
+--   from hash_join_batches(
+-- $$
+--   select length(max(s.t))
+--   from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- $$);
+-- rollback to settings;
 rollback;
 -- Verify that hash key expressions reference the correct
 -- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
@@ -1013,3 +994,1968 @@ WHERE
 (1 row)
 
 ROLLBACK;
+-- Serial Adaptive Hash Join
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8090));
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+ANALYZE probeside, hashside_wide;
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Left Join (actual rows=215 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | btrim | id | hash |                 btrim                  
+------+-------+----+------+----------------------------------------
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    3 |       |  3 |    3 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+      |       |  1 |    1 | unmatched inner tuple in first stripe
+      |       |  1 |    1 | unmatched inner tuple in last stripe
+      |       |  1 |    1 | unmatched inner tuple in middle stripe
+(214 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Right Join (actual rows=214 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash |                 btrim                  
+------+-----------------------+----+------+----------------------------------------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+      |                       |  1 |    1 | unmatched inner tuple in first stripe
+      |                       |  1 |    1 | unmatched inner tuple in last stripe
+      |                       |  1 |    1 | unmatched inner tuple in middle stripe
+(218 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Full Join (actual rows=218 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join (actual rows=12 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash | btrim 
+------+-------
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    3 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+(12 rows)
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Anti Join (actual rows=4 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash |         btrim         
+------+-----------------------
+    1 | unmatched outer tuple
+    2 | 
+    5 | 
+    6 | unmatched outer tuple
+(4 rows)
+
+-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Gather (actual rows=215 loops=1)
+   Workers Planned: 1
+   Workers Launched: 1
+   ->  Parallel Hash Left Join (actual rows=108 loops=2)
+         Hash Cond: (probeside.a = hashside_wide.a)
+         ->  Parallel Seq Scan on probeside (actual rows=16 loops=1)
+         ->  Parallel Hash (actual rows=21 loops=2)
+               Buckets: 8 (originally 8)  Batches: 128 (originally 8)
+               Batch: 1  Stripes: 3
+               Batch: 6  Stripes: 3
+               ->  Parallel Seq Scan on hashside_wide (actual rows=42 loops=1)
+(11 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+rollback to settings;
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(id int generated always as identity, a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0(a) SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0(a) SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide_batch0(id int generated always as identity, a stub);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+SELECT
+       hashside_wide_batch0.id as hashside_id, 
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id, 
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value 
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+ hashside_id | hashside_hash | probeside_id | probeside_hash | probeside_trimmed_value | hashside_trimmed_value 
+-------------+---------------+--------------+----------------+-------------------------+------------------------
+           1 |             0 |            1 |              0 |                         | 
+           1 |             0 |            2 |              0 |                         | 
+           1 |             0 |            3 |              0 |                         | 
+           1 |             0 |            4 |              0 |                         | 
+           1 |             0 |            5 |              0 |                         | 
+           1 |             0 |            6 |              0 |                         | 
+           1 |             0 |            7 |              0 |                         | 
+           1 |             0 |            8 |              0 |                         | 
+           1 |             0 |            9 |              0 |                         | 
+           1 |             0 |           10 |              0 |                         | 
+           1 |             0 |           11 |              0 |                         | 
+           1 |             0 |           12 |              0 |                         | 
+           1 |             0 |           13 |              0 |                         | 
+           2 |             0 |            1 |              0 |                         | 
+           2 |             0 |            2 |              0 |                         | 
+           2 |             0 |            3 |              0 |                         | 
+           2 |             0 |            4 |              0 |                         | 
+           2 |             0 |            5 |              0 |                         | 
+           2 |             0 |            6 |              0 |                         | 
+           2 |             0 |            7 |              0 |                         | 
+           2 |             0 |            8 |              0 |                         | 
+           2 |             0 |            9 |              0 |                         | 
+           2 |             0 |           10 |              0 |                         | 
+           2 |             0 |           11 |              0 |                         | 
+           2 |             0 |           12 |              0 |                         | 
+           2 |             0 |           13 |              0 |                         | 
+           3 |             0 |            1 |              0 |                         | 
+           3 |             0 |            2 |              0 |                         | 
+           3 |             0 |            3 |              0 |                         | 
+           3 |             0 |            4 |              0 |                         | 
+           3 |             0 |            5 |              0 |                         | 
+           3 |             0 |            6 |              0 |                         | 
+           3 |             0 |            7 |              0 |                         | 
+           3 |             0 |            8 |              0 |                         | 
+           3 |             0 |            9 |              0 |                         | 
+           3 |             0 |           10 |              0 |                         | 
+           3 |             0 |           11 |              0 |                         | 
+           3 |             0 |           12 |              0 |                         | 
+           3 |             0 |           13 |              0 |                         | 
+           4 |             0 |            1 |              0 |                         | 
+           4 |             0 |            2 |              0 |                         | 
+           4 |             0 |            3 |              0 |                         | 
+           4 |             0 |            4 |              0 |                         | 
+           4 |             0 |            5 |              0 |                         | 
+           4 |             0 |            6 |              0 |                         | 
+           4 |             0 |            7 |              0 |                         | 
+           4 |             0 |            8 |              0 |                         | 
+           4 |             0 |            9 |              0 |                         | 
+           4 |             0 |           10 |              0 |                         | 
+           4 |             0 |           11 |              0 |                         | 
+           4 |             0 |           12 |              0 |                         | 
+           4 |             0 |           13 |              0 |                         | 
+           5 |             0 |            1 |              0 |                         | 
+           5 |             0 |            2 |              0 |                         | 
+           5 |             0 |            3 |              0 |                         | 
+           5 |             0 |            4 |              0 |                         | 
+           5 |             0 |            5 |              0 |                         | 
+           5 |             0 |            6 |              0 |                         | 
+           5 |             0 |            7 |              0 |                         | 
+           5 |             0 |            8 |              0 |                         | 
+           5 |             0 |            9 |              0 |                         | 
+           5 |             0 |           10 |              0 |                         | 
+           5 |             0 |           11 |              0 |                         | 
+           5 |             0 |           12 |              0 |                         | 
+           5 |             0 |           13 |              0 |                         | 
+           6 |             0 |            1 |              0 |                         | 
+           6 |             0 |            2 |              0 |                         | 
+           6 |             0 |            3 |              0 |                         | 
+           6 |             0 |            4 |              0 |                         | 
+           6 |             0 |            5 |              0 |                         | 
+           6 |             0 |            6 |              0 |                         | 
+           6 |             0 |            7 |              0 |                         | 
+           6 |             0 |            8 |              0 |                         | 
+           6 |             0 |            9 |              0 |                         | 
+           6 |             0 |           10 |              0 |                         | 
+           6 |             0 |           11 |              0 |                         | 
+           6 |             0 |           12 |              0 |                         | 
+           6 |             0 |           13 |              0 |                         | 
+           7 |             0 |            1 |              0 |                         | 
+           7 |             0 |            2 |              0 |                         | 
+           7 |             0 |            3 |              0 |                         | 
+           7 |             0 |            4 |              0 |                         | 
+           7 |             0 |            5 |              0 |                         | 
+           7 |             0 |            6 |              0 |                         | 
+           7 |             0 |            7 |              0 |                         | 
+           7 |             0 |            8 |              0 |                         | 
+           7 |             0 |            9 |              0 |                         | 
+           7 |             0 |           10 |              0 |                         | 
+           7 |             0 |           11 |              0 |                         | 
+           7 |             0 |           12 |              0 |                         | 
+           7 |             0 |           13 |              0 |                         | 
+           8 |             0 |            1 |              0 |                         | 
+           8 |             0 |            2 |              0 |                         | 
+           8 |             0 |            3 |              0 |                         | 
+           8 |             0 |            4 |              0 |                         | 
+           8 |             0 |            5 |              0 |                         | 
+           8 |             0 |            6 |              0 |                         | 
+           8 |             0 |            7 |              0 |                         | 
+           8 |             0 |            8 |              0 |                         | 
+           8 |             0 |            9 |              0 |                         | 
+           8 |             0 |           10 |              0 |                         | 
+           8 |             0 |           11 |              0 |                         | 
+           8 |             0 |           12 |              0 |                         | 
+           8 |             0 |           13 |              0 |                         | 
+           9 |             0 |            1 |              0 |                         | 
+           9 |             0 |            2 |              0 |                         | 
+           9 |             0 |            3 |              0 |                         | 
+           9 |             0 |            4 |              0 |                         | 
+           9 |             0 |            5 |              0 |                         | 
+           9 |             0 |            6 |              0 |                         | 
+           9 |             0 |            7 |              0 |                         | 
+           9 |             0 |            8 |              0 |                         | 
+           9 |             0 |            9 |              0 |                         | 
+           9 |             0 |           10 |              0 |                         | 
+           9 |             0 |           11 |              0 |                         | 
+           9 |             0 |           12 |              0 |                         | 
+           9 |             0 |           13 |              0 |                         | 
+          10 |             0 |            1 |              0 |                         | 
+          10 |             0 |            2 |              0 |                         | 
+          10 |             0 |            3 |              0 |                         | 
+          10 |             0 |            4 |              0 |                         | 
+          10 |             0 |            5 |              0 |                         | 
+          10 |             0 |            6 |              0 |                         | 
+          10 |             0 |            7 |              0 |                         | 
+          10 |             0 |            8 |              0 |                         | 
+          10 |             0 |            9 |              0 |                         | 
+          10 |             0 |           10 |              0 |                         | 
+          10 |             0 |           11 |              0 |                         | 
+          10 |             0 |           12 |              0 |                         | 
+          10 |             0 |           13 |              0 |                         | 
+          11 |             0 |            1 |              0 |                         | 
+          11 |             0 |            2 |              0 |                         | 
+          11 |             0 |            3 |              0 |                         | 
+          11 |             0 |            4 |              0 |                         | 
+          11 |             0 |            5 |              0 |                         | 
+          11 |             0 |            6 |              0 |                         | 
+          11 |             0 |            7 |              0 |                         | 
+          11 |             0 |            8 |              0 |                         | 
+          11 |             0 |            9 |              0 |                         | 
+          11 |             0 |           10 |              0 |                         | 
+          11 |             0 |           11 |              0 |                         | 
+          11 |             0 |           12 |              0 |                         | 
+          11 |             0 |           13 |              0 |                         | 
+          12 |             0 |            1 |              0 |                         | 
+          12 |             0 |            2 |              0 |                         | 
+          12 |             0 |            3 |              0 |                         | 
+          12 |             0 |            4 |              0 |                         | 
+          12 |             0 |            5 |              0 |                         | 
+          12 |             0 |            6 |              0 |                         | 
+          12 |             0 |            7 |              0 |                         | 
+          12 |             0 |            8 |              0 |                         | 
+          12 |             0 |            9 |              0 |                         | 
+          12 |             0 |           10 |              0 |                         | 
+          12 |             0 |           11 |              0 |                         | 
+          12 |             0 |           12 |              0 |                         | 
+          12 |             0 |           13 |              0 |                         | 
+          13 |             0 |            1 |              0 |                         | 
+          13 |             0 |            2 |              0 |                         | 
+          13 |             0 |            3 |              0 |                         | 
+          13 |             0 |            4 |              0 |                         | 
+          13 |             0 |            5 |              0 |                         | 
+          13 |             0 |            6 |              0 |                         | 
+          13 |             0 |            7 |              0 |                         | 
+          13 |             0 |            8 |              0 |                         | 
+          13 |             0 |            9 |              0 |                         | 
+          13 |             0 |           10 |              0 |                         | 
+          13 |             0 |           11 |              0 |                         | 
+          13 |             0 |           12 |              0 |                         | 
+          13 |             0 |           13 |              0 |                         | 
+          14 |             0 |            1 |              0 |                         | 
+          14 |             0 |            2 |              0 |                         | 
+          14 |             0 |            3 |              0 |                         | 
+          14 |             0 |            4 |              0 |                         | 
+          14 |             0 |            5 |              0 |                         | 
+          14 |             0 |            6 |              0 |                         | 
+          14 |             0 |            7 |              0 |                         | 
+          14 |             0 |            8 |              0 |                         | 
+          14 |             0 |            9 |              0 |                         | 
+          14 |             0 |           10 |              0 |                         | 
+          14 |             0 |           11 |              0 |                         | 
+          14 |             0 |           12 |              0 |                         | 
+          14 |             0 |           13 |              0 |                         | 
+          15 |             0 |            1 |              0 |                         | 
+          15 |             0 |            2 |              0 |                         | 
+          15 |             0 |            3 |              0 |                         | 
+          15 |             0 |            4 |              0 |                         | 
+          15 |             0 |            5 |              0 |                         | 
+          15 |             0 |            6 |              0 |                         | 
+          15 |             0 |            7 |              0 |                         | 
+          15 |             0 |            8 |              0 |                         | 
+          15 |             0 |            9 |              0 |                         | 
+          15 |             0 |           10 |              0 |                         | 
+          15 |             0 |           11 |              0 |                         | 
+          15 |             0 |           12 |              0 |                         | 
+          15 |             0 |           13 |              0 |                         | 
+          16 |             0 |            1 |              0 |                         | 
+          16 |             0 |            2 |              0 |                         | 
+          16 |             0 |            3 |              0 |                         | 
+          16 |             0 |            4 |              0 |                         | 
+          16 |             0 |            5 |              0 |                         | 
+          16 |             0 |            6 |              0 |                         | 
+          16 |             0 |            7 |              0 |                         | 
+          16 |             0 |            8 |              0 |                         | 
+          16 |             0 |            9 |              0 |                         | 
+          16 |             0 |           10 |              0 |                         | 
+          16 |             0 |           11 |              0 |                         | 
+          16 |             0 |           12 |              0 |                         | 
+          16 |             0 |           13 |              0 |                         | 
+          17 |             0 |            1 |              0 |                         | 
+          17 |             0 |            2 |              0 |                         | 
+          17 |             0 |            3 |              0 |                         | 
+          17 |             0 |            4 |              0 |                         | 
+          17 |             0 |            5 |              0 |                         | 
+          17 |             0 |            6 |              0 |                         | 
+          17 |             0 |            7 |              0 |                         | 
+          17 |             0 |            8 |              0 |                         | 
+          17 |             0 |            9 |              0 |                         | 
+          17 |             0 |           10 |              0 |                         | 
+          17 |             0 |           11 |              0 |                         | 
+          17 |             0 |           12 |              0 |                         | 
+          17 |             0 |           13 |              0 |                         | 
+          18 |             0 |            1 |              0 |                         | 
+          18 |             0 |            2 |              0 |                         | 
+          18 |             0 |            3 |              0 |                         | 
+          18 |             0 |            4 |              0 |                         | 
+          18 |             0 |            5 |              0 |                         | 
+          18 |             0 |            6 |              0 |                         | 
+          18 |             0 |            7 |              0 |                         | 
+          18 |             0 |            8 |              0 |                         | 
+          18 |             0 |            9 |              0 |                         | 
+          18 |             0 |           10 |              0 |                         | 
+          18 |             0 |           11 |              0 |                         | 
+          18 |             0 |           12 |              0 |                         | 
+          18 |             0 |           13 |              0 |                         | 
+          19 |             0 |            1 |              0 |                         | 
+          19 |             0 |            2 |              0 |                         | 
+          19 |             0 |            3 |              0 |                         | 
+          19 |             0 |            4 |              0 |                         | 
+          19 |             0 |            5 |              0 |                         | 
+          19 |             0 |            6 |              0 |                         | 
+          19 |             0 |            7 |              0 |                         | 
+          19 |             0 |            8 |              0 |                         | 
+          19 |             0 |            9 |              0 |                         | 
+          19 |             0 |           10 |              0 |                         | 
+          19 |             0 |           11 |              0 |                         | 
+          19 |             0 |           12 |              0 |                         | 
+          19 |             0 |           13 |              0 |                         | 
+          20 |             0 |            1 |              0 |                         | 
+          20 |             0 |            2 |              0 |                         | 
+          20 |             0 |            3 |              0 |                         | 
+          20 |             0 |            4 |              0 |                         | 
+          20 |             0 |            5 |              0 |                         | 
+          20 |             0 |            6 |              0 |                         | 
+          20 |             0 |            7 |              0 |                         | 
+          20 |             0 |            8 |              0 |                         | 
+          20 |             0 |            9 |              0 |                         | 
+          20 |             0 |           10 |              0 |                         | 
+          20 |             0 |           11 |              0 |                         | 
+          20 |             0 |           12 |              0 |                         | 
+          20 |             0 |           13 |              0 |                         | 
+          21 |             0 |            1 |              0 |                         | 
+          21 |             0 |            2 |              0 |                         | 
+          21 |             0 |            3 |              0 |                         | 
+          21 |             0 |            4 |              0 |                         | 
+          21 |             0 |            5 |              0 |                         | 
+          21 |             0 |            6 |              0 |                         | 
+          21 |             0 |            7 |              0 |                         | 
+          21 |             0 |            8 |              0 |                         | 
+          21 |             0 |            9 |              0 |                         | 
+          21 |             0 |           10 |              0 |                         | 
+          21 |             0 |           11 |              0 |                         | 
+          21 |             0 |           12 |              0 |                         | 
+          21 |             0 |           13 |              0 |                         | 
+          22 |             0 |            1 |              0 |                         | 
+          22 |             0 |            2 |              0 |                         | 
+          22 |             0 |            3 |              0 |                         | 
+          22 |             0 |            4 |              0 |                         | 
+          22 |             0 |            5 |              0 |                         | 
+          22 |             0 |            6 |              0 |                         | 
+          22 |             0 |            7 |              0 |                         | 
+          22 |             0 |            8 |              0 |                         | 
+          22 |             0 |            9 |              0 |                         | 
+          22 |             0 |           10 |              0 |                         | 
+          22 |             0 |           11 |              0 |                         | 
+          22 |             0 |           12 |              0 |                         | 
+          22 |             0 |           13 |              0 |                         | 
+          23 |             0 |            1 |              0 |                         | 
+          23 |             0 |            2 |              0 |                         | 
+          23 |             0 |            3 |              0 |                         | 
+          23 |             0 |            4 |              0 |                         | 
+          23 |             0 |            5 |              0 |                         | 
+          23 |             0 |            6 |              0 |                         | 
+          23 |             0 |            7 |              0 |                         | 
+          23 |             0 |            8 |              0 |                         | 
+          23 |             0 |            9 |              0 |                         | 
+          23 |             0 |           10 |              0 |                         | 
+          23 |             0 |           11 |              0 |                         | 
+          23 |             0 |           12 |              0 |                         | 
+          23 |             0 |           13 |              0 |                         | 
+          24 |             0 |            1 |              0 |                         | 
+          24 |             0 |            2 |              0 |                         | 
+          24 |             0 |            3 |              0 |                         | 
+          24 |             0 |            4 |              0 |                         | 
+          24 |             0 |            5 |              0 |                         | 
+          24 |             0 |            6 |              0 |                         | 
+          24 |             0 |            7 |              0 |                         | 
+          24 |             0 |            8 |              0 |                         | 
+          24 |             0 |            9 |              0 |                         | 
+          24 |             0 |           10 |              0 |                         | 
+          24 |             0 |           11 |              0 |                         | 
+          24 |             0 |           12 |              0 |                         | 
+          24 |             0 |           13 |              0 |                         | 
+          25 |             0 |            1 |              0 |                         | 
+          25 |             0 |            2 |              0 |                         | 
+          25 |             0 |            3 |              0 |                         | 
+          25 |             0 |            4 |              0 |                         | 
+          25 |             0 |            5 |              0 |                         | 
+          25 |             0 |            6 |              0 |                         | 
+          25 |             0 |            7 |              0 |                         | 
+          25 |             0 |            8 |              0 |                         | 
+          25 |             0 |            9 |              0 |                         | 
+          25 |             0 |           10 |              0 |                         | 
+          25 |             0 |           11 |              0 |                         | 
+          25 |             0 |           12 |              0 |                         | 
+          25 |             0 |           13 |              0 |                         | 
+          26 |             0 |            1 |              0 |                         | 
+          26 |             0 |            2 |              0 |                         | 
+          26 |             0 |            3 |              0 |                         | 
+          26 |             0 |            4 |              0 |                         | 
+          26 |             0 |            5 |              0 |                         | 
+          26 |             0 |            6 |              0 |                         | 
+          26 |             0 |            7 |              0 |                         | 
+          26 |             0 |            8 |              0 |                         | 
+          26 |             0 |            9 |              0 |                         | 
+          26 |             0 |           10 |              0 |                         | 
+          26 |             0 |           11 |              0 |                         | 
+          26 |             0 |           12 |              0 |                         | 
+          26 |             0 |           13 |              0 |                         | 
+          27 |             0 |            1 |              0 |                         | 
+          27 |             0 |            2 |              0 |                         | 
+          27 |             0 |            3 |              0 |                         | 
+          27 |             0 |            4 |              0 |                         | 
+          27 |             0 |            5 |              0 |                         | 
+          27 |             0 |            6 |              0 |                         | 
+          27 |             0 |            7 |              0 |                         | 
+          27 |             0 |            8 |              0 |                         | 
+          27 |             0 |            9 |              0 |                         | 
+          27 |             0 |           10 |              0 |                         | 
+          27 |             0 |           11 |              0 |                         | 
+          27 |             0 |           12 |              0 |                         | 
+          27 |             0 |           13 |              0 |                         | 
+             |               |           14 |              0 | unmatched outer         | 
+(352 rows)
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+savepoint settings;
+set max_parallel_workers_per_gather = 1;
+set enable_parallel_hash = on;
+set work_mem = '64kB';
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a);
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Gather (actual rows=469 loops=1)
+   Workers Planned: 1
+   Workers Launched: 1
+   ->  Parallel Hash Left Join (actual rows=234 loops=2)
+         Hash Cond: (probeside_batch0.a = hashside_wide_batch0.a)
+         ->  Parallel Seq Scan on probeside_batch0 (actual rows=14 loops=1)
+         ->  Parallel Hash (actual rows=18 loops=2)
+               Buckets: 8 (originally 8)  Batches: 16 (originally 8)
+               Batch: 0  Stripes: 5
+               ->  Parallel Seq Scan on hashside_wide_batch0 (actual rows=36 loops=1)
+(10 rows)
+
+SELECT
+       hashside_wide_batch0.id as hashside_id, 
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id, 
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value 
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+ hashside_id | hashside_hash | probeside_id | probeside_hash | probeside_trimmed_value | hashside_trimmed_value 
+-------------+---------------+--------------+----------------+-------------------------+------------------------
+           1 |             0 |            1 |              0 |                         | 
+           1 |             0 |            2 |              0 |                         | 
+           1 |             0 |            3 |              0 |                         | 
+           1 |             0 |            4 |              0 |                         | 
+           1 |             0 |            5 |              0 |                         | 
+           1 |             0 |            6 |              0 |                         | 
+           1 |             0 |            7 |              0 |                         | 
+           1 |             0 |            8 |              0 |                         | 
+           1 |             0 |            9 |              0 |                         | 
+           1 |             0 |           10 |              0 |                         | 
+           1 |             0 |           11 |              0 |                         | 
+           1 |             0 |           12 |              0 |                         | 
+           1 |             0 |           13 |              0 |                         | 
+           2 |             0 |            1 |              0 |                         | 
+           2 |             0 |            2 |              0 |                         | 
+           2 |             0 |            3 |              0 |                         | 
+           2 |             0 |            4 |              0 |                         | 
+           2 |             0 |            5 |              0 |                         | 
+           2 |             0 |            6 |              0 |                         | 
+           2 |             0 |            7 |              0 |                         | 
+           2 |             0 |            8 |              0 |                         | 
+           2 |             0 |            9 |              0 |                         | 
+           2 |             0 |           10 |              0 |                         | 
+           2 |             0 |           11 |              0 |                         | 
+           2 |             0 |           12 |              0 |                         | 
+           2 |             0 |           13 |              0 |                         | 
+           3 |             0 |            1 |              0 |                         | 
+           3 |             0 |            2 |              0 |                         | 
+           3 |             0 |            3 |              0 |                         | 
+           3 |             0 |            4 |              0 |                         | 
+           3 |             0 |            5 |              0 |                         | 
+           3 |             0 |            6 |              0 |                         | 
+           3 |             0 |            7 |              0 |                         | 
+           3 |             0 |            8 |              0 |                         | 
+           3 |             0 |            9 |              0 |                         | 
+           3 |             0 |           10 |              0 |                         | 
+           3 |             0 |           11 |              0 |                         | 
+           3 |             0 |           12 |              0 |                         | 
+           3 |             0 |           13 |              0 |                         | 
+           4 |             0 |            1 |              0 |                         | 
+           4 |             0 |            2 |              0 |                         | 
+           4 |             0 |            3 |              0 |                         | 
+           4 |             0 |            4 |              0 |                         | 
+           4 |             0 |            5 |              0 |                         | 
+           4 |             0 |            6 |              0 |                         | 
+           4 |             0 |            7 |              0 |                         | 
+           4 |             0 |            8 |              0 |                         | 
+           4 |             0 |            9 |              0 |                         | 
+           4 |             0 |           10 |              0 |                         | 
+           4 |             0 |           11 |              0 |                         | 
+           4 |             0 |           12 |              0 |                         | 
+           4 |             0 |           13 |              0 |                         | 
+           5 |             0 |            1 |              0 |                         | 
+           5 |             0 |            2 |              0 |                         | 
+           5 |             0 |            3 |              0 |                         | 
+           5 |             0 |            4 |              0 |                         | 
+           5 |             0 |            5 |              0 |                         | 
+           5 |             0 |            6 |              0 |                         | 
+           5 |             0 |            7 |              0 |                         | 
+           5 |             0 |            8 |              0 |                         | 
+           5 |             0 |            9 |              0 |                         | 
+           5 |             0 |           10 |              0 |                         | 
+           5 |             0 |           11 |              0 |                         | 
+           5 |             0 |           12 |              0 |                         | 
+           5 |             0 |           13 |              0 |                         | 
+           6 |             0 |            1 |              0 |                         | 
+           6 |             0 |            2 |              0 |                         | 
+           6 |             0 |            3 |              0 |                         | 
+           6 |             0 |            4 |              0 |                         | 
+           6 |             0 |            5 |              0 |                         | 
+           6 |             0 |            6 |              0 |                         | 
+           6 |             0 |            7 |              0 |                         | 
+           6 |             0 |            8 |              0 |                         | 
+           6 |             0 |            9 |              0 |                         | 
+           6 |             0 |           10 |              0 |                         | 
+           6 |             0 |           11 |              0 |                         | 
+           6 |             0 |           12 |              0 |                         | 
+           6 |             0 |           13 |              0 |                         | 
+           7 |             0 |            1 |              0 |                         | 
+           7 |             0 |            2 |              0 |                         | 
+           7 |             0 |            3 |              0 |                         | 
+           7 |             0 |            4 |              0 |                         | 
+           7 |             0 |            5 |              0 |                         | 
+           7 |             0 |            6 |              0 |                         | 
+           7 |             0 |            7 |              0 |                         | 
+           7 |             0 |            8 |              0 |                         | 
+           7 |             0 |            9 |              0 |                         | 
+           7 |             0 |           10 |              0 |                         | 
+           7 |             0 |           11 |              0 |                         | 
+           7 |             0 |           12 |              0 |                         | 
+           7 |             0 |           13 |              0 |                         | 
+           8 |             0 |            1 |              0 |                         | 
+           8 |             0 |            2 |              0 |                         | 
+           8 |             0 |            3 |              0 |                         | 
+           8 |             0 |            4 |              0 |                         | 
+           8 |             0 |            5 |              0 |                         | 
+           8 |             0 |            6 |              0 |                         | 
+           8 |             0 |            7 |              0 |                         | 
+           8 |             0 |            8 |              0 |                         | 
+           8 |             0 |            9 |              0 |                         | 
+           8 |             0 |           10 |              0 |                         | 
+           8 |             0 |           11 |              0 |                         | 
+           8 |             0 |           12 |              0 |                         | 
+           8 |             0 |           13 |              0 |                         | 
+           9 |             0 |            1 |              0 |                         | 
+           9 |             0 |            2 |              0 |                         | 
+           9 |             0 |            3 |              0 |                         | 
+           9 |             0 |            4 |              0 |                         | 
+           9 |             0 |            5 |              0 |                         | 
+           9 |             0 |            6 |              0 |                         | 
+           9 |             0 |            7 |              0 |                         | 
+           9 |             0 |            8 |              0 |                         | 
+           9 |             0 |            9 |              0 |                         | 
+           9 |             0 |           10 |              0 |                         | 
+           9 |             0 |           11 |              0 |                         | 
+           9 |             0 |           12 |              0 |                         | 
+           9 |             0 |           13 |              0 |                         | 
+          10 |             0 |            1 |              0 |                         | 
+          10 |             0 |            2 |              0 |                         | 
+          10 |             0 |            3 |              0 |                         | 
+          10 |             0 |            4 |              0 |                         | 
+          10 |             0 |            5 |              0 |                         | 
+          10 |             0 |            6 |              0 |                         | 
+          10 |             0 |            7 |              0 |                         | 
+          10 |             0 |            8 |              0 |                         | 
+          10 |             0 |            9 |              0 |                         | 
+          10 |             0 |           10 |              0 |                         | 
+          10 |             0 |           11 |              0 |                         | 
+          10 |             0 |           12 |              0 |                         | 
+          10 |             0 |           13 |              0 |                         | 
+          11 |             0 |            1 |              0 |                         | 
+          11 |             0 |            2 |              0 |                         | 
+          11 |             0 |            3 |              0 |                         | 
+          11 |             0 |            4 |              0 |                         | 
+          11 |             0 |            5 |              0 |                         | 
+          11 |             0 |            6 |              0 |                         | 
+          11 |             0 |            7 |              0 |                         | 
+          11 |             0 |            8 |              0 |                         | 
+          11 |             0 |            9 |              0 |                         | 
+          11 |             0 |           10 |              0 |                         | 
+          11 |             0 |           11 |              0 |                         | 
+          11 |             0 |           12 |              0 |                         | 
+          11 |             0 |           13 |              0 |                         | 
+          12 |             0 |            1 |              0 |                         | 
+          12 |             0 |            2 |              0 |                         | 
+          12 |             0 |            3 |              0 |                         | 
+          12 |             0 |            4 |              0 |                         | 
+          12 |             0 |            5 |              0 |                         | 
+          12 |             0 |            6 |              0 |                         | 
+          12 |             0 |            7 |              0 |                         | 
+          12 |             0 |            8 |              0 |                         | 
+          12 |             0 |            9 |              0 |                         | 
+          12 |             0 |           10 |              0 |                         | 
+          12 |             0 |           11 |              0 |                         | 
+          12 |             0 |           12 |              0 |                         | 
+          12 |             0 |           13 |              0 |                         | 
+          13 |             0 |            1 |              0 |                         | 
+          13 |             0 |            2 |              0 |                         | 
+          13 |             0 |            3 |              0 |                         | 
+          13 |             0 |            4 |              0 |                         | 
+          13 |             0 |            5 |              0 |                         | 
+          13 |             0 |            6 |              0 |                         | 
+          13 |             0 |            7 |              0 |                         | 
+          13 |             0 |            8 |              0 |                         | 
+          13 |             0 |            9 |              0 |                         | 
+          13 |             0 |           10 |              0 |                         | 
+          13 |             0 |           11 |              0 |                         | 
+          13 |             0 |           12 |              0 |                         | 
+          13 |             0 |           13 |              0 |                         | 
+          14 |             0 |            1 |              0 |                         | 
+          14 |             0 |            2 |              0 |                         | 
+          14 |             0 |            3 |              0 |                         | 
+          14 |             0 |            4 |              0 |                         | 
+          14 |             0 |            5 |              0 |                         | 
+          14 |             0 |            6 |              0 |                         | 
+          14 |             0 |            7 |              0 |                         | 
+          14 |             0 |            8 |              0 |                         | 
+          14 |             0 |            9 |              0 |                         | 
+          14 |             0 |           10 |              0 |                         | 
+          14 |             0 |           11 |              0 |                         | 
+          14 |             0 |           12 |              0 |                         | 
+          14 |             0 |           13 |              0 |                         | 
+          15 |             0 |            1 |              0 |                         | 
+          15 |             0 |            2 |              0 |                         | 
+          15 |             0 |            3 |              0 |                         | 
+          15 |             0 |            4 |              0 |                         | 
+          15 |             0 |            5 |              0 |                         | 
+          15 |             0 |            6 |              0 |                         | 
+          15 |             0 |            7 |              0 |                         | 
+          15 |             0 |            8 |              0 |                         | 
+          15 |             0 |            9 |              0 |                         | 
+          15 |             0 |           10 |              0 |                         | 
+          15 |             0 |           11 |              0 |                         | 
+          15 |             0 |           12 |              0 |                         | 
+          15 |             0 |           13 |              0 |                         | 
+          16 |             0 |            1 |              0 |                         | 
+          16 |             0 |            2 |              0 |                         | 
+          16 |             0 |            3 |              0 |                         | 
+          16 |             0 |            4 |              0 |                         | 
+          16 |             0 |            5 |              0 |                         | 
+          16 |             0 |            6 |              0 |                         | 
+          16 |             0 |            7 |              0 |                         | 
+          16 |             0 |            8 |              0 |                         | 
+          16 |             0 |            9 |              0 |                         | 
+          16 |             0 |           10 |              0 |                         | 
+          16 |             0 |           11 |              0 |                         | 
+          16 |             0 |           12 |              0 |                         | 
+          16 |             0 |           13 |              0 |                         | 
+          17 |             0 |            1 |              0 |                         | 
+          17 |             0 |            2 |              0 |                         | 
+          17 |             0 |            3 |              0 |                         | 
+          17 |             0 |            4 |              0 |                         | 
+          17 |             0 |            5 |              0 |                         | 
+          17 |             0 |            6 |              0 |                         | 
+          17 |             0 |            7 |              0 |                         | 
+          17 |             0 |            8 |              0 |                         | 
+          17 |             0 |            9 |              0 |                         | 
+          17 |             0 |           10 |              0 |                         | 
+          17 |             0 |           11 |              0 |                         | 
+          17 |             0 |           12 |              0 |                         | 
+          17 |             0 |           13 |              0 |                         | 
+          18 |             0 |            1 |              0 |                         | 
+          18 |             0 |            2 |              0 |                         | 
+          18 |             0 |            3 |              0 |                         | 
+          18 |             0 |            4 |              0 |                         | 
+          18 |             0 |            5 |              0 |                         | 
+          18 |             0 |            6 |              0 |                         | 
+          18 |             0 |            7 |              0 |                         | 
+          18 |             0 |            8 |              0 |                         | 
+          18 |             0 |            9 |              0 |                         | 
+          18 |             0 |           10 |              0 |                         | 
+          18 |             0 |           11 |              0 |                         | 
+          18 |             0 |           12 |              0 |                         | 
+          18 |             0 |           13 |              0 |                         | 
+          19 |             0 |            1 |              0 |                         | 
+          19 |             0 |            2 |              0 |                         | 
+          19 |             0 |            3 |              0 |                         | 
+          19 |             0 |            4 |              0 |                         | 
+          19 |             0 |            5 |              0 |                         | 
+          19 |             0 |            6 |              0 |                         | 
+          19 |             0 |            7 |              0 |                         | 
+          19 |             0 |            8 |              0 |                         | 
+          19 |             0 |            9 |              0 |                         | 
+          19 |             0 |           10 |              0 |                         | 
+          19 |             0 |           11 |              0 |                         | 
+          19 |             0 |           12 |              0 |                         | 
+          19 |             0 |           13 |              0 |                         | 
+          20 |             0 |            1 |              0 |                         | 
+          20 |             0 |            2 |              0 |                         | 
+          20 |             0 |            3 |              0 |                         | 
+          20 |             0 |            4 |              0 |                         | 
+          20 |             0 |            5 |              0 |                         | 
+          20 |             0 |            6 |              0 |                         | 
+          20 |             0 |            7 |              0 |                         | 
+          20 |             0 |            8 |              0 |                         | 
+          20 |             0 |            9 |              0 |                         | 
+          20 |             0 |           10 |              0 |                         | 
+          20 |             0 |           11 |              0 |                         | 
+          20 |             0 |           12 |              0 |                         | 
+          20 |             0 |           13 |              0 |                         | 
+          21 |             0 |            1 |              0 |                         | 
+          21 |             0 |            2 |              0 |                         | 
+          21 |             0 |            3 |              0 |                         | 
+          21 |             0 |            4 |              0 |                         | 
+          21 |             0 |            5 |              0 |                         | 
+          21 |             0 |            6 |              0 |                         | 
+          21 |             0 |            7 |              0 |                         | 
+          21 |             0 |            8 |              0 |                         | 
+          21 |             0 |            9 |              0 |                         | 
+          21 |             0 |           10 |              0 |                         | 
+          21 |             0 |           11 |              0 |                         | 
+          21 |             0 |           12 |              0 |                         | 
+          21 |             0 |           13 |              0 |                         | 
+          22 |             0 |            1 |              0 |                         | 
+          22 |             0 |            2 |              0 |                         | 
+          22 |             0 |            3 |              0 |                         | 
+          22 |             0 |            4 |              0 |                         | 
+          22 |             0 |            5 |              0 |                         | 
+          22 |             0 |            6 |              0 |                         | 
+          22 |             0 |            7 |              0 |                         | 
+          22 |             0 |            8 |              0 |                         | 
+          22 |             0 |            9 |              0 |                         | 
+          22 |             0 |           10 |              0 |                         | 
+          22 |             0 |           11 |              0 |                         | 
+          22 |             0 |           12 |              0 |                         | 
+          22 |             0 |           13 |              0 |                         | 
+          23 |             0 |            1 |              0 |                         | 
+          23 |             0 |            2 |              0 |                         | 
+          23 |             0 |            3 |              0 |                         | 
+          23 |             0 |            4 |              0 |                         | 
+          23 |             0 |            5 |              0 |                         | 
+          23 |             0 |            6 |              0 |                         | 
+          23 |             0 |            7 |              0 |                         | 
+          23 |             0 |            8 |              0 |                         | 
+          23 |             0 |            9 |              0 |                         | 
+          23 |             0 |           10 |              0 |                         | 
+          23 |             0 |           11 |              0 |                         | 
+          23 |             0 |           12 |              0 |                         | 
+          23 |             0 |           13 |              0 |                         | 
+          24 |             0 |            1 |              0 |                         | 
+          24 |             0 |            2 |              0 |                         | 
+          24 |             0 |            3 |              0 |                         | 
+          24 |             0 |            4 |              0 |                         | 
+          24 |             0 |            5 |              0 |                         | 
+          24 |             0 |            6 |              0 |                         | 
+          24 |             0 |            7 |              0 |                         | 
+          24 |             0 |            8 |              0 |                         | 
+          24 |             0 |            9 |              0 |                         | 
+          24 |             0 |           10 |              0 |                         | 
+          24 |             0 |           11 |              0 |                         | 
+          24 |             0 |           12 |              0 |                         | 
+          24 |             0 |           13 |              0 |                         | 
+          25 |             0 |            1 |              0 |                         | 
+          25 |             0 |            2 |              0 |                         | 
+          25 |             0 |            3 |              0 |                         | 
+          25 |             0 |            4 |              0 |                         | 
+          25 |             0 |            5 |              0 |                         | 
+          25 |             0 |            6 |              0 |                         | 
+          25 |             0 |            7 |              0 |                         | 
+          25 |             0 |            8 |              0 |                         | 
+          25 |             0 |            9 |              0 |                         | 
+          25 |             0 |           10 |              0 |                         | 
+          25 |             0 |           11 |              0 |                         | 
+          25 |             0 |           12 |              0 |                         | 
+          25 |             0 |           13 |              0 |                         | 
+          26 |             0 |            1 |              0 |                         | 
+          26 |             0 |            2 |              0 |                         | 
+          26 |             0 |            3 |              0 |                         | 
+          26 |             0 |            4 |              0 |                         | 
+          26 |             0 |            5 |              0 |                         | 
+          26 |             0 |            6 |              0 |                         | 
+          26 |             0 |            7 |              0 |                         | 
+          26 |             0 |            8 |              0 |                         | 
+          26 |             0 |            9 |              0 |                         | 
+          26 |             0 |           10 |              0 |                         | 
+          26 |             0 |           11 |              0 |                         | 
+          26 |             0 |           12 |              0 |                         | 
+          26 |             0 |           13 |              0 |                         | 
+          27 |             0 |            1 |              0 |                         | 
+          27 |             0 |            2 |              0 |                         | 
+          27 |             0 |            3 |              0 |                         | 
+          27 |             0 |            4 |              0 |                         | 
+          27 |             0 |            5 |              0 |                         | 
+          27 |             0 |            6 |              0 |                         | 
+          27 |             0 |            7 |              0 |                         | 
+          27 |             0 |            8 |              0 |                         | 
+          27 |             0 |            9 |              0 |                         | 
+          27 |             0 |           10 |              0 |                         | 
+          27 |             0 |           11 |              0 |                         | 
+          27 |             0 |           12 |              0 |                         | 
+          27 |             0 |           13 |              0 |                         | 
+          28 |             0 |            1 |              0 |                         | 
+          28 |             0 |            2 |              0 |                         | 
+          28 |             0 |            3 |              0 |                         | 
+          28 |             0 |            4 |              0 |                         | 
+          28 |             0 |            5 |              0 |                         | 
+          28 |             0 |            6 |              0 |                         | 
+          28 |             0 |            7 |              0 |                         | 
+          28 |             0 |            8 |              0 |                         | 
+          28 |             0 |            9 |              0 |                         | 
+          28 |             0 |           10 |              0 |                         | 
+          28 |             0 |           11 |              0 |                         | 
+          28 |             0 |           12 |              0 |                         | 
+          28 |             0 |           13 |              0 |                         | 
+          29 |             0 |            1 |              0 |                         | 
+          29 |             0 |            2 |              0 |                         | 
+          29 |             0 |            3 |              0 |                         | 
+          29 |             0 |            4 |              0 |                         | 
+          29 |             0 |            5 |              0 |                         | 
+          29 |             0 |            6 |              0 |                         | 
+          29 |             0 |            7 |              0 |                         | 
+          29 |             0 |            8 |              0 |                         | 
+          29 |             0 |            9 |              0 |                         | 
+          29 |             0 |           10 |              0 |                         | 
+          29 |             0 |           11 |              0 |                         | 
+          29 |             0 |           12 |              0 |                         | 
+          29 |             0 |           13 |              0 |                         | 
+          30 |             0 |            1 |              0 |                         | 
+          30 |             0 |            2 |              0 |                         | 
+          30 |             0 |            3 |              0 |                         | 
+          30 |             0 |            4 |              0 |                         | 
+          30 |             0 |            5 |              0 |                         | 
+          30 |             0 |            6 |              0 |                         | 
+          30 |             0 |            7 |              0 |                         | 
+          30 |             0 |            8 |              0 |                         | 
+          30 |             0 |            9 |              0 |                         | 
+          30 |             0 |           10 |              0 |                         | 
+          30 |             0 |           11 |              0 |                         | 
+          30 |             0 |           12 |              0 |                         | 
+          30 |             0 |           13 |              0 |                         | 
+          31 |             0 |            1 |              0 |                         | 
+          31 |             0 |            2 |              0 |                         | 
+          31 |             0 |            3 |              0 |                         | 
+          31 |             0 |            4 |              0 |                         | 
+          31 |             0 |            5 |              0 |                         | 
+          31 |             0 |            6 |              0 |                         | 
+          31 |             0 |            7 |              0 |                         | 
+          31 |             0 |            8 |              0 |                         | 
+          31 |             0 |            9 |              0 |                         | 
+          31 |             0 |           10 |              0 |                         | 
+          31 |             0 |           11 |              0 |                         | 
+          31 |             0 |           12 |              0 |                         | 
+          31 |             0 |           13 |              0 |                         | 
+          32 |             0 |            1 |              0 |                         | 
+          32 |             0 |            2 |              0 |                         | 
+          32 |             0 |            3 |              0 |                         | 
+          32 |             0 |            4 |              0 |                         | 
+          32 |             0 |            5 |              0 |                         | 
+          32 |             0 |            6 |              0 |                         | 
+          32 |             0 |            7 |              0 |                         | 
+          32 |             0 |            8 |              0 |                         | 
+          32 |             0 |            9 |              0 |                         | 
+          32 |             0 |           10 |              0 |                         | 
+          32 |             0 |           11 |              0 |                         | 
+          32 |             0 |           12 |              0 |                         | 
+          32 |             0 |           13 |              0 |                         | 
+          33 |             0 |            1 |              0 |                         | 
+          33 |             0 |            2 |              0 |                         | 
+          33 |             0 |            3 |              0 |                         | 
+          33 |             0 |            4 |              0 |                         | 
+          33 |             0 |            5 |              0 |                         | 
+          33 |             0 |            6 |              0 |                         | 
+          33 |             0 |            7 |              0 |                         | 
+          33 |             0 |            8 |              0 |                         | 
+          33 |             0 |            9 |              0 |                         | 
+          33 |             0 |           10 |              0 |                         | 
+          33 |             0 |           11 |              0 |                         | 
+          33 |             0 |           12 |              0 |                         | 
+          33 |             0 |           13 |              0 |                         | 
+          34 |             0 |            1 |              0 |                         | 
+          34 |             0 |            2 |              0 |                         | 
+          34 |             0 |            3 |              0 |                         | 
+          34 |             0 |            4 |              0 |                         | 
+          34 |             0 |            5 |              0 |                         | 
+          34 |             0 |            6 |              0 |                         | 
+          34 |             0 |            7 |              0 |                         | 
+          34 |             0 |            8 |              0 |                         | 
+          34 |             0 |            9 |              0 |                         | 
+          34 |             0 |           10 |              0 |                         | 
+          34 |             0 |           11 |              0 |                         | 
+          34 |             0 |           12 |              0 |                         | 
+          34 |             0 |           13 |              0 |                         | 
+          35 |             0 |            1 |              0 |                         | 
+          35 |             0 |            2 |              0 |                         | 
+          35 |             0 |            3 |              0 |                         | 
+          35 |             0 |            4 |              0 |                         | 
+          35 |             0 |            5 |              0 |                         | 
+          35 |             0 |            6 |              0 |                         | 
+          35 |             0 |            7 |              0 |                         | 
+          35 |             0 |            8 |              0 |                         | 
+          35 |             0 |            9 |              0 |                         | 
+          35 |             0 |           10 |              0 |                         | 
+          35 |             0 |           11 |              0 |                         | 
+          35 |             0 |           12 |              0 |                         | 
+          35 |             0 |           13 |              0 |                         | 
+          36 |             0 |            1 |              0 |                         | 
+          36 |             0 |            2 |              0 |                         | 
+          36 |             0 |            3 |              0 |                         | 
+          36 |             0 |            4 |              0 |                         | 
+          36 |             0 |            5 |              0 |                         | 
+          36 |             0 |            6 |              0 |                         | 
+          36 |             0 |            7 |              0 |                         | 
+          36 |             0 |            8 |              0 |                         | 
+          36 |             0 |            9 |              0 |                         | 
+          36 |             0 |           10 |              0 |                         | 
+          36 |             0 |           11 |              0 |                         | 
+          36 |             0 |           12 |              0 |                         | 
+          36 |             0 |           13 |              0 |                         | 
+             |               |           14 |              0 | unmatched outer         | 
+(469 rows)
+
+rollback to settings;
+rollback;
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 68c1a8c7b6..d9f8a115d8 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -450,22 +450,26 @@ rollback to settings;
 
 -- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
 -- sts_puttuple oversized tuple cases because it's multi-batch)
-savepoint settings;
-set max_parallel_workers_per_gather = 2;
-set enable_parallel_hash = on;
-set work_mem = '128kB';
-explain (costs off)
-  select length(max(s.t))
-  from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-select length(max(s.t))
-from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-select final > 1 as multibatch
-  from hash_join_batches(
-$$
-  select length(max(s.t))
-  from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-$$);
-rollback to settings;
+-- savepoint settings;
+-- set max_parallel_workers_per_gather = 2;
+-- set enable_parallel_hash = on;
+-- TODO: throw an error when this happens: cannot set work_mem lower than the side of a single tuple
+-- TODO: ensure that oversize tuple code is still exercised (should be with some of the stub stuff below)
+-- TODO: commented this out since it would crash otherwise
+-- this test is no longer multi-batch, so, perhaps, it should be removed
+-- set work_mem = '128kB';
+-- explain (costs off)
+--   select length(max(s.t))
+--   from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- select length(max(s.t))
+-- from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- select final > 1 as multibatch
+--   from hash_join_batches(
+-- $$
+--   select length(max(s.t))
+--   from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- $$);
+-- rollback to settings;
 
 rollback;
 
@@ -538,3 +542,181 @@ WHERE
     AND hjtest_1.a <> hjtest_2.b;
 
 ROLLBACK;
+
+-- Serial Adaptive Hash Join
+
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8090));
+
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+
+ANALYZE probeside, hashside_wide;
+
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+rollback to settings;
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(id int generated always as identity, a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0(a) SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0(a) SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide_batch0(id int generated always as identity, a stub);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+
+SELECT
+       hashside_wide_batch0.id as hashside_id, 
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id, 
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value 
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+
+savepoint settings;
+set max_parallel_workers_per_gather = 1;
+set enable_parallel_hash = on;
+set work_mem = '64kB';
+
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a);
+
+SELECT
+       hashside_wide_batch0.id as hashside_id, 
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id, 
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value 
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+rollback to settings;
+
+rollback;
-- 
2.20.1

#61

Michael Paquier

michael@paquier.xyz

over 5 years ago

In reply to: Melanie Plageman (#60)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

On Mon, Aug 31, 2020 at 03:13:06PM -0700, Melanie Plageman wrote:

Attached is the current version of adaptive hash join with two
significant changes as compared to v10:

The CF bot is complaining about a regression test failure:
@@ -2465,7 +2465,7 @@
  Gather (actual rows=469 loops=1)
    Workers Planned: 1
    Workers Launched: 1
-   ->  Parallel Hash Left Join (actual rows=234 loops=2)
+   ->  Parallel Hash Left Join (actual rows=235 loops=2)
--
Michael

#62

Alena Rybakina

a.rybakina@postgrespro.ru

about 1 year ago

In reply to: Melanie Plageman (#60)

1 attachment(s)

Re: Avoiding hash join batch explosions with extreme skew and weird stats

Hi!

Thank you for your work on this problem!

On 01.09.2020 01:13, Melanie Plageman wrote:

Attached is the current version of adaptive hash join with two
significant changes as compared to v10:

1) Implements spilling of batch 0 for parallel-aware parallel hash join.
2) Moves "striping" of fallback batches from "build" to "load" stage
It includes several smaller changes as well.

Batch 0 spilling is necessary when the hash table for batch 0 cannot fit
in memory and allows us to use the "hashloop" strategy for batch 0.

Spilling of batch 0 necessitated the addition of a few new pieces of
code. The most noticeable one is probably the hash table eviction phase
machine. If batch 0 was marked as a "fallback" batch in
ExecParallelHashIncreaseNumBatches() PHJ_GROW_BATCHES_DECIDING phase,
any future attempt to insert a tuple that would exceed the space_allowed
triggers eviction of the hash table.
ExecParallelHashTableEvictBatch0() will evict all batch 0 tuples in
memory into spill files in a batch 0 inner SharedTuplestore.

This means that when repartitioning batch 0 in the future, both the
batch 0 spill file and the hash table need to be drained and relocated
into the new generation of batches and the hash table. If enough memory
is freed up from batch 0 tuples relocating to other batches, then it is
possible that tuples from the batch 0 spill files will go back into the
hash table.
After batch 0 is evicted, the build stage proceeds as normal.

The main alternative to this design that we considered was to "close" the
hash table after it is full. That is, if batch 0 has been marked to fall
back, once it is full, all subsequent tuples pulled from the outer child
would bypass the hash table altogether and go directly into a spill
file.

We chose the hash table eviction route because I thought it might be
better to write chunks of the hashtable into a file together rather than
sporadically write new batch 0 tuples to spill files as they are
pulled out of the child node. However, since the same sts_puttuple() API
is used in both cases, it is highly possible this won't actually matter
and we will do the same amount of I/O.
Both designs involved changing the flow of the code for inserting and
repartitioning tuples, so I figured that I would choose one, do some
testing, and try the other one later after more discussion and review.

This patch also introduces a significant change to how tuples are split
into stripes. Previously, during the build stage, tuples were written to
spill files in the SharedTuplestore with a stripe number in the metadata
section of the MinimalTuple.
For a batch that had been designated a "fallback" batch,
once the space_allowed had been exhausted, the shared stripe number
would be incremented and the new stripe number was written in the tuple
metadata to the files. Then, during loading, tuples were only loaded
into the hashtable if their stripe number matched the current stripe
number.

This had several downsides. It introduced a couple new shared variables --
the current stripe number for the batch and its size.
In master, during the normal mode of the "build" stage, shared variables
for the size or estimated_size of the batch are checked on each
allocation of a STS Chunk or HashMemoryChunk, however, during
repartitioning, because bailing out early was not an option, workers
could use backend-local variables to keep track of size and merge them
at the end of repartitioning. This wasn't possible if we needed accurate
stripe numbers written into the tuples. This meant that we had to add
new shared variable accesses to repartitioning.

To avoid this, Deep and I worked on moving the "striping" logic from the
"build" stage to the "load" stage for batches. Serial hash join already
did striping in this way. This patch now pauses loading once the
space_allowed has been exhausted for parallel hash join as well. The
tricky part was keeping track of multiple read_pages for a given file.

When tuples had explicit stripe numbers, we simply rewound the read_page
in the SharedTuplestoreParticipant to the earliest SharedTuplestoreChunk
that anyone had read and relied on the stripe numbers to avoid loading
tuples more than once. Now, each worker participating in reading from
the SharedTuplestore could have received a read_page "assignment" (four
blocks, currently) and then failed to allocate a HashMemoryChunk. We
cannot risk rewinding the read_page because there could be
SharedTuplestoreChunks that have already been loaded in between ones
that have not.

The design we went with was to "overflow" the tuples from this
SharedTuplestoreChunk onto the end of the write_file which this worker
wrote--if it participated in writing this STS--or by making a new
write_file if it did not participate in writing. This entailed keeping
track of who participated in the write phase. SharedTuplestore
participation now has three "modes"-- reading, writing, and appending.
During appending, workers can write to their own file and read from any
file.

One of the alternative designs I considered was to store the offset and
length of leftover blocks that still needed to be loaded into the hash
table in the SharedTuplestoreParticipant data structure. Then, workers
would pick up these "assignments". It is basically a
SharedTuplestoreParticipant work queue.
The main stumbling block I faced here was allocating a variable number of
things in shared memory. You don't know how many read participants will
read from the file and how many stripes there will be (until you've
loaded the file). In the worst case, you would need space for
nparticipants * nstripes - 1 offset/length combos.
Since I don't know how many stripes I have until I've loaded the file, I
can't allocate shared memory for this up front.

The downside of the "append overflow" design is that, currently, all
workers participating in loading a fallback batch write an overflow
chunk for every fallback stripe.
It seems like something could be done to check if there is space in the
hashtable before accepting an assignment of blocks to read from the
SharedTuplestore and moving the shared variable read_page. It might
reduce instances in which workers have to overflow. However, I tried
this and it is very intrusive on the SharedTuplestore API (it would have
to know about the hash table). Also, oversized tuples would not be
addressed by this pre-assignment check since memory is allocated a
HashMemoryChunk at a time. So, even if this was solved, you would need
overflow functionality

One note is that I had to comment out a test in join_hash.sql which
inserts tuples larger than work_mem in size (each), because it no longer
successfully executes.
Also, the stripe number is not deterministic, so sometimes the tests that
compare fallback batches' number of stripes fail (also in join_hash.sql).

Major outstanding TODOs:
--
- Potential redesign of stripe loading pausing and resumption
- The instrumentation for parallel fallback batches has some problems
- Deadlock hazard avoidance design of the stripe barrier still needs work
- Assorted smaller TODOs in the code

I think this patch is essential and will save us from allocating an
incredibly large amount of memory when doing a hash join.
Unfortunately, we are not yet able to avoid the problems of incorrect
cardinality estimation and clearly estimate NULL elements, which can
lead to a large growth of the batch in the hash table. Recently, my
client just faced this problem - his system was allocated 50 GB of
memory when performing a hash join and only your patch helped to avoid
this. Luckily for me, he was using version 15 of Postgres, but his case
is reproduced with the same problem in newer versions.

I noticed that your patch is far behind the master and made an attempt
to rebase it to try to revive the discussion. But I'm stuck with a
problem of initialization of the cluster and I could not solve. Can you
take a look and tell me what is wrong here?

export CDIR=$(pwd)
export PGDATA=/home/alena/postgres_data11
my/inst/bin/pg_ctl -D $PGDATA -l logfile stop
rm -r $PGDATA
mkdir $PGDATA
my/inst/bin/initdb -D $PGDATA >> log.txt
my/inst/bin/pg_ctl -D $PGDATA -l logfile start

pg_ctl: directory "/home/alena/postgres_data11" is not a database
cluster directory
2024-11-10 20:44:40.598 MSK [20213] FATAL: duplicate key value violates
unique constraint "pg_description_o_c_o_index"
2024-11-10 20:44:40.598 MSK [20213] DETAIL: Key (objoid, classoid,
objsubid)=(378, 1255, 0) already exists.
2024-11-10 20:44:40.598 MSK [20213] STATEMENT:
        WITH funcdescs AS ( SELECT p.oid as p_oid, o.oid as o_oid,
oprname FROM pg_proc p JOIN pg_operator o ON oprcode = p.oid ) INSERT
INTO pg_description   SELECT p_oid, 'pg_proc'::regclass, 0,
'implementation of ' || oprname || ' operator'   FROM funcdescs   WHERE
NOT EXISTS (SELECT 1 FROM pg_description    WHERE objoid = p_oid AND
classoid = 'pg_proc'::regclass)   AND NOT EXISTS (SELECT 1 FROM
pg_description    WHERE objoid = o_oid AND classoid =
'pg_operator'::regclass         AND description LIKE 'deprecated%');

child process exited with exit code 1

--
Regards,
Alena Rybakina
Postgres Professional

Attachments:

skew_data.difftext/x-patch; charset=UTF-8; name=skew_data.diffDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7c0fd63b2f0..c9a3e495ee2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -219,6 +219,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			es->settings = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "generic_plan") == 0)
 			es->generic = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "usage") == 0)
+			es->usage = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "timing") == 0)
 		{
 			timing_set = true;
@@ -383,6 +385,7 @@ NewExplainState(void)
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
+	es->usage = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
@@ -3465,22 +3468,47 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
 				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: " UINT64_FORMAT "kB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)",
 							 hinstrument.nbuckets,
 							 hinstrument.nbuckets_original,
 							 hinstrument.nbatch,
-							 hinstrument.nbatch_original,
-							 spacePeakKb);
+							 hinstrument.nbatch_original);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Batch: %d  Stripes: %d\n", fbs->batchno, fbs->numstripes);
+			}
 		}
 		else
 		{
+			ListCell   *lc;
+
 			ExplainIndentText(es);
-			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d  Memory Usage: " UINT64_FORMAT "kB\n",
-							 hinstrument.nbuckets, hinstrument.nbatch,
-							 spacePeakKb);
+			if (es->usage)
+				appendStringInfo(es->str, "  Memory Usage: %ldkB\n", spacePeakKb);
+			else
+				appendStringInfo(es->str, "\n");
+			foreach(lc, hinstrument.fallback_batches_stats)
+			{
+				FallbackBatchStats *fbs = lfirst(lc);
+
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batch: %d  Stripes: %d\n",
+								 fbs->batchno,
+								 fbs->numstripes);
+			}
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3e22d50e3a4..43056f8c5c0 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -59,6 +59,7 @@ static void *dense_alloc(HashJoinTable hashtable, Size size);
 static HashJoinTuple ExecParallelHashTupleAlloc(HashJoinTable hashtable,
 												size_t size,
 												dsa_pointer *shared);
+static void ExecParallelHashTableEvictBatch0(HashJoinTable hashtable);
 static void MultiExecPrivateHash(HashState *node);
 static void MultiExecParallelHash(HashState *node);
 static inline HashJoinTuple ExecParallelHashFirstTuple(HashJoinTable hashtable,
@@ -71,6 +72,9 @@ static inline void ExecParallelHashPushTuple(dsa_pointer_atomic *head,
 static void ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch);
 static void ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable);
 static void ExecParallelHashRepartitionFirst(HashJoinTable hashtable);
+static void ExecParallelHashRepartitionBatch0Tuple(HashJoinTable hashtable,
+												   MinimalTuple tuple,
+												   uint32 hashvalue);
 static void ExecParallelHashRepartitionRest(HashJoinTable hashtable);
 static HashMemoryChunk ExecParallelHashPopChunkQueue(HashJoinTable hashtable,
 													 dsa_pointer *shared);
@@ -188,13 +192,53 @@ MultiExecPrivateHash(HashState *node)
 			}
 			else
 			{
-				/* Not subject to skew optimization, so insert normally */
-				ExecHashTableInsert(hashtable, slot, hashvalue);
+				/*
+				 * Not subject to skew optimization, so either insert normally
+				 * or save to batch file if batch 0 falls back and we have
+				 * already filled the hashtable up to space_allowed.
+				 */
+				int			bucketno;
+				int			batchno;
+				bool		shouldFree;
+				MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+				ExecHashGetBucketAndBatch(hashtable, hashvalue,
+										  &bucketno, &batchno);
+
+				/*
+				 * If we set batch 0 to fallback on the previous tuple Save
+				 * the tuples in this batch which will not fit in the
+				 * hashtable should I be checking that hashtable->curstripe !=
+				 * 0?
+				 */
+				if (hashtable->hashloopBatchFile && hashtable->hashloopBatchFile[0])
+					ExecHashJoinSaveTuple(tuple,
+										  hashvalue,
+										  &hashtable->innerBatchFile[batchno], hashtable);
+				else
+					ExecHashTableInsert(hashtable, slot, hashvalue);
+
+				if (shouldFree)
+					heap_free_minimal_tuple(tuple);
 			}
 			hashtable->totalTuples += 1;
 		}
 	}
 
+	/*
+	 * If batch 0 fell back, rewind the inner side file where we saved the
+	 * tuples which did not fit in memory to prepare it for loading upon
+	 * finishing probing stripe 0 of batch 0
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[0])
+	{
+		if (BufFileSeek(hashtable->innerBatchFile[0], 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
+	}
+
+
 	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
 	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
 		ExecHashIncreaseNumBuckets(hashtable);
@@ -328,9 +372,9 @@ MultiExecParallelHash(HashState *node)
 				 * are now fixed.  While building them we made sure they'd fit
 				 * in our memory budget when we load them back in later (or we
 				 * tried to do that and gave up because we detected extreme
-				 * skew).
+				 * skew and thus marked them to fall back).
 				 */
-				pstate->growth = PHJ_GROWTH_DISABLED;
+				pstate->growth = PHJ_GROWTH_LOADING;
 			}
 	}
 
@@ -506,12 +550,14 @@ ExecHashTableCreate(HashState *state)
 	hashtable->curbatch = 0;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
-	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
 	hashtable->partialTuples = 0;
 	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
+	hashtable->hashloopBatchFile = NULL;
+	hashtable->fallback_batches_stats = NULL;
+	hashtable->curstripe = STRIPE_DETACHED;
 	hashtable->spaceUsed = 0;
 	hashtable->spacePeak = 0;
 	hashtable->spaceAllowed = space_allowed;
@@ -561,9 +607,9 @@ ExecHashTableCreate(HashState *state)
 
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
+		hashtable->hashloopBatchFile = palloc0_array(BufFile *, nbatch);
 
 		MemoryContextSwitchTo(oldctx);
-
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
@@ -868,18 +914,19 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	int			i;
 
 	/*
-	 * Make sure all the temp files are closed.  We skip batch 0, since it
-	 * can't have any temp files (and the arrays might not even exist if
-	 * nbatch is only 1).  Parallel hash joins don't use these files.
+	 * Make sure all the temp files are closed.  Parallel hash joins don't use
+	 * these files.
 	 */
 	if (hashtable->innerBatchFile != NULL)
 	{
-		for (i = 1; i < hashtable->nbatch; i++)
+		for (i = 0; i < hashtable->nbatch; i++)
 		{
 			if (hashtable->innerBatchFile[i])
 				BufFileClose(hashtable->innerBatchFile[i]);
 			if (hashtable->outerBatchFile[i])
 				BufFileClose(hashtable->outerBatchFile[i]);
+			if (hashtable->hashloopBatchFile[i])
+				BufFileClose(hashtable->hashloopBatchFile[i]);
 		}
 	}
 
@@ -890,6 +937,18 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/*
+ * Threshhold for tuple relocation during batch split for parallel and serial
+ * hashjoin.
+ * While growing the number of batches, for the batch which triggered the growth,
+ * if more than MAX_RELOCATION % of its tuples move to its child batch, then
+ * it likely has skewed data and so the child batch (the new home to the skewed
+ * tuples) will be marked as a "fallback" batch and processed using the hashloop
+ * join algorithm. The reverse is true as well: if more than MAX_RELOCATION
+ * remain in the parent, it too should be marked to "fallback".
+ */
+#define MAX_RELOCATION 0.8
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
@@ -900,13 +959,18 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 {
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
+	int			childbatch;
 	int			nbatch;
 	long		ninmemory;
 	long		nfreed;
 	HashMemoryChunk oldchunks;
+	int			curbatch_outgoing_tuples;
+	int			childbatch_outgoing_tuples;
+	int			target_batch;
+	FallbackBatchStats *fallback_batch_stats;
+	size_t		batchSize = 0;
 
-	/* do nothing if we've decided to shut off growth */
-	if (!hashtable->growEnabled)
+	if (hashtable->hashloopBatchFile && hashtable->hashloopBatchFile[curbatch])
 		return;
 
 	/* safety check to avoid overflow */
@@ -928,9 +992,9 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		/* we had no file arrays before */
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
+		hashtable->hashloopBatchFile = palloc0_array(BufFile *, nbatch);
 
 		MemoryContextSwitchTo(oldcxt);
-
 		/* time to establish the temp tablespaces, too */
 		PrepareTempTablespaces();
 	}
@@ -939,6 +1003,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		/* enlarge arrays and zero out added entries */
 		hashtable->innerBatchFile = repalloc0_array(hashtable->innerBatchFile, BufFile *, oldnbatch, nbatch);
 		hashtable->outerBatchFile = repalloc0_array(hashtable->outerBatchFile, BufFile *, oldnbatch, nbatch);
+		hashtable->hashloopBatchFile = repalloc0_array(hashtable->outerBatchFile, BufFile *, oldnbatch, nbatch);
 	}
 
 	hashtable->nbatch = nbatch;
@@ -948,6 +1013,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * no longer of the current batch.
 	 */
 	ninmemory = nfreed = 0;
+	curbatch_outgoing_tuples = childbatch_outgoing_tuples = 0;
+	childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;
 
 	/* If know we need to resize nbuckets, we can do it while rebatching. */
 	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
@@ -994,7 +1061,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
 
-			if (batchno == curbatch)
+			if (batchno == curbatch && (curbatch != 0 || batchSize + hashTupleSize < hashtable->spaceAllowed))
 			{
 				/* keep tuple in memory - copy it into the new chunk */
 				HashJoinTuple copyTuple;
@@ -1005,11 +1072,13 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* and add it back to the appropriate bucket */
 				copyTuple->next.unshared = hashtable->buckets.unshared[bucketno];
 				hashtable->buckets.unshared[bucketno] = copyTuple;
+				curbatch_outgoing_tuples++;
+				batchSize += hashTupleSize;
 			}
 			else
 			{
 				/* dump it out */
-				Assert(batchno > curbatch);
+				Assert(batchno > curbatch || batchSize + hashTupleSize >= hashtable->spaceAllowed);
 				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
 									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno],
@@ -1017,6 +1086,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 
 				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
+
+				/*
+				 * TODO: what to do about tuples that don't go to the child
+				 * batch or stay in the current batch? (this is why we are
+				 * counting tuples to child and curbatch with two diff
+				 * variables in case the tuples go to a batch that isn't the
+				 * child)
+				 */
+				if (batchno == childbatch)
+					childbatch_outgoing_tuples++;
 			}
 
 			/* next tuple in this chunk */
@@ -1037,21 +1116,33 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 #endif
 
 	/*
-	 * If we dumped out either all or none of the tuples in the table, disable
-	 * further expansion of nbatch.  This situation implies that we have
-	 * enough tuples of identical hashvalues to overflow spaceAllowed.
-	 * Increasing nbatch will not fix it since there's no way to subdivide the
-	 * group any more finely. We have to just gut it out and hope the server
-	 * has enough RAM.
+	 * The same batch should not be marked to fall back more than once
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
-	{
-		hashtable->growEnabled = false;
 #ifdef HJDEBUG
-		printf("Hashjoin %p: disabling further increase of nbatch\n",
-			   hashtable);
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("childbatch %i targeted to fallback.", childbatch);
+	if ((curbatch_outgoing_tuples / (float) ninmemory) >= 0.8)
+		printf("curbatch %i targeted to fallback.", curbatch);
 #endif
-	}
+
+	/*
+	 * If too many tuples remain in the parent or too many tuples migrate to
+	 * the child, there is likely skew and continuing to increase the number
+	 * of batches will not help. Mark the batch which contains the skewed
+	 * tuples to be processed with block nested hashloop join.
+	 */
+	if ((childbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
+		target_batch = childbatch;
+	else if ((curbatch_outgoing_tuples / (float) ninmemory) >= MAX_RELOCATION)
+		target_batch = curbatch;
+	else
+		return;
+	hashtable->hashloopBatchFile[target_batch] = BufFileCreateTemp(false);
+
+	fallback_batch_stats = palloc0(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = target_batch;
+	fallback_batch_stats->numstripes = 0;
+	hashtable->fallback_batches_stats = lappend(hashtable->fallback_batches_stats, fallback_batch_stats);
 }
 
 /*
@@ -1210,6 +1301,11 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 			ExecParallelHashTableSetCurrentBatch(hashtable, 0);
 			/* Then partition, flush counters. */
 			ExecParallelHashRepartitionFirst(hashtable);
+
+			/*
+			 * TODO: add a debugging check that confirms that all the tuples
+			 * from the old generation are present in the new generation
+			 */
 			ExecParallelHashRepartitionRest(hashtable);
 			ExecParallelHashMergeCounters(hashtable);
 			/* Wait for the above to be finished. */
@@ -1229,7 +1325,6 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 			{
 				ParallelHashJoinBatch *old_batches;
 				bool		space_exhausted = false;
-				bool		extreme_skew_detected = false;
 
 				/* Make sure that we have the current dimensions and buckets. */
 				ExecParallelHashEnsureBatchAccessors(hashtable);
@@ -1245,6 +1340,24 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					int			parent;
 
 					batch = hashtable->batches[i].shared;
+					/*
+					 * All batches were just created anew during
+					 * repartitioning
+					 */
+					Assert(!hashtable->batches[i].shared->hashloop_fallback);
+
+					/*
+					 * At the time of repartitioning, each batch updates its
+					 * estimated_size to reflect the size of the batch file on
+					 * disk. It is also updated when increasing preallocated
+					 * space in ExecParallelHashTuplePrealloc().
+					 *
+					 * Batch 0 is inserted into memory during the build stage,
+					 * it can spill to a file, so the size member, which
+					 * reflects the part of batch 0 in memory should never
+					 * exceed the space_allowed.
+					 */
+					Assert(batch->size <= pstate->space_allowed);
 					if (batch->space_exhausted ||
 						batch->estimated_size > pstate->space_allowed)
 						space_exhausted = true;
@@ -1254,19 +1367,57 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					if (old_batch->space_exhausted ||
 						batch->estimated_size > pstate->space_allowed)
 					{
+						float		frac_moved;
+
+						parent = i % pstate->old_nbatch;
+						frac_moved = batch->ntuples / (float) hashtable->batches[parent].shared->old_ntuples;
 						/*
-						 * Did this batch receive ALL of the tuples from its
-						 * parent batch?  That would indicate that further
-						 * repartitioning isn't going to help (the hash values
-						 * are probably all the same).
+						 * If too many tuples remain in the parent or too many
+						 * tuples migrate to the child, there is likely skew
+						 * and continuing to increase the number of batches
+						 * will not help. Mark the batch which contains the
+						 * skewed tuples to be processed with block nested
+						 * hashloop join.
 						 */
-						if (batch->ntuples == hashtable->batches[parent].shared->old_ntuples)
-							extreme_skew_detected = true;
+						if (frac_moved >= MAX_RELOCATION)
+						{
+							batch->hashloop_fallback = true;
+							space_exhausted = false;
+						}
 					}
 				}
 
-				/* Don't keep growing if it's not helping or we'd overflow. */
-				if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+				/*
+					 * If all of the tuples in the hashtable were put back in
+					 * the hashtable during repartitioning, mark this batch as
+					 * a fallback batch so that we will evict the tuples to a
+					 * spill file were we to run out of space again This has
+					 * the problem of wasting a lot of time during the probe
+					 * phase if it turns out that we never try and allocate
+					 * any more memory in the hashtable.
+					 *
+					 * TODO: It might be worth doing something to indicate
+					 * that if all of the tuples went back into a batch but it
+					 * only exactly used the space_allowed, that the batch is
+					 * not a fallback batch yet but that the current stripe is
+					 * full, so if you need to allocate more, it would mark it
+					 * as a fallback batch. Otherwise, a batch 0 with no
+					 * tuples in spill files will still be treated as a
+					 * fallback batch during probing
+					 */
+					if (/*i == 0 && */hashtable->batches[0].shared->size == pstate->space_allowed)
+					{
+						if (hashtable->batches[0].shared->ntuples == hashtable->batches[0].shared->old_ntuples)
+						{
+							hashtable->batches[0].shared->hashloop_fallback = true;
+							space_exhausted = false;
+						}
+					}
+					if (space_exhausted)
+						break;
+
+				/* Don't keep growing if we'd overflow. */
+				if (hashtable->nbatch >= INT_MAX / 2)
 					pstate->growth = PHJ_GROWTH_DISABLED;
 				else if (space_exhausted)
 					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
@@ -1294,65 +1445,153 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 static void
 ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 {
+	ParallelHashJoinState *pstate;
+
+	ParallelHashJoinBatch *old_shared;
+	SharedTuplestoreAccessor *old_inner_batch0_sts;
+
 	dsa_pointer chunk_shared;
 	HashMemoryChunk chunk;
 
-	Assert(hashtable->nbatch == hashtable->parallel_state->nbatch);
+	ParallelHashJoinBatch *old_batches = (ParallelHashJoinBatch *) dsa_get_address(hashtable->area, hashtable->parallel_state->old_batches);
+
+	Assert(old_batches);
+	old_shared = NthParallelHashJoinBatch(old_batches, 0);
+	old_inner_batch0_sts = sts_attach(ParallelHashJoinBatchInner(old_shared), ParallelWorkerNumber + 1, &hashtable->parallel_state->fileset);
+
+	pstate = hashtable->parallel_state;
 
-	while ((chunk = ExecParallelHashPopChunkQueue(hashtable, &chunk_shared)))
+	Assert(hashtable->nbatch == hashtable->parallel_state->nbatch);
+	BarrierAttach(&pstate->repartition_barrier);
+	switch (PHJ_REPARTITION_BATCH0_PHASE(BarrierPhase(&pstate->repartition_barrier)))
 	{
-		size_t		idx = 0;
+		case PHJ_REPARTITION_BATCH0_DRAIN_QUEUE:
+			while ((chunk = ExecParallelHashPopChunkQueue(hashtable, &chunk_shared)))
+			{
+				MinimalTuple tuple;
+				size_t		idx = 0;
 
-		/* Repartition all tuples in this chunk. */
-		while (idx < chunk->used)
-		{
-			HashJoinTuple hashTuple = (HashJoinTuple) (HASH_CHUNK_DATA(chunk) + idx);
-			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
-			HashJoinTuple copyTuple;
-			dsa_pointer shared;
-			int			bucketno;
-			int			batchno;
+				/*
+				 * Repartition all tuples in this chunk. These tuples may be
+				 * relocated to a batch file or may be inserted back into
+				 * memory.
+				 */
+				while (idx < chunk->used)
+				{
+					HashJoinTuple hashTuple = (HashJoinTuple) (HASH_CHUNK_DATA(chunk) + idx);
 
-			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
-									  &bucketno, &batchno);
+					tuple = HJTUPLE_MINTUPLE(hashTuple);
 
-			Assert(batchno < hashtable->nbatch);
-			if (batchno == 0)
-			{
-				/* It still belongs in batch 0.  Copy to a new chunk. */
-				copyTuple =
-					ExecParallelHashTupleAlloc(hashtable,
-											   HJTUPLE_OVERHEAD + tuple->t_len,
-											   &shared);
-				copyTuple->hashvalue = hashTuple->hashvalue;
-				memcpy(HJTUPLE_MINTUPLE(copyTuple), tuple, tuple->t_len);
-				ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
-										  copyTuple, shared);
+					ExecParallelHashRepartitionBatch0Tuple(hashtable,
+														   tuple,
+														   hashTuple->hashvalue);
+
+					idx += MAXALIGN(HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(hashTuple)->t_len);
+				}
+
+				dsa_free(hashtable->area, chunk_shared);
+				CHECK_FOR_INTERRUPTS();
 			}
-			else
+			BarrierArriveAndWait(&pstate->repartition_barrier, WAIT_EVENT_HASH_REPARTITION_BATCH0_DRAIN_QUEUE);
+			/* FALLTHROUGH */
+		case PHJ_REPARTITION_BATCH0_DRAIN_SPILL_FILE:
 			{
-				size_t		tuple_size =
-					MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+				MinimalTuple tuple;
+				tupleMetadata metadata;
+
+				/*
+				 * Repartition all of the tuples in this spill file. These
+				 * tuples may go back into the hashtable if space was freed up
+				 * or they may go into another batch or they may go into the
+				 * batch 0 spill file.
+				 */
+				sts_begin_parallel_scan(old_inner_batch0_sts);
+				while ((tuple = sts_parallel_scan_next(old_inner_batch0_sts,
+													   &metadata.hashvalue)))
+				{
 
-				/* It belongs in a later batch. */
-				hashtable->batches[batchno].estimated_size += tuple_size;
-				sts_puttuple(hashtable->batches[batchno].inner_tuples,
-							 &hashTuple->hashvalue, tuple);
+					ExecParallelHashRepartitionBatch0Tuple(hashtable,
+														   tuple,
+														   metadata.hashvalue);
+				}
+				sts_end_parallel_scan(old_inner_batch0_sts);
 			}
+	}
+	BarrierArriveAndDetach(&pstate->repartition_barrier);
+}
 
-			/* Count this tuple. */
-			++hashtable->batches[0].old_ntuples;
-			++hashtable->batches[batchno].ntuples;
+static void
+ExecParallelHashRepartitionBatch0Tuple(HashJoinTable hashtable,
+									   MinimalTuple tuple,
+									   uint32 hashvalue)
+{
+	int			batchno;
+	int			bucketno;
+	dsa_pointer shared;
+	HashJoinTuple copyTuple;
+	ParallelHashJoinState *pstate = hashtable->parallel_state;
+	bool		spill = true;
+	bool		hashtable_full = hashtable->batches[0].shared->size >= pstate->space_allowed;
+	size_t		tuple_size =
+	MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
-			idx += MAXALIGN(HJTUPLE_OVERHEAD +
-							HJTUPLE_MINTUPLE(hashTuple)->t_len);
+	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
+
+	/*
+	 * We don't take a lock to read pstate->space_allowed because it should
+	 * not change during execution of the hash join
+	 */
+
+	Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASH_INNER);
+	if (batchno == 0 && !hashtable_full)
+	{
+		copyTuple = ExecParallelHashTupleAlloc(hashtable,
+											   HJTUPLE_OVERHEAD + tuple->t_len,
+											   &shared);
+
+		/*
+		 * TODO: do we need to check if growth was set to
+		 * PHJ_GROWTH_SPILL_BATCH0?
+		 */
+		if (copyTuple)
+		{
+			/* Store the hash value in the HashJoinTuple header. */
+			copyTuple->hashvalue = hashvalue;
+			memcpy(HJTUPLE_MINTUPLE(copyTuple), tuple, tuple->t_len);
+
+			/* Push it onto the front of the bucket's list */
+			ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
+									  copyTuple, shared);
+			pg_atomic_add_fetch_u64(&hashtable->batches[0].shared->ntuples_in_memory, 1);
+
+			spill = false;
 		}
+	}
+
+	if (spill)
+	{
 
-		/* Free this chunk. */
-		dsa_free(hashtable->area, chunk_shared);
+		tupleMetadata metadata;
 
-		CHECK_FOR_INTERRUPTS();
+		ParallelHashJoinBatchAccessor *batch_accessor = &(hashtable->batches[batchno]);
+
+		/*
+		 * It is okay to use backend local here because force spill tuple is
+		 * only done during repartitioning when we can't grow batches so won't
+		 * make decision based on it and will merge counters during deciding
+		 * and during evictbatch0 which can ony be done on a batch that is
+		 * already fallback so we won't make decision on it and will merge
+		 * counters after the build phase
+		 */
+		batch_accessor->estimated_size += tuple_size;
+		metadata.hashvalue = hashvalue;
+
+		sts_puttuple(batch_accessor->inner_tuples,
+					 &metadata,
+					 tuple);
 	}
+	++hashtable->batches[batchno].ntuples;
+	++hashtable->batches[0].old_ntuples;
 }
 
 /*
@@ -1389,24 +1628,41 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 
 		/* Scan one partition from the previous generation. */
 		sts_begin_parallel_scan(old_inner_tuples[i]);
-		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i], &hashvalue)))
+		while ((tuple = sts_parallel_scan_next(old_inner_tuples[i],
+											   &hashvalue)))
 		{
-			size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 			int			bucketno;
 			int			batchno;
+			size_t		tuple_size;
+			tupleMetadata metadata;
+			ParallelHashJoinBatchAccessor *batch_accessor;
+
 
 			/* Decide which partition it goes to in the new generation. */
 			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
 									  &batchno);
 
-			hashtable->batches[batchno].estimated_size += tuple_size;
-			++hashtable->batches[batchno].ntuples;
-			++hashtable->batches[i].old_ntuples;
+			tuple_size =
+				MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
-			/* Store the tuple its new batch. */
-			sts_puttuple(hashtable->batches[batchno].inner_tuples,
-						 &hashvalue, tuple);
+			batch_accessor = &(hashtable->batches[batchno]);
 
+			/*
+			 * It is okay to use backend local here because force spill tuple
+			 * is only done during repartitioning when we can't grow batches
+			 * so won't make decision based on it and will merge counters
+			 * during deciding and during evictbatch0 which can ony be done on
+			 * a batch that is already fallback so we won't make decision on
+			 * it and will merge counters after the build phase
+			 */
+			batch_accessor->estimated_size += tuple_size;
+			metadata.hashvalue = hashvalue;
+
+			sts_puttuple(batch_accessor->inner_tuples,
+						 &metadata,
+						 tuple);
+			++hashtable->batches[batchno].ntuples;
+			++hashtable->batches[i].old_ntuples;
 			CHECK_FOR_INTERRUPTS();
 		}
 		sts_end_parallel_scan(old_inner_tuples[i]);
@@ -1724,7 +1980,7 @@ retry:
 		hashTuple = ExecParallelHashTupleAlloc(hashtable,
 											   HJTUPLE_OVERHEAD + tuple->t_len,
 											   &shared);
-		if (hashTuple == NULL)
+		if (!hashTuple)
 			goto retry;
 
 		/* Store the hash value in the HashJoinTuple header. */
@@ -1735,10 +1991,13 @@ retry:
 		/* Push it onto the front of the bucket's list */
 		ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
 								  hashTuple, shared);
+		pg_atomic_add_fetch_u64(&hashtable->batches[0].shared->ntuples_in_memory, 1);
+
 	}
 	else
 	{
 		size_t		tuple_size = MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+		tupleMetadata metadata;
 
 		Assert(batchno > 0);
 
@@ -1751,7 +2010,11 @@ retry:
 
 		Assert(hashtable->batches[batchno].preallocated >= tuple_size);
 		hashtable->batches[batchno].preallocated -= tuple_size;
-		sts_puttuple(hashtable->batches[batchno].inner_tuples, &hashvalue,
+
+		metadata.hashvalue = hashvalue;
+
+		sts_puttuple(hashtable->batches[batchno].inner_tuples,
+					 &metadata,
 					 tuple);
 	}
 	++hashtable->batches[batchno].ntuples;
@@ -1766,10 +2029,11 @@ retry:
  * to other batches or to run out of memory, and should only be called with
  * tuples that belong in the current batch once growth has been disabled.
  */
-void
+MinimalTuple
 ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
 										TupleTableSlot *slot,
-										uint32 hashvalue)
+										uint32 hashvalue,
+										int read_participant)
 {
 	bool		shouldFree;
 	MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
@@ -1778,19 +2042,26 @@ ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
 	int			batchno;
 	int			bucketno;
 
+
 	ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
 	Assert(batchno == hashtable->curbatch);
+
 	hashTuple = ExecParallelHashTupleAlloc(hashtable,
 										   HJTUPLE_OVERHEAD + tuple->t_len,
 										   &shared);
+	if (!hashTuple)
+		return NULL;
+
 	hashTuple->hashvalue = hashvalue;
 	memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
 	HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
 	ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
 							  hashTuple, shared);
+	pg_atomic_add_fetch_u64(&hashtable->batches[hashtable->curbatch].shared->ntuples_in_memory, 1);
 
 	if (shouldFree)
 		heap_free_minimal_tuple(tuple);
+	return tuple;
 }
 
 
@@ -2654,6 +2925,12 @@ ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt)
 		pcxt->nworkers * sizeof(HashInstrumentation);
 	node->shared_info = (SharedHashInfo *) shm_toc_allocate(pcxt->toc, size);
 
+	/*
+	 * TODO: the linked list which is being used for fallback stats needs
+	 * space allocated for it in shared memory as well. For now, it seems to
+	 * be coincidentally working
+	 */
+
 	/* Each per-worker area must start out as zeroes. */
 	memset(node->shared_info, 0, size);
 
@@ -2752,6 +3029,11 @@ ExecHashAccumInstrumentation(HashInstrumentation *instrument,
 									  hashtable->nbatch_original);
 	instrument->space_peak = Max(instrument->space_peak,
 								 hashtable->spacePeak);
+
+	/*
+	 * TODO: this doesn't work right now in case of rescan (doesn't get max)
+	 */
+	instrument->fallback_batches_stats = hashtable->fallback_batches_stats;
 }
 
 /*
@@ -2826,6 +3108,146 @@ dense_alloc(HashJoinTable hashtable, Size size)
 	return ptr;
 }
 
+/*
+ * Assume caller has a lock or is behind a barrier and has the right
+ * to change these values
+ */
+inline void
+ExecParallelHashTableRecycle(HashJoinTable hashtable)
+{
+	ParallelHashJoinBatchAccessor *batch_accessor = &(hashtable->batches[hashtable->curbatch]);
+	ParallelHashJoinBatch *batch = batch_accessor->shared;
+
+	dsa_pointer_atomic *buckets = (dsa_pointer_atomic *)
+	dsa_get_address(hashtable->area, batch->buckets);
+
+	for (size_t i = 0; i < hashtable->nbuckets; ++i)
+		dsa_pointer_atomic_write(&buckets[i], InvalidDsaPointer);
+	batch->size = 0;
+	batch->space_exhausted = false;
+
+	/*
+	 * TODO: I'm not sure that we want to reset this when this function is
+	 * called to recycle the hashtable during the build stage as part of
+	 * evicting batch 0. It seems like it would be okay since a worker does
+	 * not have the right to over-allocate now. So, for a fallback batch,
+	 * at_least_one_chunk doesn't matter It seems like it may not matter at
+	 * all anymore...
+	 */
+	batch_accessor->at_least_one_chunk = false;
+	pg_atomic_exchange_u64(&batch->ntuples_in_memory, 0);
+}
+
+/*
+ * The eviction phase machine is responsible for evicting tuples from the
+ * hashtable during the Build stage of executing a parallel-aware parallel
+ * hash join. After increasing the number of batches in
+ * ExecParallelHashIncreaseNumBatches(), in the PHJ_GROW_BATCHES_DECIDING
+ * phase, if the batch 0 hashtable meets the criteria for falling back
+ * and is marked a fallback batch, the next time an inserted tuple would
+ * exceed the space_allowed, instead, trigger an eviction. Evict all
+ * batch 0 tuples to spill files in batch 0 inner side SharedTuplestore.
+ */
+static void
+ExecParallelHashTableEvictBatch0(HashJoinTable hashtable)
+{
+
+	ParallelHashJoinState *pstate = hashtable->parallel_state;
+	ParallelHashJoinBatchAccessor *batch0_accessor = &(hashtable->batches[0]);
+
+	/*
+	 * No other workers must be inserting tuples into the hashtable once
+	 * growth has been set to PHJ_EVICT. Otherwise, the below will not work
+	 * correctly. This should be okay since the same assumptions are made in
+	 * the increase batch machine.
+	 */
+	BarrierAttach(&pstate->eviction_barrier);
+	switch (PHJ_EVICT_PHASE(BarrierPhase(&pstate->eviction_barrier)))
+	{
+		case PHJ_EVICT_ELECTING:
+			if (BarrierArriveAndWait(&pstate->eviction_barrier, WAIT_EVENT_HASH_EVICT_ELECT))
+			{
+				pstate->chunk_work_queue = batch0_accessor->shared->chunks;
+				batch0_accessor->shared->chunks = InvalidDsaPointer;
+				ExecParallelHashTableRecycle(hashtable);
+			}
+			/* FALLTHROUGH */
+		case PHJ_EVICT_RESETTING:
+			BarrierArriveAndWait(&pstate->eviction_barrier, WAIT_EVENT_HASH_EVICT_RESET);
+			/* FALLTHROUGH */
+		case PHJ_EVICT_SPILLING:
+			{
+				dsa_pointer chunk_shared;
+				HashMemoryChunk chunk;
+
+				/*
+				 * TODO: Do I need to do this here? am I guaranteed to have
+				 * the correct shared memory reference to the batches array
+				 * already?
+				 */
+				ParallelHashJoinBatch *batches;
+				ParallelHashJoinBatch *batch0;
+
+				batches = (ParallelHashJoinBatch *)
+					dsa_get_address(hashtable->area, pstate->batches);
+				batch0 = NthParallelHashJoinBatch(batches, 0);
+				Assert(batch0 == hashtable->batches[0].shared);
+
+				ExecParallelHashTableSetCurrentBatch(hashtable, 0);
+
+				while ((chunk = ExecParallelHashPopChunkQueue(hashtable, &chunk_shared)))
+				{
+					size_t		idx = 0;
+
+					while (idx < chunk->used)
+					{
+						tupleMetadata metadata;
+
+						size_t		tuple_size;
+						MinimalTuple minTuple;
+						HashJoinTuple hashTuple = (HashJoinTuple) (HASH_CHUNK_DATA(chunk) + idx);
+
+						minTuple = HJTUPLE_MINTUPLE(hashTuple);
+
+						tuple_size =
+							MAXALIGN(HJTUPLE_OVERHEAD + minTuple->t_len);
+
+						/*
+						 * It is okay to use backend local here because can
+						 * ony be done on a batch that is already fallback so
+						 * we won't make decision on it and will merge
+						 * counters after the build phase
+						 */
+						batch0_accessor->estimated_size += tuple_size;
+						metadata.hashvalue = hashTuple->hashvalue;
+
+						sts_puttuple(batch0_accessor->inner_tuples,
+									 &metadata,
+									 minTuple);
+
+						idx += MAXALIGN(HJTUPLE_OVERHEAD +
+										HJTUPLE_MINTUPLE(hashTuple)->t_len);
+					}
+					dsa_free(hashtable->area, chunk_shared);
+
+					CHECK_FOR_INTERRUPTS();
+				}
+				BarrierArriveAndWait(&pstate->eviction_barrier, WAIT_EVENT_HASH_EVICT_SPILL);
+			}
+			/* FALLTHROUGH */
+		case PHJ_EVICT_FINISHING:
+
+			/*
+			 * TODO: Is this phase needed?
+			 */
+			if (BarrierArriveAndWait(&pstate->eviction_barrier, WAIT_EVENT_HASH_EVICT_FINISH))
+				pstate->growth = PHJ_GROWTH_OK;
+			/* FALLTHROUGH */
+		case PHJ_EVICT_DONE:
+			BarrierArriveAndDetach(&pstate->eviction_barrier);
+	}
+}
+
 /*
  * Allocate space for a tuple in shared dense storage.  This is equivalent to
  * dense_alloc but for Parallel Hash using shared memory.
@@ -2838,7 +3260,8 @@ dense_alloc(HashJoinTable hashtable, Size size)
  * possibility that the tuple no longer belongs in the same batch).
  */
 static HashJoinTuple
-ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
+ExecParallelHashTupleAlloc(HashJoinTable hashtable,
+						   size_t size,
 						   dsa_pointer *shared)
 {
 	ParallelHashJoinState *pstate = hashtable->parallel_state;
@@ -2879,7 +3302,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 	 * Check if we need to help increase the number of buckets or batches.
 	 */
 	if (pstate->growth == PHJ_GROWTH_NEED_MORE_BATCHES ||
-		pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
+		pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS ||
+		pstate->growth == PHJ_GROWTH_SPILL_BATCH0)
 	{
 		ParallelHashGrowth growth = pstate->growth;
 
@@ -2891,6 +3315,8 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 			ExecParallelHashIncreaseNumBatches(hashtable);
 		else if (growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
 			ExecParallelHashIncreaseNumBuckets(hashtable);
+		else if (growth == PHJ_GROWTH_SPILL_BATCH0)
+			ExecParallelHashTableEvictBatch0(hashtable);
 
 		/* The caller must retry. */
 		return NULL;
@@ -2903,7 +3329,7 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		chunk_size = HASH_CHUNK_SIZE;
 
 	/* Check if it's time to grow batches or buckets. */
-	if (pstate->growth != PHJ_GROWTH_DISABLED)
+	if (pstate->growth != PHJ_GROWTH_DISABLED && pstate->growth != PHJ_GROWTH_LOADING)
 	{
 		Assert(curbatch == 0);
 		Assert(BarrierPhase(&pstate->build_barrier) == PHJ_BUILD_HASH_INNER);
@@ -2912,16 +3338,26 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 		 * Check if our space limit would be exceeded.  To avoid choking on
 		 * very large tuples or very low hash_mem setting, we'll always allow
 		 * each backend to allocate at least one chunk.
+		 *
+		 * If the batch has already been marked to fall back, then we don't
+		 * need to worry about having allocated one chunk -- we should start
+		 * evicting tuples.
 		 */
-		if (hashtable->batches[0].at_least_one_chunk &&
-			hashtable->batches[0].shared->size +
+		LWLockAcquire(&hashtable->batches[0].shared->lock, LW_EXCLUSIVE);
+		if (hashtable->batches[0].shared->size +
 			chunk_size > pstate->space_allowed)
 		{
-			pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
-			hashtable->batches[0].shared->space_exhausted = true;
-			LWLockRelease(&pstate->lock);
-
-			return NULL;
+			if (hashtable->batches[0].shared->hashloop_fallback || hashtable->batches[0].at_least_one_chunk)
+			{
+				if (hashtable->batches[0].shared->hashloop_fallback)
+					pstate->growth = PHJ_GROWTH_SPILL_BATCH0;
+				else if (hashtable->batches[0].at_least_one_chunk)
+					pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
+				hashtable->batches[0].shared->space_exhausted = true;
+				LWLockRelease(&pstate->lock);
+				LWLockRelease(&hashtable->batches[0].shared->lock);
+				return NULL;
+			}
 		}
 
 		/* Check if our load factor limit would be exceeded. */
@@ -2938,14 +3374,60 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 			{
 				pstate->growth = PHJ_GROWTH_NEED_MORE_BUCKETS;
 				LWLockRelease(&pstate->lock);
+				LWLockRelease(&hashtable->batches[0].shared->lock);
 
 				return NULL;
 			}
 		}
+		LWLockRelease(&hashtable->batches[0].shared->lock);
 	}
 
+	/*
+	 * TODO: should I care about hashtable->batches[b].at_least_one_chunk
+	 * here?
+	 */
+	if (pstate->growth == PHJ_GROWTH_LOADING)
+	{
+		int			b = hashtable->curbatch;
+
+		LWLockAcquire(&hashtable->batches[b].shared->lock, LW_EXCLUSIVE);
+		if (hashtable->batches[b].shared->hashloop_fallback &&
+			(hashtable->batches[b].shared->space_exhausted ||
+			 hashtable->batches[b].shared->size + chunk_size > pstate->space_allowed))
+		{
+			bool		space_exhausted = hashtable->batches[b].shared->space_exhausted;
+
+			if (!space_exhausted)
+				hashtable->batches[b].shared->space_exhausted = true;
+			LWLockRelease(&pstate->lock);
+			LWLockRelease(&hashtable->batches[b].shared->lock);
+			return NULL;
+		}
+		LWLockRelease(&hashtable->batches[b].shared->lock);
+	}
+
+	/*
+	 * If not even one chunk would fit in the space_allowed, there isn't
+	 * anything we can do to avoid exceeding space_allowed. Also, if we keep
+	 * the rule that a backend should be allowed to allocate at least one
+	 * chunk, then we will end up tripping this assert some of the time unless
+	 * we make that exception (should we make that exception?) TODO: should
+	 * memory settings < chunk_size even be allowed. Should it error out?
+	 * should we be able to make this assertion?
+	 * Assert(hashtable->batches[hashtable->curbatch].shared->size +
+	 * chunk_size <= pstate->space_allowed);
+	 */
+
 	/* We are cleared to allocate a new chunk. */
 	chunk_shared = dsa_allocate(hashtable->area, chunk_size);
+
+	/*
+	 * The chunk is accounted for in the hashtable size only. Even though
+	 * batch 0 can spill, we don't need to track this allocated chunk in the
+	 * estimated_stripe_size member because we check the size member when
+	 * determining if the hashtable is too big, and, we will only ever number
+	 * stripes (starting with 1 instead of 0 for batch 0) in the spill file.
+	 */
 	hashtable->batches[curbatch].shared->size += chunk_size;
 	hashtable->batches[curbatch].at_least_one_chunk = true;
 
@@ -3018,21 +3500,40 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 		char		name[MAXPGPATH];
+		char		sbname[MAXPGPATH];
+
+		shared->hashloop_fallback = false;
+		pg_atomic_init_flag(&shared->overflow_required);
+		pg_atomic_init_u64(&shared->ntuples_in_memory, 0);
+		/* TODO: is it okay to use the same tranche for this lock? */
+		LWLockInitialize(&shared->lock, LWTRANCHE_PARALLEL_HASH_JOIN);
+		shared->nstripes = 0;
 
 		/*
 		 * All members of shared were zero-initialized.  We just need to set
 		 * up the Barrier.
 		 */
 		BarrierInit(&shared->batch_barrier, 0);
+		BarrierInit(&shared->stripe_barrier, 0);
+
+		/* Batch 0 doesn't need to be loaded. */
 		if (i == 0)
 		{
-			/* Batch 0 doesn't need to be loaded. */
+			shared->nstripes = 1;
 			BarrierAttach(&shared->batch_barrier);
-			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_PROBE)
+			while (BarrierPhase(&shared->batch_barrier) < PHJ_BATCH_STRIPE)
 				BarrierArriveAndWait(&shared->batch_barrier, 0);
 			BarrierDetach(&shared->batch_barrier);
+
+			BarrierAttach(&shared->stripe_barrier);
+			while (BarrierPhase(&shared->stripe_barrier) < PHJ_STRIPE_PROBING)
+				BarrierArriveAndWait(&shared->stripe_barrier, 0);
+			BarrierDetach(&shared->stripe_barrier);
 		}
+		/* why isn't done initialized here ? */
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 
 		/* Initialize accessor state.  All members were zero-initialized. */
 		accessor->shared = shared;
@@ -3043,7 +3544,7 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 			sts_initialize(ParallelHashJoinBatchInner(shared),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
@@ -3053,10 +3554,14 @@ ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
 													  pstate->nparticipants),
 						   pstate->nparticipants,
 						   ParallelWorkerNumber + 1,
-						   sizeof(uint32),
+						   sizeof(tupleMetadata),
 						   SHARED_TUPLESTORE_SINGLE_PASS,
 						   &pstate->fileset,
 						   name);
+		snprintf(sbname, MAXPGPATH, "%s.bitmaps", name);
+		/* Use the same SharedFileset for the SharedTupleStore and SharedBits */
+		accessor->sba = sb_initialize(sbits, pstate->nparticipants,
+									  ParallelWorkerNumber + 1, &pstate->fileset, sbname);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3128,11 +3633,13 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	{
 		ParallelHashJoinBatchAccessor *accessor = &hashtable->batches[i];
 		ParallelHashJoinBatch *shared = NthParallelHashJoinBatch(batches, i);
+		SharedBits *sbits = ParallelHashJoinBatchOuterBits(shared, pstate->nparticipants);
 
 		accessor->shared = shared;
 		accessor->preallocated = 0;
 		accessor->done = false;
 		accessor->outer_eof = false;
+		accessor->done = PHJ_BATCH_ACCESSOR_NOT_DONE;
 		accessor->inner_tuples =
 			sts_attach(ParallelHashJoinBatchInner(shared),
 					   ParallelWorkerNumber + 1,
@@ -3142,6 +3649,7 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 												  pstate->nparticipants),
 					   ParallelWorkerNumber + 1,
 					   &pstate->fileset);
+		accessor->sba = sb_attach(sbits, ParallelWorkerNumber + 1, &pstate->fileset);
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -3259,6 +3767,18 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 	}
 }
 
+bool
+ExecHashTableDetachStripe(HashJoinTable hashtable)
+{
+	int			curbatch = hashtable->curbatch;
+	ParallelHashJoinBatch *batch = hashtable->batches[curbatch].shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+
+	BarrierDetach(stripe_barrier);
+	hashtable->curstripe = STRIPE_DETACHED;
+	return false;
+}
+
 /*
  * Detach from all shared resources.  If we are last to detach, clean up.
  */
@@ -3429,7 +3949,6 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 	ParallelHashJoinBatchAccessor *batch = &hashtable->batches[batchno];
 	size_t		want = Max(size, HASH_CHUNK_SIZE - HASH_CHUNK_HEADER_SIZE);
 
-	Assert(batchno > 0);
 	Assert(batchno < hashtable->nbatch);
 	Assert(size == MAXALIGN(size));
 
@@ -3437,7 +3956,8 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 
 	/* Has another participant commanded us to help grow? */
 	if (pstate->growth == PHJ_GROWTH_NEED_MORE_BATCHES ||
-		pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
+		pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS ||
+		pstate->growth == PHJ_GROWTH_SPILL_BATCH0)
 	{
 		ParallelHashGrowth growth = pstate->growth;
 
@@ -3446,18 +3966,21 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 			ExecParallelHashIncreaseNumBatches(hashtable);
 		else if (growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
 			ExecParallelHashIncreaseNumBuckets(hashtable);
+		else if (growth == PHJ_GROWTH_SPILL_BATCH0)
+			ExecParallelHashTableEvictBatch0(hashtable);
 
 		return false;
 	}
 
 	if (pstate->growth != PHJ_GROWTH_DISABLED &&
 		batch->at_least_one_chunk &&
-		(batch->shared->estimated_size + want + HASH_CHUNK_HEADER_SIZE
-		 > pstate->space_allowed))
+		(batch->shared->estimated_size + want + HASH_CHUNK_HEADER_SIZE > pstate->space_allowed) &&
+		!batch->shared->hashloop_fallback)
 	{
 		/*
 		 * We have determined that this batch would exceed the space budget if
-		 * loaded into memory.  Command all participants to help repartition.
+		 * loaded into memory.  It is also not yet marked as a fallback batch.
+		 * Command all participants to help repartition.
 		 */
 		batch->shared->space_exhausted = true;
 		pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 6c3009fba0f..ba04a5a3eff 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -142,6 +142,27 @@
  * hash_mem of all participants to create a large shared hash table.  If that
  * turns out either at planning or execution time to be impossible then we
  * fall back to regular hash_mem sized hash tables.
+ * If a given batch causes the number of batches to be doubled and data skew
+ * causes too few or too many tuples to be relocated to the child of this batch,
+ * the batch which is now home to the skewed tuples is marked as a "fallback"
+ * batch. This means that it will be processed using multiple loops --
+ * each loop probing an arbitrary stripe of tuples from this batch
+ * which fit in hash_mem or combined hash_mem.
+ * This batch is no longer permitted to cause growth in the number of batches.
+ *
+ * When the inner side of a fallback batch is loaded into memory, stripes of
+ * arbitrary tuples totaling hash_mem or combined hash_mem in size are loaded
+ * into the hashtable. After probing this stripe, the outer side batch is
+ * rewound and the next stripe is loaded. Each stripe of the inner batch is
+ * probed until all tuples from that batch have been processed.
+ *
+ * Tuples that match are emitted (depending on the join semantics of the
+ * particular join type) during probing of the stripe. However, in order to make
+ * left outer join work, unmatched tuples cannot be emitted NULL-extended until
+ * all stripes have been probed. To address this, a bitmap is created with a bit
+ * for each tuple of the outer side. If a tuple on the outer side matches a
+ * tuple from the inner, the corresponding bit is set. At the end of probing all
+ * stripes, the executor scans the bitmap and emits unmatched outer tuples.
  *
  * To avoid deadlocks, we never wait for any barrier unless it is known that
  * all other backends attached to it are actively executing the node or have
@@ -182,7 +203,7 @@
 #define HJ_SCAN_BUCKET			3
 #define HJ_FILL_OUTER_TUPLE		4
 #define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
+#define HJ_NEED_NEW_STRIPE      6
 
 /* Returns true if doing null-fill on outer relation */
 #define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
@@ -199,10 +220,99 @@ static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 												 BufFile *file,
 												 uint32 *hashvalue,
 												 TupleTableSlot *tupleSlot);
+static int	ExecHashJoinLoadStripe(HashJoinState *hjstate);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate);
+static bool ExecParallelHashJoinLoadStripe(HashJoinState *hjstate);
+static void ExecParallelHashJoinPartitionOuter(HashJoinState *node);
+static bool checkbit(HashJoinState *hjstate);
+static void set_match_bit(HashJoinState *hjstate);
+
+static pg_attribute_always_inline bool
+			IsHashloopFallback(HashJoinTable hashtable);
+
+#define UINT_BITS (sizeof(unsigned int) * CHAR_BIT)
+
+static void
+set_match_bit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	BufFile    *statusFile = hashtable->hashloopBatchFile[hashtable->curbatch];
+	int			tupindex = hjstate->hj_CurNumOuterTuples - 1;
+	size_t		unit_size = sizeof(hjstate->hj_CurOuterMatchStatus);
+	off_t		offset = tupindex / UINT_BITS * unit_size;
+
+	int			fileno;
+	off_t		cursor;
+
+	BufFileTell(statusFile, &fileno, &cursor);
+
+	/* Extend the statusFile if this is stripe zero. */
+	if (hashtable->curstripe == 0)
+	{
+		for (; cursor < offset + unit_size; cursor += unit_size)
+		{
+			hjstate->hj_CurOuterMatchStatus = 0;
+			BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+		}
+	}
+
+	if (cursor != offset)
+		BufFileSeek(statusFile, 0, offset, SEEK_SET);
+
+	if (BufFileRead(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+	BufFileSeek(statusFile, 0, -unit_size, SEEK_CUR);
+
+	hjstate->hj_CurOuterMatchStatus |= 1U << tupindex % UINT_BITS;
+	BufFileWrite(statusFile, &hjstate->hj_CurOuterMatchStatus, unit_size);
+}
+
+/* return true if bit is set and false if not */
+static bool
+checkbit(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *outer_match_statuses;
+
+	int			bitno = hjstate->hj_EmitOuterTupleId % UINT_BITS;
+
+	hjstate->hj_EmitOuterTupleId++;
+	outer_match_statuses = hjstate->hj_HashTable->hashloopBatchFile[curbatch];
+
+	/*
+	 * if current chunk of bitmap is exhausted, read next chunk of bitmap from
+	 * outer_match_status_file
+	 */
+	if (bitno == 0 && BufFileRead(outer_match_statuses, &hjstate->hj_CurOuterMatchStatus,
+					sizeof(hjstate->hj_CurOuterMatchStatus)) == 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg(
+						"could not read byte in outer match status bitmap: %m.")));
+
+	/*
+	 * check if current tuple's match bit is set in outer match status file
+	 */
+	return hjstate->hj_CurOuterMatchStatus & (1U << bitno);
+}
+
+static bool
+IsHashloopFallback(HashJoinTable hashtable)
+{
+	if (hashtable->parallel_state)
+		return hashtable->batches[hashtable->curbatch].shared->hashloop_fallback;
+
+	if (!hashtable->hashloopBatchFile)
+		return false;
 
+	return hashtable->hashloopBatchFile[hashtable->curbatch];
+}
 
 /* ----------------------------------------------------------------
  *		ExecHashJoinImpl
@@ -343,6 +453,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				hashNode->hashtable = hashtable;
 				(void) MultiExecProcNode((PlanState *) hashNode);
 
+				/*
+				 * After building the hashtable, stripe 0 of batch 0 will have
+				 * been loaded.
+				 */
+				hashtable->curstripe = 0;
+
 				/*
 				 * If the inner relation is completely empty, and we're not
 				 * doing a left outer join, we can quit without scanning the
@@ -391,7 +507,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						 * If multi-batch, we need to hash the outer relation
 						 * up front.
 						 */
-						if (hashtable->nbatch > 1)
+						if (hashtable->nbatch > 1 || (hashtable->nbatch == 1 && hashtable->batches[0].shared->hashloop_fallback))
 							ExecParallelHashJoinPartitionOuter(node);
 						BarrierArriveAndWait(build_barrier,
 											 WAIT_EVENT_HASH_BUILD_HASH_OUTER);
@@ -409,12 +525,10 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					/* Each backend should now select a batch to work on. */
 					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_RUN);
 					hashtable->curbatch = -1;
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
-
-					continue;
+					if (!ExecParallelHashJoinNewBatch(node))
+						return NULL;
 				}
-				else
-					node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
 				/* FALL THRU */
 
@@ -447,7 +561,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 							if (ExecParallelPrepHashTableForUnmatched(node))
 								node->hj_JoinState = HJ_FILL_INNER_TUPLES;
 							else
-								node->hj_JoinState = HJ_NEED_NEW_BATCH;
+								node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 						}
 						else
 						{
@@ -456,12 +570,18 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 						}
 					}
 					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
+						node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
 				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
+
+				/*
+				 * Don't reset hj_MatchedOuter after the first stripe as it
+				 * would cancel out whatever we found before
+				 */
+				if (node->hj_HashTable->curstripe == 0)
+					node->hj_MatchedOuter = false;
 
 				/*
 				 * Find the corresponding bucket for this tuple in the main
@@ -477,9 +597,15 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				/*
 				 * The tuple might not belong to the current batch (where
 				 * "current batch" includes the skew buckets if any).
+				 *
+				 * This should only be done once per tuple per batch. If a
+				 * batch "falls back", its inner side will be split into
+				 * stripes. Any displaced outer tuples should only be
+				 * relocated while probing the first stripe of the inner side.
 				 */
 				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO &&
+					node->hj_HashTable->curstripe == 0)
 				{
 					bool		shouldFree;
 					MinimalTuple mintuple = ExecFetchSlotMinimalTuple(outerTupleSlot,
@@ -502,6 +628,13 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					continue;
 				}
 
+				/*
+				 * While probing the phantom stripe, don't increment
+				 * hj_CurNumOuterTuples or extend the bitmap
+				 */
+				if (!parallel && hashtable->curstripe != PHANTOM_STRIPE)
+					node->hj_CurNumOuterTuples++;
+
 				/* OK, let's scan the bucket for matches */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
@@ -562,35 +695,80 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					 */
 					if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
 						HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+					if (HJ_FILL_OUTER(node) && IsHashloopFallback(hashtable))
+					{
+						/*
+						 * Each bit corresponds to a single tuple. Setting the
+						 * match bit keeps track of which tuples were matched
+						 * for batches which are using the block nested
+						 * hashloop fallback method. It persists this match
+						 * status across multiple stripes of tuples, each of
+						 * which is loaded into the hashtable and probed. The
+						 * outer match status file is the cumulative match
+						 * status of outer tuples for a given batch across all
+						 * stripes of that inner side batch.
+						 */
+						if (parallel)
+							sb_setbit(hashtable->batches[hashtable->curbatch].sba, econtext->ecxt_outertuple->tts_tuplenum);
+						else
+							set_match_bit(node);
+					}
 
-					/* In an antijoin, we never return a matched tuple */
-					if (node->js.jointype == JOIN_ANTI)
+					if (parallel)
 					{
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
-						continue;
+						/*
+						 * Full/right outer joins are currently not supported
+						 * for parallel joins, so we don't need to set the
+						 * match bit.  Experiments show that it's worth
+						 * avoiding the shared memory traffic on large
+						 * systems.
+						 */
+						Assert(!HJ_FILL_INNER(node));
 					}
+					else
+					{
+						/*
+						 * This is really only needed if HJ_FILL_INNER(node),
+						 * but we'll avoid the branch and just set it always.
+						 */
 
-					/*
-					 * If we only need to consider the first matching inner
-					 * tuple, then advance to next outer tuple after we've
-					 * processed this one.
-					 */
-					if (node->js.single_match)
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
+						/* In an antijoin, we never return a matched tuple */
+						if (node->js.jointype == JOIN_ANTI)
+						{
+							node->hj_JoinState = HJ_NEED_NEW_OUTER;
+							continue;
+						}
 
-					/*
-					 * In a right-antijoin, we never return a matched tuple.
-					 * If it's not an inner_unique join, we need to stay on
-					 * the current outer tuple to continue scanning the inner
-					 * side for matches.
-					 */
-					if (node->js.jointype == JOIN_RIGHT_ANTI)
-						continue;
+						/*
+						* If we only need to consider the first matching inner
+						* tuple, then advance to next outer tuple after we've
+						* processed this one.
+						*/
+						if (node->js.single_match)
+						{
+							node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
-					if (otherqual == NULL || ExecQual(otherqual, econtext))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
+							/*
+							* In a right-antijoin, we never return a matched tuple.
+							* If it's not an inner_unique join, we need to stay on
+							* the current outer tuple to continue scanning the inner
+							* side for matches.
+							*/
+							if (node->js.jointype == JOIN_RIGHT_ANTI)
+								continue;
+							/*
+							* Only consider returning the tuple while on the
+							* first stripe.
+							*/
+							if (node->hj_HashTable->curstripe != 0)
+								continue;
+						}
+
+						if (otherqual == NULL || ExecQual(otherqual, econtext))
+							return ExecProject(node->js.ps.ps_ProjInfo);
+						else
+							InstrCountFiltered2(node, 1);
+					}
 				}
 				else
 					InstrCountFiltered1(node, 1);
@@ -605,6 +783,22 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 */
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
+				if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(node))
+				{
+					if (hashtable->curstripe != PHANTOM_STRIPE)
+						continue;
+
+					if (parallel)
+					{
+						ParallelHashJoinBatchAccessor *accessor =
+						&node->hj_HashTable->batches[node->hj_HashTable->curbatch];
+
+						node->hj_MatchedOuter = sb_checkbit(accessor->sba, econtext->ecxt_outertuple->tts_tuplenum);
+					}
+					else
+						node->hj_MatchedOuter = checkbit(node);
+				}
+
 				if (!node->hj_MatchedOuter &&
 					HJ_FILL_OUTER(node))
 				{
@@ -633,7 +827,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					  : ExecScanHashTableForUnmatched(node, econtext)))
 				{
 					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
+					node->hj_JoinState = HJ_NEED_NEW_STRIPE;
 					continue;
 				}
 
@@ -649,19 +843,23 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					InstrCountFiltered2(node, 1);
 				break;
 
-			case HJ_NEED_NEW_BATCH:
+			case HJ_NEED_NEW_STRIPE:
 
 				/*
-				 * Try to advance to next batch.  Done if there are no more.
+				 * Try to advance to next stripe. Then try to advance to the
+				 * next batch if there are no more stripes in this batch. Done
+				 * if there are no more batches.
 				 */
 				if (parallel)
 				{
-					if (!ExecParallelHashJoinNewBatch(node))
+					if (!ExecParallelHashJoinLoadStripe(node) &&
+						!ExecParallelHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-aware join */
 				}
 				else
 				{
-					if (!ExecHashJoinNewBatch(node))
+					if (!ExecHashJoinLoadStripe(node) &&
+						!ExecHashJoinNewBatch(node))
 						return NULL;	/* end of parallel-oblivious join */
 				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
@@ -934,6 +1132,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->hj_JoinState = HJ_BUILD_HASHTABLE;
 	hjstate->hj_MatchedOuter = false;
 	hjstate->hj_OuterNotEmpty = false;
+	hjstate->hj_CurNumOuterTuples = 0;
+	hjstate->hj_CurOuterMatchStatus = 0;
 
 	return hjstate;
 }
@@ -1066,10 +1266,16 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	/*
 	 * In the Parallel Hash case we only run the outer plan directly for
 	 * single-batch hash joins.  Otherwise we have to go to batch files, even
-	 * for batch 0.
+	 * for batch 0. For a single-batch hash join which, due to data skew, has
+	 * multiple stripes and is a "fallback" batch, we must still save the
+	 * outer tuples into batch files.
 	 */
-	if (curbatch == 0 && hashtable->nbatch == 1)
+	LWLockAcquire(&hashtable->batches[0].shared->lock, LW_SHARED);
+
+	if (curbatch == 0 && hashtable->nbatch == 1 && !hashtable->batches[0].shared->hashloop_fallback)
 	{
+		LWLockRelease(&hashtable->batches[0].shared->lock);
+
 		slot = ExecProcNode(outerNode);
 
 		while (!TupIsNull(slot))
@@ -1098,21 +1304,36 @@ ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 	}
 	else if (curbatch < hashtable->nbatch)
 	{
+
+		tupleMetadata metadata;
 		MinimalTuple tuple;
 
-		tuple = sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
-									   hashvalue);
+		LWLockRelease(&hashtable->batches[0].shared->lock);
+
+		tuple =
+			sts_parallel_scan_next(hashtable->batches[curbatch].outer_tuples,
+								   &metadata);
+		*hashvalue = metadata.hashvalue;
+
 		if (tuple != NULL)
 		{
 			ExecForceStoreMinimalTuple(tuple,
 									   hjstate->hj_OuterTupleSlot,
 									   false);
+
+			/*
+			 * TODO: should we use tupleid instead of position in the serial
+			 * case too?
+			 */
+			hjstate->hj_OuterTupleSlot->tts_tuplenum = metadata.tupleid;
 			slot = hjstate->hj_OuterTupleSlot;
 			return slot;
 		}
 		else
 			ExecClearTuple(hjstate->hj_OuterTupleSlot);
 	}
+	else
+		LWLockRelease(&hashtable->batches[0].shared->lock);
 
 	/* End of this batch */
 	hashtable->batches[curbatch].outer_eof = true;
@@ -1132,24 +1353,37 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	int			nbatch;
 	int			curbatch;
-	BufFile    *innerFile;
-	TupleTableSlot *slot;
-	uint32		hashvalue;
+	BufFile    *innerFile = NULL;
+	BufFile    *outerFile = NULL;
 
 	nbatch = hashtable->nbatch;
 	curbatch = hashtable->curbatch;
 
-	if (curbatch > 0)
+	/*
+	 * We no longer need the previous outer batch file; close it right away to
+	 * free disk space.
+	 */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
-		/*
-		 * We no longer need the previous outer batch file; close it right
-		 * away to free disk space.
-		 */
-		if (hashtable->outerBatchFile[curbatch])
-			BufFileClose(hashtable->outerBatchFile[curbatch]);
+		BufFileClose(hashtable->outerBatchFile[curbatch]);
 		hashtable->outerBatchFile[curbatch] = NULL;
 	}
-	else						/* we just finished the first batch */
+	if (IsHashloopFallback(hashtable))
+	{
+		BufFileClose(hashtable->hashloopBatchFile[curbatch]);
+		hashtable->hashloopBatchFile[curbatch] = NULL;
+	}
+
+	/*
+	 * We are surely done with the inner batch file now
+	 */
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+	{
+		BufFileClose(hashtable->innerBatchFile[curbatch]);
+		hashtable->innerBatchFile[curbatch] = NULL;
+	}
+
+	if (curbatch == 0)			/* we just finished the first batch */
 	{
 		/*
 		 * Reset some of the skew optimization state variables, since we no
@@ -1214,55 +1448,169 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		return false;			/* no more batches */
 
 	hashtable->curbatch = curbatch;
+	hashtable->curstripe = STRIPE_DETACHED;
+	hjstate->hj_CurNumOuterTuples = 0;
 
-	/*
-	 * Reload the hash table with the new inner batch (which could be empty)
-	 */
-	ExecHashTableReset(hashtable);
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch])
+		innerFile = hashtable->innerBatchFile[curbatch];
+
+	if (innerFile && BufFileSeek(innerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
+
+	/* Need to rewind outer when this is the first stripe of a new batch */
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
+		outerFile = hashtable->outerBatchFile[curbatch];
 
-	innerFile = hashtable->innerBatchFile[curbatch];
+	if (outerFile && BufFileSeek(outerFile, 0, 0L, SEEK_SET))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not rewind hash-join temporary file: %m")));
 
-	if (innerFile != NULL)
+	ExecHashJoinLoadStripe(hjstate);
+	return true;
+}
+
+static inline void
+InstrIncrBatchStripes(List *fallback_batches_stats, int curbatch)
+{
+	ListCell   *lc;
+
+	foreach(lc, fallback_batches_stats)
 	{
-		if (BufFileSeek(innerFile, 0, 0, SEEK_SET))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file")));
+		FallbackBatchStats *fallback_batch_stats;
+		fallback_batch_stats = lfirst(lc);
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
-												 &hashvalue,
-												 hjstate->hj_HashTupleSlot)))
+		if (fallback_batch_stats->batchno == curbatch)
 		{
-			/*
-			 * NOTE: some tuples may be sent to future batches.  Also, it is
-			 * possible for hashtable->nbatch to be increased here!
-			 */
-			ExecHashTableInsert(hashtable, slot, hashvalue);
+			fallback_batch_stats->numstripes++;
+			break;
 		}
-
-		/*
-		 * after we build the hash table, the inner batch file is no longer
-		 * needed
-		 */
-		BufFileClose(innerFile);
-		hashtable->innerBatchFile[curbatch] = NULL;
 	}
+}
+
+static inline void
+InstrAppendParallelBatchStripes(List **fallback_batches_stats, int curbatch, int nstripes)
+{
+	FallbackBatchStats *fallback_batch_stats;
+
+	fallback_batch_stats = palloc(sizeof(FallbackBatchStats));
+	fallback_batch_stats->batchno = curbatch;
+	/* Display the total number of stripes as a 1-indexed number */
+	fallback_batch_stats->numstripes = nstripes + 1;
+	*fallback_batches_stats = lappend(*fallback_batches_stats, fallback_batch_stats);
+}
+
+/*
+ * Returns false when the inner batch file is exhausted
+ */
+static int
+ExecHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	TupleTableSlot *slot;
+	uint32		hashvalue;
+	bool		loaded_inner = false;
+
+	if (hashtable->curstripe == PHANTOM_STRIPE)
+		return false;
 
 	/*
 	 * Rewind outer batch file (if present), so that we can start reading it.
+	 * TODO: This is only necessary if this is not the first stripe of the
+	 * batch
 	 */
-	if (hashtable->outerBatchFile[curbatch] != NULL)
+	if (hashtable->outerBatchFile && hashtable->outerBatchFile[curbatch])
 	{
 		if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0, SEEK_SET))
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not rewind hash-join temporary file")));
+					 errmsg("could not rewind hash-join temporary file: %m")));
+	}
+	if (hashtable->innerBatchFile && hashtable->innerBatchFile[curbatch] && hashtable->curbatch == 0 && hashtable->curstripe == 0)
+	{
+		if (BufFileSeek(hashtable->innerBatchFile[curbatch], 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind hash-join temporary file: %m")));
 	}
 
-	return true;
+	hashtable->curstripe++;
+
+	if (!hashtable->innerBatchFile || !hashtable->innerBatchFile[curbatch])
+		return false;
+
+	/*
+	 * Reload the hash table with the new inner stripe
+	 */
+	ExecHashTableReset(hashtable);
+
+	while ((slot = ExecHashJoinGetSavedTuple(hjstate,
+											 hashtable->innerBatchFile[curbatch],
+											 &hashvalue,
+											 hjstate->hj_HashTupleSlot)))
+	{
+		/*
+		 * NOTE: some tuples may be sent to future batches.  Also, it is
+		 * possible for hashtable->nbatch to be increased here!
+		 */
+		uint32		hashTupleSize;
+
+		/*
+		 * TODO: wouldn't it be cool if this returned the size of the tuple
+		 * inserted
+		 */
+		ExecHashTableInsert(hashtable, slot, hashvalue);
+		loaded_inner = true;
+
+		if (!IsHashloopFallback(hashtable))
+			continue;
+
+		hashTupleSize = slot->tts_ops->get_minimal_tuple(slot)->t_len + HJTUPLE_OVERHEAD;
+
+		if (hashtable->spaceUsed + hashTupleSize +
+			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
+			> hashtable->spaceAllowed)
+			break;
+	}
+
+	/*
+	 * if we didn't load anything and it is a FOJ/LOJ fallback batch, we will
+	 * transition to emit unmatched outer tuples next. we want to know how
+	 * many tuples were in the batch in that case, so don't zero it out then
+	 */
+
+	/*
+	 * if we loaded anything into the hashtable or it is the phantom stripe,
+	 * must proceed to probing
+	 */
+	if (loaded_inner)
+	{
+		hjstate->hj_CurNumOuterTuples = 0;
+		InstrIncrBatchStripes(hashtable->fallback_batches_stats, curbatch);
+		return true;
+	}
+
+	if (IsHashloopFallback(hashtable) && HJ_FILL_OUTER(hjstate))
+	{
+		/*
+		 * if we didn't load anything and it is a fallback batch, we will
+		 * prepare to emit outer tuples during the phantom stripe probing
+		 */
+		hashtable->curstripe = PHANTOM_STRIPE;
+		hjstate->hj_EmitOuterTupleId = 0;
+		hjstate->hj_CurOuterMatchStatus = 0;
+		BufFileSeek(hashtable->hashloopBatchFile[curbatch], 0, 0, SEEK_SET);
+		if (hashtable->outerBatchFile[curbatch])
+			BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET);
+		return true;
+	}
+	return false;
 }
 
+
 /*
  * Choose a batch to work on, and attach to it.  Returns true if successful,
  * false if there are no more batches.
@@ -1277,11 +1625,24 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	/*
 	 * If we were already attached to a batch, remember not to bother checking
 	 * it again, and detach from it (possibly freeing the hash table if we are
-	 * last to detach).
+	 * last to detach). curbatch is set when the batch_barrier phase is either
+	 * PHJ_BATCH_LOADING or PHJ_BATCH_STRIPING (note that the
+	 * PHJ_BATCH_LOADING case will fall through to the PHJ_BATCH_STRIPING
+	 * case). The PHJ_BATCH_STRIPING case returns to the caller. So when this
+	 * function is reentered with a curbatch >= 0 then we must be done
+	 * probing.
 	 */
+
 	if (hashtable->curbatch >= 0)
 	{
-		hashtable->batches[hashtable->curbatch].done = true;
+		ParallelHashJoinBatchAccessor *batch_accessor = &hashtable->batches[hashtable->curbatch];
+
+		if (IsHashloopFallback(hashtable))
+		{
+			InstrAppendParallelBatchStripes(&hashtable->fallback_batches_stats, hashtable->curbatch, batch_accessor->shared->nstripes);
+			sb_end_write(hashtable->batches[hashtable->curbatch].sba);
+		}
+		batch_accessor->done = PHJ_BATCH_ACCESSOR_DONE;
 		ExecHashTableDetachBatch(hashtable);
 	}
 
@@ -1295,13 +1656,8 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		hashtable->nbatch;
 	do
 	{
-		uint32		hashvalue;
-		MinimalTuple tuple;
-		TupleTableSlot *slot;
-
-		if (!hashtable->batches[batchno].done)
+		if (hashtable->batches[batchno].done != PHJ_BATCH_ACCESSOR_DONE)
 		{
-			SharedTuplestoreAccessor *inner_tuples;
 			Barrier    *batch_barrier =
 				&hashtable->batches[batchno].shared->batch_barrier;
 
@@ -1312,51 +1668,48 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
 											 WAIT_EVENT_HASH_BATCH_ELECT))
+					{
 						ExecParallelHashTableAlloc(hashtable, batchno);
+
+						/*
+						 * one worker needs to 0 out the read_pages of all the
+						 * participants in the new batch
+						 */
+						sts_reinitialize(hashtable->batches[batchno].inner_tuples);
+					}
 					/* Fall through. */
 
-				case PHJ_BATCH_ALLOCATE:
+				case PHJ_BUILD_ALLOCATE:
 					/* Wait for allocation to complete. */
 					BarrierArriveAndWait(batch_barrier,
 										 WAIT_EVENT_HASH_BATCH_ALLOCATE);
 					/* Fall through. */
 
-				case PHJ_BATCH_LOAD:
-					/* Start (or join in) loading tuples. */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					inner_tuples = hashtable->batches[batchno].inner_tuples;
-					sts_begin_parallel_scan(inner_tuples);
-					while ((tuple = sts_parallel_scan_next(inner_tuples,
-														   &hashvalue)))
-					{
-						ExecForceStoreMinimalTuple(tuple,
-												   hjstate->hj_HashTupleSlot,
-												   false);
-						slot = hjstate->hj_HashTupleSlot;
-						ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
-																hashvalue);
-					}
-					sts_end_parallel_scan(inner_tuples);
-					BarrierArriveAndWait(batch_barrier,
-										 WAIT_EVENT_HASH_BATCH_LOAD);
-					/* Fall through. */
+				case PHJ_BATCH_STRIPE:
 
-				case PHJ_BATCH_PROBE:
+					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
+					sts_begin_parallel_scan(hashtable->batches[batchno].inner_tuples);
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_initialize_accessor(hashtable->batches[hashtable->curbatch].sba,
+											   sts_get_tuplenum(hashtable->batches[hashtable->curbatch].outer_tuples));
+					hashtable->curstripe = STRIPE_DETACHED;
+					if (ExecParallelHashJoinLoadStripe(hjstate))
+						return true;
 
 					/*
-					 * This batch is ready to probe.  Return control to
-					 * caller. We stay attached to batch_barrier so that the
-					 * hash table stays alive until everyone's finished
-					 * probing it, but no participant is allowed to wait at
-					 * this barrier again (or else a deadlock could occur).
-					 * All attached participants must eventually detach from
-					 * the barrier and one worker must advance the phase so
-					 * that the final phase is reached.
+					 * ExecParallelHashJoinLoadStripe() will return false from
+					 * here when no more work can be done by this worker on
+					 * this batch. Until further optimized, this worker will
+					 * have detached from the stripe_barrier and should close
+					 * its outer match statuses bitmap and then detach from
+					 * the batch. In order to reuse the code below, fall
+					 * through, even though the phase will not have been
+					 * advanced
 					 */
-					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
-					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						sb_end_write(hashtable->batches[batchno].sba);
 
-					return true;
+					/* Fall through. */
 				case PHJ_BATCH_SCAN:
 
 					/*
@@ -1382,8 +1735,16 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * Already done.  Detach and go around again (if any
 					 * remain).
 					 */
+
+					/*
+					 * In case the leader joins late, we have to make sure
+					 * that all workers have the final number of stripes.
+					 */
+					if (hashtable->batches[batchno].shared->hashloop_fallback)
+						InstrAppendParallelBatchStripes(&hashtable->fallback_batches_stats, batchno, hashtable->batches[batchno].shared->nstripes);
 					BarrierDetach(batch_barrier);
-					hashtable->batches[batchno].done = true;
+					hashtable->batches[batchno].done = PHJ_BATCH_ACCESSOR_DONE;
+
 					hashtable->curbatch = -1;
 					break;
 
@@ -1398,6 +1759,244 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 	return false;
 }
 
+
+
+/*
+ * Returns true if ready to probe and false if the inner is exhausted
+ * (there are no more stripes)
+ */
+bool
+ExecParallelHashJoinLoadStripe(HashJoinState *hjstate)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			batchno = hashtable->curbatch;
+	ParallelHashJoinBatchAccessor *batch_accessor = &(hashtable->batches[batchno]);
+	ParallelHashJoinBatch *batch = batch_accessor->shared;
+	Barrier    *stripe_barrier = &batch->stripe_barrier;
+	SharedTuplestoreAccessor *outer_tuples;
+	SharedTuplestoreAccessor *inner_tuples;
+
+	outer_tuples = hashtable->batches[batchno].outer_tuples;
+	inner_tuples = hashtable->batches[batchno].inner_tuples;
+
+	if (hashtable->curstripe >= 0)
+	{
+		/*
+		 * If a worker is already attached to a stripe, wait until all
+		 * participants have finished probing and detach. The last worker,
+		 * however, can re-attach to the stripe_barrier and proceed to load
+		 * and probe the other stripes
+		 *
+		 * After finishing with participating in a stripe, if a worker is the
+		 * only one working on a batch, it will continue working on it.
+		 * However, if a worker is not the only worker working on a batch, it
+		 * would risk deadlock if it waits on the barrier. Instead, it will
+		 * detach from the stripe, and, eventually the batch.
+		 *
+		 * This means all stripes after the first stripe will be executed
+		 * serially. TODO: allow workers to provisionally detach from the
+		 * batch and reattach later if there is still work to be done. I had a
+		 * patch that did this. Workers who were not the last worker saved the
+		 * state of the stripe barrier upon detaching and then mark the batch
+		 * as "provisionally" done (not done). Later, when the worker comes
+		 * back to the batch in the batch phase machine, if the batch is not
+		 * complete and the phase has advanced since the worker was last
+		 * participating, then the worker can join back in. This had problems.
+		 * There were synchronization issues with workers having multiple
+		 * outer match status bitmap files open at the same time, so, I had
+		 * workers close their bitmap and make a new one the next time they
+		 * joined in. This didn't work with the current code because the
+		 * original outer match status bitmap file that the worker had created
+		 * while probing stripe 1 did not get combined into the combined
+		 * bitmap This could be specifically fixed, but I think it is better
+		 * to address the lack of parallel execution for stripes after stripe
+		 * 0 more holistically.
+		 */
+		if (!BarrierArriveAndDetach(stripe_barrier))
+		{
+			sb_end_write(batch_accessor->sba);
+			hashtable->curstripe = STRIPE_DETACHED;
+			return false;
+		}
+
+		/*
+		 * This isn't a race condition if no other workers can stay attached
+		 * to this barrier in the intervening time. Basically, if you attach
+		 * to a stripe barrier in the PHJ_STRIPE_DONE phase, detach
+		 * immediately and move on.
+		 */
+		BarrierAttach(stripe_barrier);
+	}
+	else if (hashtable->curstripe == STRIPE_DETACHED)
+	{
+		int			phase = BarrierAttach(stripe_barrier);
+
+		/*
+		 * If a worker enters this phase machine for the first time for this
+		 * batch on a stripe number greater than the batch's maximum stripe
+		 * number, then: 1) The batch is done, or 2) The batch is on the
+		 * phantom stripe that's used for hashloop fallback. Either way the
+		 * worker can't contribute, so it will just detach and move on.
+		 */
+		if (PHJ_STRIPE_NUMBER(phase) > batch->nstripes ||
+			PHJ_STRIPE_PHASE(phase) == PHJ_STRIPE_DONE)
+			return ExecHashTableDetachStripe(hashtable);
+	}
+	else if (hashtable->curstripe == PHANTOM_STRIPE)
+	{
+		/* Only the last worker will execute this code. */
+		sts_end_parallel_scan(outer_tuples);
+
+		/*
+		 * TODO: ideally this would go somewhere in the batch phase machine
+		 * Putting it in ExecHashTableDetachBatch didn't do the trick
+		 */
+		sb_end_read(batch_accessor->sba);
+		return ExecHashTableDetachStripe(hashtable);
+	}
+
+	hashtable->curstripe = PHJ_STRIPE_NUMBER(BarrierPhase(stripe_barrier));
+
+	/*
+	 * The outer side is exhausted and either 1) the current stripe of the
+	 * inner side is exhausted and it is time to advance the stripe 2) the
+	 * last stripe of the inner side is exhausted and it is time to advance
+	 * the batch
+	 */
+	for (;;)
+	{
+		MinimalTuple tuple;
+		tupleMetadata metadata;
+
+		bool		overflow_required = false;
+		int			phase = BarrierPhase(stripe_barrier);
+
+		switch (PHJ_STRIPE_PHASE(phase))
+		{
+			case PHJ_STRIPE_ELECTING:
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_ELECT))
+					sts_reinitialize(outer_tuples);
+				/* FALLTHROUGH */
+			case PHJ_STRIPE_RESETTING:
+
+				/*
+				 * This barrier allows the elected worker to finish resetting
+				 * the read_page for the outer side as well as allowing the
+				 * worker which was elected to clear out the hashtable from
+				 * the last stripe to finish.
+				 */
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_RESET);
+				/* FALLTHROUGH */
+			case PHJ_STRIPE_LOADING:
+
+				/*
+				 * Start (or join in) loading the next stripe of inner tuples.
+				 */
+				sts_begin_parallel_scan(inner_tuples);
+
+				/*
+				 * TODO: add functionality to pre-alloc some memory before
+				 * calling sts_parallel_scan_next() because that will reserve
+				 * an additional STS_CHUNK for every stripe for each worker
+				 * that won't fit, so we should first see if the chunk would
+				 * fit before getting the assignment
+				 */
+				while ((tuple = sts_parallel_scan_next(inner_tuples, &metadata)))
+				{
+					ExecForceStoreMinimalTuple(tuple, hjstate->hj_HashTupleSlot, false);
+					if (!ExecParallelHashTableInsertCurrentBatch(hashtable, hjstate->hj_HashTupleSlot, metadata.hashvalue, sta_get_read_participant(inner_tuples)))
+					{
+						overflow_required = true;
+						pg_atomic_test_set_flag(&batch->overflow_required);
+						break;
+					}
+				}
+
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+				{
+					if (!pg_atomic_unlocked_test_flag(&batch->overflow_required))
+						batch->nstripes++;
+				}
+				/* FALLTHROUGH */
+			case PHJ_STRIPE_OVERFLOWING:
+				if (overflow_required)
+				{
+					Assert(tuple);
+					sts_spill_leftover_tuples(inner_tuples, tuple, metadata.hashvalue);
+				}
+				BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_OVERFLOW);
+
+				/* FALLTHROUGH */
+			case PHJ_STRIPE_PROBING:
+				{
+					/*
+					 * do this again here in case a worker began the scan and
+					 * then entered after loading before probing
+					 */
+					sts_end_parallel_scan(inner_tuples);
+					sts_begin_parallel_scan(outer_tuples);
+					return true;
+				}
+
+			case PHJ_STRIPE_DONE:
+				if (PHJ_STRIPE_NUMBER(phase) >= batch->nstripes)
+				{
+					/*
+					 * Handle the phantom stripe case.
+					 */
+					if (batch->hashloop_fallback && HJ_FILL_OUTER(hjstate))
+						goto fallback_stripe;
+
+					/* Return if this is the last stripe */
+					return ExecHashTableDetachStripe(hashtable);
+				}
+
+				/* this, effectively, increments the stripe number */
+				if (BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+				{
+					ExecParallelHashTableRecycle(hashtable);
+					pg_atomic_clear_flag(&batch->overflow_required);
+				}
+
+				hashtable->curstripe++;
+				continue;
+
+			default:
+				elog(ERROR, "unexpected stripe phase %d. pid %i. batch %i.", BarrierPhase(stripe_barrier), MyProcPid, batchno);
+		}
+	}
+
+fallback_stripe:
+	sb_end_write(batch_accessor->sba);
+
+	/* Ensure that only a single worker is attached to the barrier */
+	if (!BarrierArriveAndWait(stripe_barrier, WAIT_EVENT_HASH_STRIPE_LOAD))
+		return ExecHashTableDetachStripe(hashtable);
+
+	/* No one except the last worker will run this code */
+	hashtable->curstripe = PHANTOM_STRIPE;
+
+	ExecParallelHashTableRecycle(hashtable);
+	pg_atomic_clear_flag(&batch->overflow_required);
+
+	/*
+	 * If all workers (including this one) have finished probing the batch,
+	 * one worker is elected to Loop through the outer match status files from
+	 * all workers that were attached to this batch Combine them into one
+	 * bitmap Use the bitmap, loop through the outer batch file again, and
+	 * emit unmatched tuples All workers will detach from the batch barrier
+	 * and the last worker will clean up the hashtable. All workers except the
+	 * last worker will end their scans of the outer and inner side. The last
+	 * worker will end its scan of the inner side
+	 */
+	sb_combine(batch_accessor->sba);
+	sts_reinitialize(outer_tuples);
+
+	sts_begin_parallel_scan(outer_tuples);
+
+	return true;
+}
+
 /*
  * ExecHashJoinSaveTuple
  *		save a tuple to a batch file.
@@ -1570,6 +2169,9 @@ ExecReScanHashJoin(HashJoinState *node)
 	node->hj_MatchedOuter = false;
 	node->hj_FirstOuterTupleSlot = NULL;
 
+	node->hj_CurNumOuterTuples = 0;
+	node->hj_CurOuterMatchStatus = 0;
+
 	/*
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
@@ -1600,7 +2202,6 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
 	HashJoinTable hashtable = hjstate->hj_HashTable;
 	TupleTableSlot *slot;
-	uint32		hashvalue;
 	int			i;
 
 	Assert(hjstate->hj_FirstOuterTupleSlot == NULL);
@@ -1609,6 +2210,7 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 	for (;;)
 	{
 		bool		isnull;
+		tupleMetadata metadata;
 
 		slot = ExecProcNode(outerState);
 		if (TupIsNull(slot))
@@ -1616,8 +2218,7 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 		econtext->ecxt_outertuple = slot;
 
 		ResetExprContext(econtext);
-
-		hashvalue = DatumGetUInt32(ExecEvalExprSwitchContext(hjstate->hj_OuterHash,
+		metadata.hashvalue = DatumGetUInt32(ExecEvalExprSwitchContext(hjstate->hj_OuterHash,
 															 econtext,
 															 &isnull));
 
@@ -1626,12 +2227,20 @@ ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate)
 			int			batchno;
 			int			bucketno;
 			bool		shouldFree;
+			SharedTuplestoreAccessor *accessor;
+
 			MinimalTuple mintup = ExecFetchSlotMinimalTuple(slot, &shouldFree);
 
-			ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno,
+			ExecHashGetBucketAndBatch(hashtable, metadata.hashvalue, &bucketno,
 									  &batchno);
+			accessor = hashtable->batches[batchno].outer_tuples;
+
+			/* cannot count on deterministic order of tupleids */
+			metadata.tupleid = sts_increment_ntuples(accessor);
+
 			sts_puttuple(hashtable->batches[batchno].outer_tuples,
-						 &hashvalue, mintup);
+						 &metadata.hashvalue,
+						 mintup);
 
 			if (shouldFree)
 				heap_free_minimal_tuple(mintup);
@@ -1692,6 +2301,8 @@ ExecHashJoinInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	LWLockInitialize(&pstate->lock,
 					 LWTRANCHE_PARALLEL_HASH_JOIN);
 	BarrierInit(&pstate->build_barrier, 0);
+	BarrierInit(&pstate->eviction_barrier, 0);
+	BarrierInit(&pstate->repartition_barrier, 0);
 	BarrierInit(&pstate->grow_batches_barrier, 0);
 	BarrierInit(&pstate->grow_buckets_barrier, 0);
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72d..63acf623b7d 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -117,11 +117,20 @@ CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
-HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
+HASH_STRIPE_ELECT	"Waiting to elect a stripe of arbitrary tuples of a Parallel Hash participant."
+HASH_STRIPE_RESET	"Waiting to reset a stripe of arbitrary tuples of a Parallel Hash participant."
+HASH_STRIPE_LOAD	"Waiting for other Parallel Hash participants to finish loading a stripe of arbitrary tuples into a hash table."
+HASH_STRIPE_OVERFLOW	"Waiting for an event when a stripe of arbitrary tuples of a Parallel Hash participant will be overflowed."
+HASH_STRIPE_PROBE	"Waiting for a probing of a stripe of a Parallel Hash participant."
 HASH_BUILD_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate the initial hash table."
 HASH_BUILD_ELECT	"Waiting to elect a Parallel Hash participant to allocate the initial hash table."
 HASH_BUILD_HASH_INNER	"Waiting for other Parallel Hash participants to finish hashing the inner relation."
 HASH_BUILD_HASH_OUTER	"Waiting for other Parallel Hash participants to finish partitioning the outer relation."
+HASH_EVICT_ELECT	"Waiting to elect Parallel Hash participant to evict a stripe of arbitrary tuples of Parallel Hash participant."
+HASH_EVICT_RESET	"Waiting to reset Parallel Hash participant to evict a stripe of arbitrary tuples of Parallel Hash participant."
+HASH_EVICT_SPILL	"Waiting to spill Parallel Hash participant to evict a stripe of arbitrary tuples of Parallel Hash participant."
+HASH_EVICT_FINISH	"Waiting to finish Parallel Hash participant to evict a stripe of arbitrary tuples of Parallel Hash participant."
+HASH_REPARTITION_BATCH0_DRAIN_QUEUE	"Waiting for an repartition of batch."
 HASH_GROW_BATCHES_DECIDE	"Waiting to elect a Parallel Hash participant to decide on future batch growth."
 HASH_GROW_BATCHES_ELECT	"Waiting to elect a Parallel Hash participant to allocate more batches."
 HASH_GROW_BATCHES_FINISH	"Waiting for an elected Parallel Hash participant to decide on future batch growth."
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 5bfca3040aa..19340d66d32 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -17,6 +17,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 OBJS = \
 	logtape.o \
 	qsort_interruptible.o \
+	sharedbits.o \
 	sharedtuplestore.o \
 	sortsupport.o \
 	tuplesort.o \
diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
index 137476a7a77..12abe717adf 100644
--- a/src/backend/utils/sort/sharedtuplestore.c
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -46,19 +46,28 @@ typedef struct SharedTuplestoreChunk
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } SharedTuplestoreChunk;
 
+typedef enum SharedTuplestoreMode
+{
+	WRITING = 0,
+	READING = 1,
+	APPENDING = 2
+} SharedTuplestoreMode;
+
 /* Per-participant shared state. */
 typedef struct SharedTuplestoreParticipant
 {
 	LWLock		lock;
 	BlockNumber read_page;		/* Page number for next read. */
+	bool		rewound;
 	BlockNumber npages;			/* Number of pages written. */
-	bool		writing;		/* Used only for assertions. */
+	SharedTuplestoreMode mode;	/* Used only for assertions. */
 } SharedTuplestoreParticipant;
 
 /* The control object that lives in shared memory. */
 struct SharedTuplestore
 {
 	int			nparticipants;	/* Number of participants that can write. */
+	pg_atomic_uint32 ntuples;	/* Number of tuples in this tuplestore. */
 	int			flags;			/* Flag bits from SHARED_TUPLESTORE_XXX */
 	size_t		meta_data_size; /* Size of per-tuple header. */
 	char		name[NAMEDATALEN];	/* A name for this tuplestore. */
@@ -91,6 +100,8 @@ struct SharedTuplestoreAccessor
 	BlockNumber write_page;		/* The next page to write to. */
 	char	   *write_pointer;	/* Current write pointer within chunk. */
 	char	   *write_end;		/* One past the end of the current chunk. */
+	bool		participated;	/* Did the worker participate in writing this
+								 * STS at any point */
 };
 
 static void sts_filename(char *name, SharedTuplestoreAccessor *accessor,
@@ -136,6 +147,7 @@ sts_initialize(SharedTuplestore *sts, int participants,
 	Assert(my_participant_number < participants);
 
 	sts->nparticipants = participants;
+	pg_atomic_init_u32(&sts->ntuples, 1);
 	sts->meta_data_size = meta_data_size;
 	sts->flags = flags;
 
@@ -158,7 +170,8 @@ sts_initialize(SharedTuplestore *sts, int participants,
 						 LWTRANCHE_SHARED_TUPLESTORE);
 		sts->participants[i].read_page = 0;
 		sts->participants[i].npages = 0;
-		sts->participants[i].writing = false;
+		sts->participants[i].rewound = false;
+		sts->participants[i].mode = READING;
 	}
 
 	accessor = palloc0(sizeof(SharedTuplestoreAccessor));
@@ -188,6 +201,7 @@ sts_attach(SharedTuplestore *sts,
 	accessor->sts = sts;
 	accessor->fileset = fileset;
 	accessor->context = CurrentMemoryContext;
+	accessor->participated = false;
 
 	return accessor;
 }
@@ -219,7 +233,9 @@ sts_end_write(SharedTuplestoreAccessor *accessor)
 		pfree(accessor->write_chunk);
 		accessor->write_chunk = NULL;
 		accessor->write_file = NULL;
-		accessor->sts->participants[accessor->participant].writing = false;
+		accessor->write_pointer = NULL;
+		accessor->write_end = NULL;
+		accessor->sts->participants[accessor->participant].mode = READING;
 	}
 }
 
@@ -263,7 +279,7 @@ sts_begin_parallel_scan(SharedTuplestoreAccessor *accessor)
 	 * files have stopped growing.
 	 */
 	for (i = 0; i < accessor->sts->nparticipants; ++i)
-		Assert(!accessor->sts->participants[i].writing);
+		Assert((accessor->sts->participants[i].mode == READING) || (accessor->sts->participants[i].mode == APPENDING));
 
 	/*
 	 * We will start out reading the file that THIS backend wrote.  There may
@@ -317,9 +333,10 @@ sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
 			BufFileCreateFileSet(&accessor->fileset->fs, name);
 		MemoryContextSwitchTo(oldcxt);
 
+		accessor->participated = true;
 		/* Set up the shared state for this backend's file. */
 		participant = &accessor->sts->participants[accessor->participant];
-		participant->writing = true;	/* for assertions only */
+		participant->mode = WRITING;	/* for assertions only */
 	}
 
 	/* Do we have space? */
@@ -488,6 +505,17 @@ sts_read_tuple(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return tuple;
 }
 
+MinimalTuple
+sts_parallel_scan_chunk(SharedTuplestoreAccessor *accessor,
+						void *meta_data,
+						bool inner)
+{
+	Assert(accessor->read_file);
+	if (accessor->read_ntuples < accessor->read_ntuples_available)
+		return sts_read_tuple(accessor, meta_data);
+	return NULL;
+}
+
 /*
  * Get the next tuple in the current parallel scan.
  */
@@ -501,7 +529,13 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	for (;;)
 	{
 		/* Can we read more tuples from the current chunk? */
-		if (accessor->read_ntuples < accessor->read_ntuples_available)
+		/*
+		 * Added a check for accessor->read_file being present here, as it
+		 * became relevant for adaptive hashjoin. TODO: Not sure if this has
+		 * other consequences for correctness
+		 */
+
+		if (accessor->read_ntuples < accessor->read_ntuples_available && accessor->read_file)
 			return sts_read_tuple(accessor, meta_data);
 
 		/* Find the location of a new chunk to read. */
@@ -591,6 +625,56 @@ sts_parallel_scan_next(SharedTuplestoreAccessor *accessor, void *meta_data)
 	return NULL;
 }
 
+uint32
+sts_increment_ntuples(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_fetch_add_u32(&accessor->sts->ntuples, 1);
+}
+
+uint32
+sts_get_tuplenum(SharedTuplestoreAccessor *accessor)
+{
+	return pg_atomic_read_u32(&accessor->sts->ntuples);
+}
+
+int
+sta_get_read_participant(SharedTuplestoreAccessor *accessor)
+{
+	return accessor->read_participant;
+}
+
+void
+sts_spill_leftover_tuples(SharedTuplestoreAccessor *accessor, MinimalTuple tuple, uint32 hashvalue)
+{
+	tupleMetadata metadata;
+	SharedTuplestoreParticipant *participant;
+	char		name[MAXPGPATH];
+
+	metadata.hashvalue = hashvalue;
+	participant = &accessor->sts->participants[accessor->participant];
+	participant->mode = APPENDING;	/* for assertions only */
+
+	sts_filename(name, accessor, accessor->participant);
+	if (!accessor->participated)
+	{
+		accessor->write_file = BufFileCreateFileSet(&accessor->fileset->fs, name);
+		accessor->participated = true;
+	}
+
+	else
+		accessor->write_file = BufFileOpenFileSet(&accessor->fileset->fs, name, O_WRONLY, false);
+
+	BufFileSeek(accessor->write_file, 0, -1, SEEK_END);
+	do
+	{
+		sts_puttuple(accessor, &metadata, tuple);
+	} while ((tuple = sts_parallel_scan_chunk(accessor, &metadata, true)));
+
+	accessor->read_ntuples = 0;
+	accessor->read_ntuples_available = 0;
+	sts_end_write(accessor);
+}
+
 /*
  * Create the name used for the BufFile that a given participant will write.
  */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index aa5872bc154..90b12ac6828 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
+	bool		usage;			/* print memory usage */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
 	int			indent;			/* current indentation level */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 2d8ed8688cd..d134176e03e 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -19,6 +19,7 @@
 #include "storage/barrier.h"
 #include "storage/buffile.h"
 #include "storage/lwlock.h"
+#include "utils/sharedbits.h"
 
 /* ----------------------------------------------------------------
  *				hash-join hash table structures
@@ -153,6 +154,17 @@ typedef struct HashMemoryChunkData *HashMemoryChunk;
 /* tuples exceeding HASH_CHUNK_THRESHOLD bytes are put in their own chunk */
 #define HASH_CHUNK_THRESHOLD	(HASH_CHUNK_SIZE / 4)
 
+/*
+ * HashJoinTableData->curstripe the current stripe number
+ * The phantom stripe refers to the state of the inner side hashtable (empty)
+ * during the final scan of the outer batch file for a batch being processed
+ * using the hashloop fallback algorithm.
+ * In parallel-aware hash join, curstripe is in a detached state
+ * when the worker is not attached to the stripe_barrier.
+ */
+#define PHANTOM_STRIPE -2
+#define STRIPE_DETACHED -1
+
 /*
  * For each batch of a Parallel Hash Join, we have a ParallelHashJoinBatch
  * object in shared memory to coordinate access to it.  Since they are
@@ -163,15 +175,35 @@ typedef struct ParallelHashJoinBatch
 {
 	dsa_pointer buckets;		/* array of hash table buckets */
 	Barrier		batch_barrier;	/* synchronization for joining this batch */
+	Barrier		stripe_barrier; /* synchronization for stripes */
 
 	dsa_pointer chunks;			/* chunks of tuples loaded */
 	size_t		size;			/* size of buckets + chunks in memory */
 	size_t		estimated_size; /* size of buckets + chunks while writing */
-	size_t		ntuples;		/* number of tuples loaded */
+	 /* total number of tuples loaded into batch (in memory and spill files) */
+	size_t		ntuples;
 	size_t		old_ntuples;	/* number of tuples before repartitioning */
 	bool		space_exhausted;
 	bool		skip_unmatched; /* whether to abandon unmatched scan */
 
+	/* Adaptive HashJoin */
+
+	/*
+	 * after finishing build phase, hashloop_fallback cannot change, and does
+	 * not require a lock to read
+	 */
+	pg_atomic_flag overflow_required;
+	bool		hashloop_fallback;
+	int			nstripes;		/* the number of stripes in the batch */
+	/* number of tuples loaded into the hashtable */
+	pg_atomic_uint64 ntuples_in_memory;
+
+	/*
+	 * Note that ntuples will reflect the total number of tuples in the batch
+	 * while ntuples_in_memory will reflect how many tuples are in memory
+	 */
+	LWLock		lock;
+
 	/*
 	 * Variable-sized SharedTuplestore objects follow this struct in memory.
 	 * See the accessor macros below.
@@ -189,10 +221,17 @@ typedef struct ParallelHashJoinBatch
 	 ((char *) ParallelHashJoinBatchInner(batch) +						\
 	  MAXALIGN(sts_estimate(nparticipants))))
 
+/* Accessor for sharedbits following a ParallelHashJoinBatch. */
+#define ParallelHashJoinBatchOuterBits(batch, nparticipants) \
+	((SharedBits *)												\
+	 ((char *) ParallelHashJoinBatchOuter(batch, nparticipants) +						\
+	  MAXALIGN(sts_estimate(nparticipants))))
+
 /* Total size of a ParallelHashJoinBatch and tuplestores. */
 #define EstimateParallelHashJoinBatch(hashtable)						\
 	(MAXALIGN(sizeof(ParallelHashJoinBatch)) +							\
-	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2)
+	 MAXALIGN(sts_estimate((hashtable)->parallel_state->nparticipants)) * 2 + \
+	 MAXALIGN(sb_estimate((hashtable)->parallel_state->nparticipants)))
 
 /* Accessor for the nth ParallelHashJoinBatch given the base. */
 #define NthParallelHashJoinBatch(base, n)								\
@@ -217,8 +256,18 @@ typedef struct ParallelHashJoinBatchAccessor
 	bool		at_least_one_chunk; /* has this backend allocated a chunk? */
 	bool		outer_eof;		/* has this process hit end of batch? */
 	bool		done;			/* flag to remember that a batch is done */
+	/* -1 for not done, 0 for tentatively done, 1 for done */
 	SharedTuplestoreAccessor *inner_tuples;
 	SharedTuplestoreAccessor *outer_tuples;
+	SharedBitsAccessor *sba;
+
+	/*
+	 * All participants except the last worker working on a batch which has
+	 * fallen back to hashloop processing save the stripe barrier phase and
+	 * detach to avoid the deadlock hazard of waiting on a barrier after
+	 * tuples have been emitted.
+	 */
+	int			last_participating_stripe_phase;
 } ParallelHashJoinBatchAccessor;
 
 /*
@@ -237,8 +286,27 @@ typedef enum ParallelHashGrowth
 	PHJ_GROWTH_NEED_MORE_BATCHES,
 	/* Repartitioning didn't help last time, so don't try to do that again. */
 	PHJ_GROWTH_DISABLED,
+
+	/*
+	 * While repartitioning or, if nbatches would overflow int, disable growth
+	 * in the number of batches
+	 */
+	PHJ_GROWTH_SPILL_BATCH0,
+	PHJ_GROWTH_LOADING
 } ParallelHashGrowth;
 
+typedef enum ParallelHashJoinBatchAccessorStatus
+{
+	/* No more useful work can be done on this batch by this worker */
+	PHJ_BATCH_ACCESSOR_DONE,
+
+	/*
+	 * The worker has not yet checked this batch to see if it can do useful
+	 * work
+	 */
+	PHJ_BATCH_ACCESSOR_NOT_DONE
+}			ParallelHashJoinBatchAccessorStatus;
+
 /*
  * The shared state used to coordinate a Parallel Hash Join.  This is stored
  * in the DSM segment.
@@ -258,6 +326,8 @@ typedef struct ParallelHashJoinState
 	LWLock		lock;			/* lock protecting the above */
 
 	Barrier		build_barrier;	/* synchronization for the build phases */
+	Barrier		eviction_barrier;
+	Barrier		repartition_barrier;
 	Barrier		grow_batches_barrier;
 	Barrier		grow_buckets_barrier;
 	pg_atomic_uint32 distributor;	/* counter for load balancing */
@@ -275,11 +345,44 @@ typedef struct ParallelHashJoinState
 
 /* The phases for probing each batch, used by for batch_barrier. */
 #define PHJ_BATCH_ELECT					0
-#define PHJ_BATCH_ALLOCATE				1
-#define PHJ_BATCH_LOAD					2
 #define PHJ_BATCH_PROBE					3
 #define PHJ_BATCH_SCAN					4
-#define PHJ_BATCH_FREE					5
+#define PHJ_BATCH_STRIPE				2
+#define PHJ_BATCH_FREE					3
+
+/* The phases for probing each stripe of each batch used with stripe barriers */
+#define PHJ_STRIPE_INVALID_PHASE        -1
+#define PHJ_STRIPE_ELECTING				0
+#define PHJ_STRIPE_RESETTING			1
+#define PHJ_STRIPE_LOADING				2
+#define PHJ_STRIPE_OVERFLOWING          3
+#define PHJ_STRIPE_PROBING				4
+#define PHJ_STRIPE_DONE				    5
+#define PHJ_STRIPE_NUMBER(n)            ((n) / 6)
+#define PHJ_STRIPE_PHASE(n)             ((n) % 6)
+
+#define PHJ_EVICT_ELECTING 0
+#define PHJ_EVICT_RESETTING 1
+#define PHJ_EVICT_SPILLING 2
+#define PHJ_EVICT_FINISHING 3
+#define PHJ_EVICT_DONE 4
+#define PHJ_EVICT_PHASE(n)          ((n) % 5)
+
+/*
+ * These phases are now required for repartitioning batch 0 since it can
+ * spill. First all tuples which were resident in the hashtable need to
+ * be relocated either back to the hashtable or to a spill file, if they
+ * would relocate to a batch 1+ given the new number of batches. After
+ * draining the chunk_work_queue, we must drain the batch 0 spill file,
+ * if it exists. Some tuples may have been relocated from the hashtable
+ * to other batches, in which case, space may have been freed up which
+ * the tuples from the batch 0 spill file can occupy. The tuples from the
+ * batch 0 spill file may go to 1) the hashtable, 2) back to the batch 0
+ * spill file in the new generation of batches, 3) to a batch file 1+
+ */
+#define PHJ_REPARTITION_BATCH0_DRAIN_QUEUE 0
+#define PHJ_REPARTITION_BATCH0_DRAIN_SPILL_FILE 1
+#define PHJ_REPARTITION_BATCH0_PHASE(n)  ((n) % 2)
 
 /* The phases of batch growth while hashing, for grow_batches_barrier. */
 #define PHJ_GROW_BATCHES_ELECT			0
@@ -325,8 +428,6 @@ typedef struct HashJoinTableData
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
 
-	bool		growEnabled;	/* flag to shut off nbatch increases */
-
 	double		totalTuples;	/* # tuples obtained from inner plan */
 	double		partialTuples;	/* # tuples obtained from inner plan by me */
 	double		skewTuples;		/* # tuples inserted into skew tuples */
@@ -341,6 +442,18 @@ typedef struct HashJoinTableData
 	BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
 	BufFile   **outerBatchFile; /* buffered virtual temp file per batch */
 
+	/*
+	 * Adaptive hashjoin variables
+	 */
+	BufFile   **hashloopBatchFile;	/* outer match status files if fall back */
+	List	   *fallback_batches_stats; /* per hashjoin batch statistics */
+
+	/*
+	 * current stripe #; 0 during 1st pass, -1 (macro STRIPE_DETACHED) when
+	 * detached, -2 on phantom stripe (macro PHANTOM_STRIPE)
+	 */
+	int			curstripe;
+
 	Size		spaceUsed;		/* memory space currently used by tuples */
 	Size		spaceAllowed;	/* upper limit for space used */
 	Size		spacePeak;		/* peak space used */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index bfd7b6d8445..51db4e957a2 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "nodes/pg_list.h"
 
 
 /*
@@ -55,6 +56,12 @@ typedef struct WalUsage
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
+typedef struct FallbackBatchStats
+{
+	int			batchno;
+	int			numstripes;
+} FallbackBatchStats;
+
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
 typedef enum InstrumentOption
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index e4eb7bc6359..c67ea058592 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -30,6 +30,7 @@ extern void ExecParallelHashTableAlloc(HashJoinTable hashtable,
 extern void ExecHashTableDestroy(HashJoinTable hashtable);
 extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
+extern bool ExecHashTableDetachStripe(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 												 int batchno);
 
@@ -39,9 +40,11 @@ extern void ExecHashTableInsert(HashJoinTable hashtable,
 extern void ExecParallelHashTableInsert(HashJoinTable hashtable,
 										TupleTableSlot *slot,
 										uint32 hashvalue);
-extern void ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
+extern MinimalTuple
+			ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
 													TupleTableSlot *slot,
-													uint32 hashvalue);
+													uint32 hashvalue,
+													int read_participant);
 extern void ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 									  uint32 hashvalue,
 									  int *bucketno,
@@ -55,6 +58,8 @@ extern bool ExecScanHashTableForUnmatched(HashJoinState *hjstate,
 extern bool ExecParallelScanHashTableForUnmatched(HashJoinState *hjstate,
 												  ExprContext *econtext);
 extern void ExecHashTableReset(HashJoinTable hashtable);
+extern void
+			ExecParallelHashTableRecycle(HashJoinTable hashtable);
 extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index b82655e7e55..15cdf18d0a3 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -128,6 +128,7 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+	uint32		tts_tuplenum;	/* a tuple id for use when ctid cannot be used */
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -454,6 +455,7 @@ static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
 	slot->tts_ops->clear(slot);
+	slot->tts_tuplenum = 0;		/* TODO: should this be done elsewhere? */
 
 	return slot;
 }
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 182a6956bb0..4e2258c0c1b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2238,6 +2238,10 @@ typedef struct HashJoinState
 	int			hj_JoinState;
 	bool		hj_MatchedOuter;
 	bool		hj_OuterNotEmpty;
+	/* Adaptive Hashjoin variables */
+	int			hj_CurNumOuterTuples;	/* number of outer tuples in a batch */
+	unsigned int hj_CurOuterMatchStatus;
+	int			hj_EmitOuterTupleId;
 } HashJoinState;
 
 
@@ -2759,6 +2763,7 @@ typedef struct HashInstrumentation
 	int			nbatch;			/* number of batches at end of execution */
 	int			nbatch_original;	/* planned number of batches */
 	Size		space_peak;		/* peak memory usage in bytes */
+	List	   *fallback_batches_stats; /* per hashjoin batch stats */
 } HashInstrumentation;
 
 /* ----------------
diff --git a/src/include/utils/sharedtuplestore.h b/src/include/utils/sharedtuplestore.h
index dcf676ff5fb..08cd47a93bf 100644
--- a/src/include/utils/sharedtuplestore.h
+++ b/src/include/utils/sharedtuplestore.h
@@ -22,6 +22,17 @@ typedef struct SharedTuplestore SharedTuplestore;
 
 struct SharedTuplestoreAccessor;
 typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;
+struct tupleMetadata;
+typedef struct tupleMetadata tupleMetadata;
+struct tupleMetadata
+{
+	uint32		hashvalue;
+	union
+	{
+		uint32		tupleid;	/* tuple number or id on the outer side */
+		int			stripe;		/* stripe number for inner side */
+	};
+};
 
 /*
  * A flag indicating that the tuplestore will only be scanned once, so backing
@@ -58,4 +69,14 @@ extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
 extern MinimalTuple sts_parallel_scan_next(SharedTuplestoreAccessor *accessor,
 										   void *meta_data);
 
+extern uint32 sts_increment_ntuples(SharedTuplestoreAccessor *accessor);
+extern uint32 sts_get_tuplenum(SharedTuplestoreAccessor *accessor);
+extern int	sta_get_read_participant(SharedTuplestoreAccessor *accessor);
+extern void sts_spill_leftover_tuples(SharedTuplestoreAccessor *accessor, MinimalTuple tuple, uint32 hashvalue);
+
+extern MinimalTuple sts_parallel_scan_chunk(SharedTuplestoreAccessor *accessor,
+											void *meta_data,
+											bool inner);
+
+
 #endif							/* SHAREDTUPLESTORE_H */
diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 4fc34a0e72a..068d08aed7f 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -574,12 +574,12 @@ set hash_mem_multiplier = 1.0;
 explain (costs off)
   select count(*) from join_foo
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-    on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+    on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
                                      QUERY PLAN                                     
 ------------------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop Left Join
-         Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+         Join Filter: ((join_foo.id < (b1.id  1)) AND (join_foo.id > (b1.id - 1)))
          ->  Seq Scan on join_foo
          ->  Gather
                Workers Planned: 2
@@ -592,7 +592,7 @@ explain (costs off)
 
 select count(*) from join_foo
   left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-  on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+  on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
  count 
 -------
      3
@@ -603,7 +603,7 @@ select final > 1 as multibatch
 $$
   select count(*) from join_foo
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-    on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+    on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
 $$);
  multibatch 
 ------------
@@ -626,12 +626,12 @@ set hash_mem_multiplier = 1.0;
 explain (costs off)
   select count(*) from join_foo
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-    on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+    on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
                                      QUERY PLAN                                     
 ------------------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop Left Join
-         Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+         Join Filter: ((join_foo.id < (b1.id  1)) AND (join_foo.id > (b1.id - 1)))
          ->  Seq Scan on join_foo
          ->  Gather
                Workers Planned: 2
@@ -644,7 +644,7 @@ explain (costs off)
 
 select count(*) from join_foo
   left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-  on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+  on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
  count 
 -------
      3
@@ -655,7 +655,7 @@ select final > 1 as multibatch
 $$
   select count(*) from join_foo
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-    on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+    on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
 $$);
  multibatch 
 ------------
@@ -678,12 +678,12 @@ set hash_mem_multiplier = 1.0;
 explain (costs off)
   select count(*) from join_foo
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-    on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+    on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
                                      QUERY PLAN                                     
 ------------------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop Left Join
-         Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+         Join Filter: ((join_foo.id < (b1.id  1)) AND (join_foo.id > (b1.id - 1)))
          ->  Seq Scan on join_foo
          ->  Gather
                Workers Planned: 2
@@ -696,7 +696,7 @@ explain (costs off)
 
 select count(*) from join_foo
   left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-  on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+  on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
  count 
 -------
      3
@@ -707,7 +707,7 @@ select final > 1 as multibatch
 $$
   select count(*) from join_foo
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-    on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+    on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
 $$);
  multibatch 
 ------------
@@ -730,12 +730,12 @@ set hash_mem_multiplier = 1.0;
 explain (costs off)
   select count(*) from join_foo
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-    on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+    on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
                                      QUERY PLAN                                     
 ------------------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop Left Join
-         Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+         Join Filter: ((join_foo.id < (b1.id  1)) AND (join_foo.id > (b1.id - 1)))
          ->  Seq Scan on join_foo
          ->  Gather
                Workers Planned: 2
@@ -748,7 +748,7 @@ explain (costs off)
 
 select count(*) from join_foo
   left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-  on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+  on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
  count 
 -------
      3
@@ -759,7 +759,7 @@ select final > 1 as multibatch
 $$
   select count(*) from join_foo
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
-    on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+    on join_foo.id < ss.id  1 and join_foo.id > ss.id - 1;
 $$);
  multibatch 
 ------------
@@ -914,46 +914,1998 @@ rollback to settings;
 -- the hash table)
 -- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
 -- sts_puttuple oversized tuple cases because it's multi-batch)
-savepoint settings;
-set max_parallel_workers_per_gather = 2;
-set enable_parallel_hash = on;
-set work_mem = '128kB';
-set hash_mem_multiplier = 1.0;
-explain (costs off)
-  select length(max(s.t))
-  from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- savepoint settings;
+-- set max_parallel_workers_per_gather = 2;
+-- set enable_parallel_hash = on;
+-- TODO: throw an error when this happens: cannot set work_mem lower than the side of a single tuple
+-- TODO: ensure that oversize tuple code is still exercised (should be with some of the stub stuff below)
+-- TODO: commented this out since it would crash otherwise
+-- this test is no longer multi-batch, so, perhaps, it should be removed
+-- set work_mem = '128kB';
+-- explain (costs off)
+--   select length(max(s.t))
+--   from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- select length(max(s.t))
+-- from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- select final > 1 as multibatch
+--   from hash_join_batches(
+-- $$
+--   select length(max(s.t))
+--   from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- $$);
+-- rollback to settings;
+ rollback;
+ -- Verify that hash key expressions reference the correct
+ -- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+@@ -1013,3 994,1968 @@ WHERE
+ (1 row)
+ 
+ ROLLBACK;
+-- Serial Adaptive Hash Join
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8090));
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+ANALYZE probeside, hashside_wide;
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
                            QUERY PLAN                           
 ----------------------------------------------------------------
- Finalize Aggregate
-   ->  Gather
-         Workers Planned: 2
-         ->  Partial Aggregate
-               ->  Parallel Hash Left Join
-                     Hash Cond: (wide.id = wide_1.id)
-                     ->  Parallel Seq Scan on wide
-                     ->  Parallel Hash
-                           ->  Parallel Seq Scan on wide wide_1
-(9 rows)
+ Hash Left Join (actual rows=215 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
 
-select length(max(s.t))
-from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
- length 
---------
- 320000
-(1 row)
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | btrim | id | hash |                 btrim                  
+------+-------+----+------+----------------------------------------
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    3 |       |  3 |    3 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+      |       |  1 |    1 | unmatched inner tuple in first stripe
+      |       |  1 |    1 | unmatched inner tuple in last stripe
+      |       |  1 |    1 | unmatched inner tuple in middle stripe
+(214 rows)
 
-select final > 1 as multibatch
-  from hash_join_batches(
-$$
-  select length(max(s.t))
-  from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-$$);
- multibatch 
-------------
- t
-(1 row)
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Right Join (actual rows=214 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash |                 btrim                  
+------+-----------------------+----+------+----------------------------------------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+      |                       |  1 |    1 | unmatched inner tuple in first stripe
+      |                       |  1 |    1 | unmatched inner tuple in last stripe
+      |                       |  1 |    1 | unmatched inner tuple in middle stripe
+(218 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Full Join (actual rows=218 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join (actual rows=12 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash | btrim 
+------+-------
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    3 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+(12 rows)
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Anti Join (actual rows=4 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash |         btrim         
+------+-----------------------
+    1 | unmatched outer tuple
+    2 | 
+    5 | 
+    6 | unmatched outer tuple
+(4 rows)
+
+-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Gather (actual rows=215 loops=1)
+   Workers Planned: 1
+   Workers Launched: 1
+   ->  Parallel Hash Left Join (actual rows=108 loops=2)
+         Hash Cond: (probeside.a = hashside_wide.a)
+         ->  Parallel Seq Scan on probeside (actual rows=16 loops=1)
+         ->  Parallel Hash (actual rows=21 loops=2)
+               Buckets: 8 (originally 8)  Batches: 128 (originally 8)
+               Batch: 1  Stripes: 3
+               Batch: 6  Stripes: 3
+               ->  Parallel Seq Scan on hashside_wide (actual rows=42 loops=1)
+(11 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
 
 rollback to settings;
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(id int generated always as identity, a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0(a) SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0(a) SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide_batch0(id int generated always as identity, a stub);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+SELECT
+       hashside_wide_batch0.id as hashside_id, 
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id, 
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value 
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+ hashside_id | hashside_hash | probeside_id | probeside_hash | probeside_trimmed_value | hashside_trimmed_value 
+-------------+---------------+--------------+----------------+-------------------------+------------------------
+           1 |             0 |            1 |              0 |                         | 
+           1 |             0 |            2 |              0 |                         | 
+           1 |             0 |            3 |              0 |                         | 
+           1 |             0 |            4 |              0 |                         | 
+           1 |             0 |            5 |              0 |                         | 
+           1 |             0 |            6 |              0 |                         | 
+           1 |             0 |            7 |              0 |                         | 
+           1 |             0 |            8 |              0 |                         | 
+           1 |             0 |            9 |              0 |                         | 
+           1 |             0 |           10 |              0 |                         | 
+           1 |             0 |           11 |              0 |                         | 
+           1 |             0 |           12 |              0 |                         | 
+           1 |             0 |           13 |              0 |                         | 
+           2 |             0 |            1 |              0 |                         | 
+           2 |             0 |            2 |              0 |                         | 
+           2 |             0 |            3 |              0 |                         | 
+           2 |             0 |            4 |              0 |                         | 
+           2 |             0 |            5 |              0 |                         | 
+           2 |             0 |            6 |              0 |                         | 
+           2 |             0 |            7 |              0 |                         | 
+           2 |             0 |            8 |              0 |                         | 
+           2 |             0 |            9 |              0 |                         | 
+           2 |             0 |           10 |              0 |                         | 
+           2 |             0 |           11 |              0 |                         | 
+           2 |             0 |           12 |              0 |                         | 
+           2 |             0 |           13 |              0 |                         | 
+           3 |             0 |            1 |              0 |                         | 
+           3 |             0 |            2 |              0 |                         | 
+           3 |             0 |            3 |              0 |                         | 
+           3 |             0 |            4 |              0 |                         | 
+           3 |             0 |            5 |              0 |                         | 
+           3 |             0 |            6 |              0 |                         | 
+           3 |             0 |            7 |              0 |                         | 
+           3 |             0 |            8 |              0 |                         | 
+           3 |             0 |            9 |              0 |                         | 
+           3 |             0 |           10 |              0 |                         | 
+           3 |             0 |           11 |              0 |                         | 
+           3 |             0 |           12 |              0 |                         | 
+           3 |             0 |           13 |              0 |                         | 
+           4 |             0 |            1 |              0 |                         | 
+           4 |             0 |            2 |              0 |                         | 
+           4 |             0 |            3 |              0 |                         | 
+           4 |             0 |            4 |              0 |                         | 
+           4 |             0 |            5 |              0 |                         | 
+           4 |             0 |            6 |              0 |                         | 
+           4 |             0 |            7 |              0 |                         | 
+           4 |             0 |            8 |              0 |                         | 
+           4 |             0 |            9 |              0 |                         | 
+           4 |             0 |           10 |              0 |                         | 
+           4 |             0 |           11 |              0 |                         | 
+           4 |             0 |           12 |              0 |                         | 
+           4 |             0 |           13 |              0 |                         | 
+           5 |             0 |            1 |              0 |                         | 
+           5 |             0 |            2 |              0 |                         | 
+           5 |             0 |            3 |              0 |                         | 
+           5 |             0 |            4 |              0 |                         | 
+           5 |             0 |            5 |              0 |                         | 
+           5 |             0 |            6 |              0 |                         | 
+           5 |             0 |            7 |              0 |                         | 
+           5 |             0 |            8 |              0 |                         | 
+           5 |             0 |            9 |              0 |                         | 
+           5 |             0 |           10 |              0 |                         | 
+           5 |             0 |           11 |              0 |                         | 
+           5 |             0 |           12 |              0 |                         | 
+           5 |             0 |           13 |              0 |                         | 
+           6 |             0 |            1 |              0 |                         | 
+           6 |             0 |            2 |              0 |                         | 
+           6 |             0 |            3 |              0 |                         | 
+           6 |             0 |            4 |              0 |                         | 
+           6 |             0 |            5 |              0 |                         | 
+           6 |             0 |            6 |              0 |                         | 
+           6 |             0 |            7 |              0 |                         | 
+           6 |             0 |            8 |              0 |                         | 
+           6 |             0 |            9 |              0 |                         | 
+           6 |             0 |           10 |              0 |                         | 
+           6 |             0 |           11 |              0 |                         | 
+           6 |             0 |           12 |              0 |                         | 
+           6 |             0 |           13 |              0 |                         | 
+           7 |             0 |            1 |              0 |                         | 
+           7 |             0 |            2 |              0 |                         | 
+           7 |             0 |            3 |              0 |                         | 
+           7 |             0 |            4 |              0 |                         | 
+           7 |             0 |            5 |              0 |                         | 
+           7 |             0 |            6 |              0 |                         | 
+           7 |             0 |            7 |              0 |                         | 
+           7 |             0 |            8 |              0 |                         | 
+           7 |             0 |            9 |              0 |                         | 
+           7 |             0 |           10 |              0 |                         | 
+           7 |             0 |           11 |              0 |                         | 
+           7 |             0 |           12 |              0 |                         | 
+           7 |             0 |           13 |              0 |                         | 
+           8 |             0 |            1 |              0 |                         | 
+           8 |             0 |            2 |              0 |                         | 
+           8 |             0 |            3 |              0 |                         | 
+           8 |             0 |            4 |              0 |                         | 
+           8 |             0 |            5 |              0 |                         | 
+           8 |             0 |            6 |              0 |                         | 
+           8 |             0 |            7 |              0 |                         | 
+           8 |             0 |            8 |              0 |                         | 
+           8 |             0 |            9 |              0 |                         | 
+           8 |             0 |           10 |              0 |                         | 
+           8 |             0 |           11 |              0 |                         | 
+           8 |             0 |           12 |              0 |                         | 
+           8 |             0 |           13 |              0 |                         | 
+           9 |             0 |            1 |              0 |                         | 
+           9 |             0 |            2 |              0 |                         | 
+           9 |             0 |            3 |              0 |                         | 
+           9 |             0 |            4 |              0 |                         | 
+           9 |             0 |            5 |              0 |                         | 
+           9 |             0 |            6 |              0 |                         | 
+           9 |             0 |            7 |              0 |                         | 
+           9 |             0 |            8 |              0 |                         | 
+           9 |             0 |            9 |              0 |                         | 
+           9 |             0 |           10 |              0 |                         | 
+           9 |             0 |           11 |              0 |                         | 
+           9 |             0 |           12 |              0 |                         | 
+           9 |             0 |           13 |              0 |                         | 
+          10 |             0 |            1 |              0 |                         | 
+          10 |             0 |            2 |              0 |                         | 
+          10 |             0 |            3 |              0 |                         | 
+          10 |             0 |            4 |              0 |                         | 
+          10 |             0 |            5 |              0 |                         | 
+          10 |             0 |            6 |              0 |                         | 
+          10 |             0 |            7 |              0 |                         | 
+          10 |             0 |            8 |              0 |                         | 
+          10 |             0 |            9 |              0 |                         | 
+          10 |             0 |           10 |              0 |                         | 
+          10 |             0 |           11 |              0 |                         | 
+          10 |             0 |           12 |              0 |                         | 
+          10 |             0 |           13 |              0 |                         | 
+          11 |             0 |            1 |              0 |                         | 
+          11 |             0 |            2 |              0 |                         | 
+          11 |             0 |            3 |              0 |                         | 
+          11 |             0 |            4 |              0 |                         | 
+          11 |             0 |            5 |              0 |                         | 
+          11 |             0 |            6 |              0 |                         | 
+          11 |             0 |            7 |              0 |                         | 
+          11 |             0 |            8 |              0 |                         | 
+          11 |             0 |            9 |              0 |                         | 
+          11 |             0 |           10 |              0 |                         | 
+          11 |             0 |           11 |              0 |                         | 
+          11 |             0 |           12 |              0 |                         | 
+          11 |             0 |           13 |              0 |                         | 
+          12 |             0 |            1 |              0 |                         | 
+          12 |             0 |            2 |              0 |                         | 
+          12 |             0 |            3 |              0 |                         | 
+          12 |             0 |            4 |              0 |                         | 
+          12 |             0 |            5 |              0 |                         | 
+          12 |             0 |            6 |              0 |                         | 
+          12 |             0 |            7 |              0 |                         | 
+          12 |             0 |            8 |              0 |                         | 
+          12 |             0 |            9 |              0 |                         | 
+          12 |             0 |           10 |              0 |                         | 
+          12 |             0 |           11 |              0 |                         | 
+          12 |             0 |           12 |              0 |                         | 
+          12 |             0 |           13 |              0 |                         | 
+          13 |             0 |            1 |              0 |                         | 
+          13 |             0 |            2 |              0 |                         | 
+          13 |             0 |            3 |              0 |                         | 
+          13 |             0 |            4 |              0 |                         | 
+          13 |             0 |            5 |              0 |                         | 
+          13 |             0 |            6 |              0 |                         | 
+          13 |             0 |            7 |              0 |                         | 
+          13 |             0 |            8 |              0 |                         | 
+          13 |             0 |            9 |              0 |                         | 
+          13 |             0 |           10 |              0 |                         | 
+          13 |             0 |           11 |              0 |                         | 
+          13 |             0 |           12 |              0 |                         | 
+          13 |             0 |           13 |              0 |                         | 
+          14 |             0 |            1 |              0 |                         | 
+          14 |             0 |            2 |              0 |                         | 
+          14 |             0 |            3 |              0 |                         | 
+          14 |             0 |            4 |              0 |                         | 
+          14 |             0 |            5 |              0 |                         | 
+          14 |             0 |            6 |              0 |                         | 
+          14 |             0 |            7 |              0 |                         | 
+          14 |             0 |            8 |              0 |                         | 
+          14 |             0 |            9 |              0 |                         | 
+          14 |             0 |           10 |              0 |                         | 
+          14 |             0 |           11 |              0 |                         | 
+          14 |             0 |           12 |              0 |                         | 
+          14 |             0 |           13 |              0 |                         | 
+          15 |             0 |            1 |              0 |                         | 
+          15 |             0 |            2 |              0 |                         | 
+          15 |             0 |            3 |              0 |                         | 
+          15 |             0 |            4 |              0 |                         | 
+          15 |             0 |            5 |              0 |                         | 
+          15 |             0 |            6 |              0 |                         | 
+          15 |             0 |            7 |              0 |                         | 
+          15 |             0 |            8 |              0 |                         | 
+          15 |             0 |            9 |              0 |                         | 
+          15 |             0 |           10 |              0 |                         | 
+          15 |             0 |           11 |              0 |                         | 
+          15 |             0 |           12 |              0 |                         | 
+          15 |             0 |           13 |              0 |                         | 
+          16 |             0 |            1 |              0 |                         | 
+          16 |             0 |            2 |              0 |                         | 
+          16 |             0 |            3 |              0 |                         | 
+          16 |             0 |            4 |              0 |                         | 
+          16 |             0 |            5 |              0 |                         | 
+          16 |             0 |            6 |              0 |                         | 
+          16 |             0 |            7 |              0 |                         | 
+          16 |             0 |            8 |              0 |                         | 
+          16 |             0 |            9 |              0 |                         | 
+          16 |             0 |           10 |              0 |                         | 
+          16 |             0 |           11 |              0 |                         | 
+          16 |             0 |           12 |              0 |                         | 
+          16 |             0 |           13 |              0 |                         | 
+          17 |             0 |            1 |              0 |                         | 
+          17 |             0 |            2 |              0 |                         | 
+          17 |             0 |            3 |              0 |                         | 
+          17 |             0 |            4 |              0 |                         | 
+          17 |             0 |            5 |              0 |                         | 
+          17 |             0 |            6 |              0 |                         | 
+          17 |             0 |            7 |              0 |                         | 
+          17 |             0 |            8 |              0 |                         | 
+          17 |             0 |            9 |              0 |                         | 
+          17 |             0 |           10 |              0 |                         | 
+          17 |             0 |           11 |              0 |                         | 
+          17 |             0 |           12 |              0 |                         | 
+          17 |             0 |           13 |              0 |                         | 
+          18 |             0 |            1 |              0 |                         | 
+          18 |             0 |            2 |              0 |                         | 
+          18 |             0 |            3 |              0 |                         | 
+          18 |             0 |            4 |              0 |                         | 
+          18 |             0 |            5 |              0 |                         | 
+          18 |             0 |            6 |              0 |                         | 
+          18 |             0 |            7 |              0 |                         | 
+          18 |             0 |            8 |              0 |                         | 
+          18 |             0 |            9 |              0 |                         | 
+          18 |             0 |           10 |              0 |                         | 
+          18 |             0 |           11 |              0 |                         | 
+          18 |             0 |           12 |              0 |                         | 
+          18 |             0 |           13 |              0 |                         | 
+          19 |             0 |            1 |              0 |                         | 
+          19 |             0 |            2 |              0 |                         | 
+          19 |             0 |            3 |              0 |                         | 
+          19 |             0 |            4 |              0 |                         | 
+          19 |             0 |            5 |              0 |                         | 
+          19 |             0 |            6 |              0 |                         | 
+          19 |             0 |            7 |              0 |                         | 
+          19 |             0 |            8 |              0 |                         | 
+          19 |             0 |            9 |              0 |                         | 
+          19 |             0 |           10 |              0 |                         | 
+          19 |             0 |           11 |              0 |                         | 
+          19 |             0 |           12 |              0 |                         | 
+          19 |             0 |           13 |              0 |                         | 
+          20 |             0 |            1 |              0 |                         | 
+          20 |             0 |            2 |              0 |                         | 
+          20 |             0 |            3 |              0 |                         | 
+          20 |             0 |            4 |              0 |                         | 
+          20 |             0 |            5 |              0 |                         | 
+          20 |             0 |            6 |              0 |                         | 
+          20 |             0 |            7 |              0 |                         | 
+          20 |             0 |            8 |              0 |                         | 
+          20 |             0 |            9 |              0 |                         | 
+          20 |             0 |           10 |              0 |                         | 
+          20 |             0 |           11 |              0 |                         | 
+          20 |             0 |           12 |              0 |                         | 
+          20 |             0 |           13 |              0 |                         | 
+          21 |             0 |            1 |              0 |                         | 
+          21 |             0 |            2 |              0 |                         | 
+          21 |             0 |            3 |              0 |                         | 
+          21 |             0 |            4 |              0 |                         | 
+          21 |             0 |            5 |              0 |                         | 
+          21 |             0 |            6 |              0 |                         | 
+          21 |             0 |            7 |              0 |                         | 
+          21 |             0 |            8 |              0 |                         | 
+          21 |             0 |            9 |              0 |                         | 
+          21 |             0 |           10 |              0 |                         | 
+          21 |             0 |           11 |              0 |                         | 
+          21 |             0 |           12 |              0 |                         | 
+          21 |             0 |           13 |              0 |                         | 
+          22 |             0 |            1 |              0 |                         | 
+          22 |             0 |            2 |              0 |                         | 
+          22 |             0 |            3 |              0 |                         | 
+          22 |             0 |            4 |              0 |                         | 
+          22 |             0 |            5 |              0 |                         | 
+          22 |             0 |            6 |              0 |                         | 
+          22 |             0 |            7 |              0 |                         | 
+          22 |             0 |            8 |              0 |                         | 
+          22 |             0 |            9 |              0 |                         | 
+          22 |             0 |           10 |              0 |                         | 
+          22 |             0 |           11 |              0 |                         | 
+          22 |             0 |           12 |              0 |                         | 
+          22 |             0 |           13 |              0 |                         | 
+          23 |             0 |            1 |              0 |                         | 
+          23 |             0 |            2 |              0 |                         | 
+          23 |             0 |            3 |              0 |                         | 
+          23 |             0 |            4 |              0 |                         | 
+          23 |             0 |            5 |              0 |                         | 
+          23 |             0 |            6 |              0 |                         | 
+          23 |             0 |            7 |              0 |                         | 
+          23 |             0 |            8 |              0 |                         | 
+          23 |             0 |            9 |              0 |                         | 
+          23 |             0 |           10 |              0 |                         | 
+          23 |             0 |           11 |              0 |                         | 
+          23 |             0 |           12 |              0 |                         | 
+          23 |             0 |           13 |              0 |                         | 
+          24 |             0 |            1 |              0 |                         | 
+          24 |             0 |            2 |              0 |                         | 
+          24 |             0 |            3 |              0 |                         | 
+          24 |             0 |            4 |              0 |                         | 
+          24 |             0 |            5 |              0 |                         | 
+          24 |             0 |            6 |              0 |                         | 
+          24 |             0 |            7 |              0 |                         | 
+          24 |             0 |            8 |              0 |                         | 
+          24 |             0 |            9 |              0 |                         | 
+          24 |             0 |           10 |              0 |                         | 
+          24 |             0 |           11 |              0 |                         | 
+          24 |             0 |           12 |              0 |                         | 
+          24 |             0 |           13 |              0 |                         | 
+          25 |             0 |            1 |              0 |                         | 
+          25 |             0 |            2 |              0 |                         | 
+          25 |             0 |            3 |              0 |                         | 
+          25 |             0 |            4 |              0 |                         | 
+          25 |             0 |            5 |              0 |                         | 
+          25 |             0 |            6 |              0 |                         | 
+          25 |             0 |            7 |              0 |                         | 
+          25 |             0 |            8 |              0 |                         | 
+          25 |             0 |            9 |              0 |                         | 
+          25 |             0 |           10 |              0 |                         | 
+          25 |             0 |           11 |              0 |                         | 
+          25 |             0 |           12 |              0 |                         | 
+          25 |             0 |           13 |              0 |                         | 
+          26 |             0 |            1 |              0 |                         | 
+          26 |             0 |            2 |              0 |                         | 
+          26 |             0 |            3 |              0 |                         | 
+          26 |             0 |            4 |              0 |                         | 
+          26 |             0 |            5 |              0 |                         | 
+          26 |             0 |            6 |              0 |                         | 
+          26 |             0 |            7 |              0 |                         | 
+          26 |             0 |            8 |              0 |                         | 
+          26 |             0 |            9 |              0 |                         | 
+          26 |             0 |           10 |              0 |                         | 
+          26 |             0 |           11 |              0 |                         | 
+          26 |             0 |           12 |              0 |                         | 
+          26 |             0 |           13 |              0 |                         | 
+          27 |             0 |            1 |              0 |                         | 
+          27 |             0 |            2 |              0 |                         | 
+          27 |             0 |            3 |              0 |                         | 
+          27 |             0 |            4 |              0 |                         | 
+          27 |             0 |            5 |              0 |                         | 
+          27 |             0 |            6 |              0 |                         | 
+          27 |             0 |            7 |              0 |                         | 
+          27 |             0 |            8 |              0 |                         | 
+          27 |             0 |            9 |              0 |                         | 
+          27 |             0 |           10 |              0 |                         | 
+          27 |             0 |           11 |              0 |                         | 
+          27 |             0 |           12 |              0 |                         | 
+          27 |             0 |           13 |              0 |                         | 
+             |               |           14 |              0 | unmatched outer         | 
+(352 rows)
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+savepoint settings;
+set max_parallel_workers_per_gather = 1;
+set enable_parallel_hash = on;
+set work_mem = '64kB';
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a);
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Gather (actual rows=469 loops=1)
+   Workers Planned: 1
+   Workers Launched: 1
+   ->  Parallel Hash Left Join (actual rows=234 loops=2)
+         Hash Cond: (probeside_batch0.a = hashside_wide_batch0.a)
+         ->  Parallel Seq Scan on probeside_batch0 (actual rows=14 loops=1)
+         ->  Parallel Hash (actual rows=18 loops=2)
+               Buckets: 8 (originally 8)  Batches: 16 (originally 8)
+               Batch: 0  Stripes: 5
+               ->  Parallel Seq Scan on hashside_wide_batch0 (actual rows=36 loops=1)
+(10 rows)
+
+SELECT
+       hashside_wide_batch0.id as hashside_id, 
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id, 
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value 
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+ hashside_id | hashside_hash | probeside_id | probeside_hash | probeside_trimmed_value | hashside_trimmed_value 
+-------------+---------------+--------------+----------------+-------------------------+------------------------
+           1 |             0 |            1 |              0 |                         | 
+           1 |             0 |            2 |              0 |                         | 
+           1 |             0 |            3 |              0 |                         | 
+           1 |             0 |            4 |              0 |                         | 
+           1 |             0 |            5 |              0 |                         | 
+           1 |             0 |            6 |              0 |                         | 
+           1 |             0 |            7 |              0 |                         | 
+           1 |             0 |            8 |              0 |                         | 
+           1 |             0 |            9 |              0 |                         | 
+           1 |             0 |           10 |              0 |                         | 
+           1 |             0 |           11 |              0 |                         | 
+           1 |             0 |           12 |              0 |                         | 
+           1 |             0 |           13 |              0 |                         | 
+           2 |             0 |            1 |              0 |                         | 
+           2 |             0 |            2 |              0 |                         | 
+           2 |             0 |            3 |              0 |                         | 
+           2 |             0 |            4 |              0 |                         | 
+           2 |             0 |            5 |              0 |                         | 
+           2 |             0 |            6 |              0 |                         | 
+           2 |             0 |            7 |              0 |                         | 
+           2 |             0 |            8 |              0 |                         | 
+           2 |             0 |            9 |              0 |                         | 
+           2 |             0 |           10 |              0 |                         | 
+           2 |             0 |           11 |              0 |                         | 
+           2 |             0 |           12 |              0 |                         | 
+           2 |             0 |           13 |              0 |                         | 
+           3 |             0 |            1 |              0 |                         | 
+           3 |             0 |            2 |              0 |                         | 
+           3 |             0 |            3 |              0 |                         | 
+           3 |             0 |            4 |              0 |                         | 
+           3 |             0 |            5 |              0 |                         | 
+           3 |             0 |            6 |              0 |                         | 
+           3 |             0 |            7 |              0 |                         | 
+           3 |             0 |            8 |              0 |                         | 
+           3 |             0 |            9 |              0 |                         | 
+           3 |             0 |           10 |              0 |                         | 
+           3 |             0 |           11 |              0 |                         | 
+           3 |             0 |           12 |              0 |                         | 
+           3 |             0 |           13 |              0 |                         | 
+           4 |             0 |            1 |              0 |                         | 
+           4 |             0 |            2 |              0 |                         | 
+           4 |             0 |            3 |              0 |                         | 
+           4 |             0 |            4 |              0 |                         | 
+           4 |             0 |            5 |              0 |                         | 
+           4 |             0 |            6 |              0 |                         | 
+           4 |             0 |            7 |              0 |                         | 
+           4 |             0 |            8 |              0 |                         | 
+           4 |             0 |            9 |              0 |                         | 
+           4 |             0 |           10 |              0 |                         | 
+           4 |             0 |           11 |              0 |                         | 
+           4 |             0 |           12 |              0 |                         | 
+           4 |             0 |           13 |              0 |                         | 
+           5 |             0 |            1 |              0 |                         | 
+           5 |             0 |            2 |              0 |                         | 
+           5 |             0 |            3 |              0 |                         | 
+           5 |             0 |            4 |              0 |                         | 
+           5 |             0 |            5 |              0 |                         | 
+           5 |             0 |            6 |              0 |                         | 
+           5 |             0 |            7 |              0 |                         | 
+           5 |             0 |            8 |              0 |                         | 
+           5 |             0 |            9 |              0 |                         | 
+           5 |             0 |           10 |              0 |                         | 
+           5 |             0 |           11 |              0 |                         | 
+           5 |             0 |           12 |              0 |                         | 
+           5 |             0 |           13 |              0 |                         | 
+           6 |             0 |            1 |              0 |                         | 
+           6 |             0 |            2 |              0 |                         | 
+           6 |             0 |            3 |              0 |                         | 
+           6 |             0 |            4 |              0 |                         | 
+           6 |             0 |            5 |              0 |                         | 
+           6 |             0 |            6 |              0 |                         | 
+           6 |             0 |            7 |              0 |                         | 
+           6 |             0 |            8 |              0 |                         | 
+           6 |             0 |            9 |              0 |                         | 
+           6 |             0 |           10 |              0 |                         | 
+           6 |             0 |           11 |              0 |                         | 
+           6 |             0 |           12 |              0 |                         | 
+           6 |             0 |           13 |              0 |                         | 
+           7 |             0 |            1 |              0 |                         | 
+           7 |             0 |            2 |              0 |                         | 
+           7 |             0 |            3 |              0 |                         | 
+           7 |             0 |            4 |              0 |                         | 
+           7 |             0 |            5 |              0 |                         | 
+           7 |             0 |            6 |              0 |                         | 
+           7 |             0 |            7 |              0 |                         | 
+           7 |             0 |            8 |              0 |                         | 
+           7 |             0 |            9 |              0 |                         | 
+           7 |             0 |           10 |              0 |                         | 
+           7 |             0 |           11 |              0 |                         | 
+           7 |             0 |           12 |              0 |                         | 
+           7 |             0 |           13 |              0 |                         | 
+           8 |             0 |            1 |              0 |                         | 
+           8 |             0 |            2 |              0 |                         | 
+           8 |             0 |            3 |              0 |                         | 
+           8 |             0 |            4 |              0 |                         | 
+           8 |             0 |            5 |              0 |                         | 
+           8 |             0 |            6 |              0 |                         | 
+           8 |             0 |            7 |              0 |                         | 
+           8 |             0 |            8 |              0 |                         | 
+           8 |             0 |            9 |              0 |                         | 
+           8 |             0 |           10 |              0 |                         | 
+           8 |             0 |           11 |              0 |                         | 
+           8 |             0 |           12 |              0 |                         | 
+           8 |             0 |           13 |              0 |                         | 
+           9 |             0 |            1 |              0 |                         | 
+           9 |             0 |            2 |              0 |                         | 
+           9 |             0 |            3 |              0 |                         | 
+           9 |             0 |            4 |              0 |                         | 
+           9 |             0 |            5 |              0 |                         | 
+           9 |             0 |            6 |              0 |                         | 
+           9 |             0 |            7 |              0 |                         | 
+           9 |             0 |            8 |              0 |                         | 
+           9 |             0 |            9 |              0 |                         | 
+           9 |             0 |           10 |              0 |                         | 
+           9 |             0 |           11 |              0 |                         | 
+           9 |             0 |           12 |              0 |                         | 
+           9 |             0 |           13 |              0 |                         | 
+          10 |             0 |            1 |              0 |                         | 
+          10 |             0 |            2 |              0 |                         | 
+          10 |             0 |            3 |              0 |                         | 
+          10 |             0 |            4 |              0 |                         | 
+          10 |             0 |            5 |              0 |                         | 
+          10 |             0 |            6 |              0 |                         | 
+          10 |             0 |            7 |              0 |                         | 
+          10 |             0 |            8 |              0 |                         | 
+          10 |             0 |            9 |              0 |                         | 
+          10 |             0 |           10 |              0 |                         | 
+          10 |             0 |           11 |              0 |                         | 
+          10 |             0 |           12 |              0 |                         | 
+          10 |             0 |           13 |              0 |                         | 
+          11 |             0 |            1 |              0 |                         | 
+          11 |             0 |            2 |              0 |                         | 
+          11 |             0 |            3 |              0 |                         | 
+          11 |             0 |            4 |              0 |                         | 
+          11 |             0 |            5 |              0 |                         | 
+          11 |             0 |            6 |              0 |                         | 
+          11 |             0 |            7 |              0 |                         | 
+          11 |             0 |            8 |              0 |                         | 
+          11 |             0 |            9 |              0 |                         | 
+          11 |             0 |           10 |              0 |                         | 
+          11 |             0 |           11 |              0 |                         | 
+          11 |             0 |           12 |              0 |                         | 
+          11 |             0 |           13 |              0 |                         | 
+          12 |             0 |            1 |              0 |                         | 
+          12 |             0 |            2 |              0 |                         | 
+          12 |             0 |            3 |              0 |                         | 
+          12 |             0 |            4 |              0 |                         | 
+          12 |             0 |            5 |              0 |                         | 
+          12 |             0 |            6 |              0 |                         | 
+          12 |             0 |            7 |              0 |                         | 
+          12 |             0 |            8 |              0 |                         | 
+          12 |             0 |            9 |              0 |                         | 
+          12 |             0 |           10 |              0 |                         | 
+          12 |             0 |           11 |              0 |                         | 
+          12 |             0 |           12 |              0 |                         | 
+          12 |             0 |           13 |              0 |                         | 
+          13 |             0 |            1 |              0 |                         | 
+          13 |             0 |            2 |              0 |                         | 
+          13 |             0 |            3 |              0 |                         | 
+          13 |             0 |            4 |              0 |                         | 
+          13 |             0 |            5 |              0 |                         | 
+          13 |             0 |            6 |              0 |                         | 
+          13 |             0 |            7 |              0 |                         | 
+          13 |             0 |            8 |              0 |                         | 
+          13 |             0 |            9 |              0 |                         | 
+          13 |             0 |           10 |              0 |                         | 
+          13 |             0 |           11 |              0 |                         | 
+          13 |             0 |           12 |              0 |                         | 
+          13 |             0 |           13 |              0 |                         | 
+          14 |             0 |            1 |              0 |                         | 
+          14 |             0 |            2 |              0 |                         | 
+          14 |             0 |            3 |              0 |                         | 
+          14 |             0 |            4 |              0 |                         | 
+          14 |             0 |            5 |              0 |                         | 
+          14 |             0 |            6 |              0 |                         | 
+          14 |             0 |            7 |              0 |                         | 
+          14 |             0 |            8 |              0 |                         | 
+          14 |             0 |            9 |              0 |                         | 
+          14 |             0 |           10 |              0 |                         | 
+          14 |             0 |           11 |              0 |                         | 
+          14 |             0 |           12 |              0 |                         | 
+          14 |             0 |           13 |              0 |                         | 
+          15 |             0 |            1 |              0 |                         | 
+          15 |             0 |            2 |              0 |                         | 
+          15 |             0 |            3 |              0 |                         | 
+          15 |             0 |            4 |              0 |                         | 
+          15 |             0 |            5 |              0 |                         | 
+          15 |             0 |            6 |              0 |                         | 
+          15 |             0 |            7 |              0 |                         | 
+          15 |             0 |            8 |              0 |                         | 
+          15 |             0 |            9 |              0 |                         | 
+          15 |             0 |           10 |              0 |                         | 
+          15 |             0 |           11 |              0 |                         | 
+          15 |             0 |           12 |              0 |                         | 
+          15 |             0 |           13 |              0 |                         | 
+          16 |             0 |            1 |              0 |                         | 
+          16 |             0 |            2 |              0 |                         | 
+          16 |             0 |            3 |              0 |                         | 
+          16 |             0 |            4 |              0 |                         | 
+          16 |             0 |            5 |              0 |                         | 
+          16 |             0 |            6 |              0 |                         | 
+          16 |             0 |            7 |              0 |                         | 
+          16 |             0 |            8 |              0 |                         | 
+          16 |             0 |            9 |              0 |                         | 
+          16 |             0 |           10 |              0 |                         | 
+          16 |             0 |           11 |              0 |                         | 
+          16 |             0 |           12 |              0 |                         | 
+          16 |             0 |           13 |              0 |                         | 
+          17 |             0 |            1 |              0 |                         | 
+          17 |             0 |            2 |              0 |                         | 
+          17 |             0 |            3 |              0 |                         | 
+          17 |             0 |            4 |              0 |                         | 
+          17 |             0 |            5 |              0 |                         | 
+          17 |             0 |            6 |              0 |                         | 
+          17 |             0 |            7 |              0 |                         | 
+          17 |             0 |            8 |              0 |                         | 
+          17 |             0 |            9 |              0 |                         | 
+          17 |             0 |           10 |              0 |                         | 
+          17 |             0 |           11 |              0 |                         | 
+          17 |             0 |           12 |              0 |                         | 
+          17 |             0 |           13 |              0 |                         | 
+          18 |             0 |            1 |              0 |                         | 
+          18 |             0 |            2 |              0 |                         | 
+          18 |             0 |            3 |              0 |                         | 
+          18 |             0 |            4 |              0 |                         | 
+          18 |             0 |            5 |              0 |                         | 
+          18 |             0 |            6 |              0 |                         | 
+          18 |             0 |            7 |              0 |                         | 
+          18 |             0 |            8 |              0 |                         | 
+          18 |             0 |            9 |              0 |                         | 
+          18 |             0 |           10 |              0 |                         | 
+          18 |             0 |           11 |              0 |                         | 
+          18 |             0 |           12 |              0 |                         | 
+          18 |             0 |           13 |              0 |                         | 
+          19 |             0 |            1 |              0 |                         | 
+          19 |             0 |            2 |              0 |                         | 
+          19 |             0 |            3 |              0 |                         | 
+          19 |             0 |            4 |              0 |                         | 
+          19 |             0 |            5 |              0 |                         | 
+          19 |             0 |            6 |              0 |                         | 
+          19 |             0 |            7 |              0 |                         | 
+          19 |             0 |            8 |              0 |                         | 
+          19 |             0 |            9 |              0 |                         | 
+          19 |             0 |           10 |              0 |                         | 
+          19 |             0 |           11 |              0 |                         | 
+          19 |             0 |           12 |              0 |                         | 
+          19 |             0 |           13 |              0 |                         | 
+          20 |             0 |            1 |              0 |                         | 
+          20 |             0 |            2 |              0 |                         | 
+          20 |             0 |            3 |              0 |                         | 
+          20 |             0 |            4 |              0 |                         | 
+          20 |             0 |            5 |              0 |                         | 
+          20 |             0 |            6 |              0 |                         | 
+          20 |             0 |            7 |              0 |                         | 
+          20 |             0 |            8 |              0 |                         | 
+          20 |             0 |            9 |              0 |                         | 
+          20 |             0 |           10 |              0 |                         | 
+          20 |             0 |           11 |              0 |                         | 
+          20 |             0 |           12 |              0 |                         | 
+          20 |             0 |           13 |              0 |                         | 
+          21 |             0 |            1 |              0 |                         | 
+          21 |             0 |            2 |              0 |                         | 
+          21 |             0 |            3 |              0 |                         | 
+          21 |             0 |            4 |              0 |                         | 
+          21 |             0 |            5 |              0 |                         | 
+          21 |             0 |            6 |              0 |                         | 
+          21 |             0 |            7 |              0 |                         | 
+          21 |             0 |            8 |              0 |                         | 
+          21 |             0 |            9 |              0 |                         | 
+          21 |             0 |           10 |              0 |                         | 
+          21 |             0 |           11 |              0 |                         | 
+          21 |             0 |           12 |              0 |                         | 
+          21 |             0 |           13 |              0 |                         | 
+          22 |             0 |            1 |              0 |                         | 
+          22 |             0 |            2 |              0 |                         | 
+          22 |             0 |            3 |              0 |                         | 
+          22 |             0 |            4 |              0 |                         | 
+          22 |             0 |            5 |              0 |                         | 
+          22 |             0 |            6 |              0 |                         | 
+          22 |             0 |            7 |              0 |                         | 
+          22 |             0 |            8 |              0 |                         | 
+          22 |             0 |            9 |              0 |                         | 
+          22 |             0 |           10 |              0 |                         | 
+          22 |             0 |           11 |              0 |                         | 
+          22 |             0 |           12 |              0 |                         | 
+          22 |             0 |           13 |              0 |                         | 
+          23 |             0 |            1 |              0 |                         | 
+          23 |             0 |            2 |              0 |                         | 
+          23 |             0 |            3 |              0 |                         | 
+          23 |             0 |            4 |              0 |                         | 
+          23 |             0 |            5 |              0 |                         | 
+          23 |             0 |            6 |              0 |                         | 
+          23 |             0 |            7 |              0 |                         | 
+          23 |             0 |            8 |              0 |                         | 
+          23 |             0 |            9 |              0 |                         | 
+          23 |             0 |           10 |              0 |                         | 
+          23 |             0 |           11 |              0 |                         | 
+          23 |             0 |           12 |              0 |                         | 
+          23 |             0 |           13 |              0 |                         | 
+          24 |             0 |            1 |              0 |                         | 
+          24 |             0 |            2 |              0 |                         | 
+          24 |             0 |            3 |              0 |                         | 
+          24 |             0 |            4 |              0 |                         | 
+          24 |             0 |            5 |              0 |                         | 
+          24 |             0 |            6 |              0 |                         | 
+          24 |             0 |            7 |              0 |                         | 
+          24 |             0 |            8 |              0 |                         | 
+          24 |             0 |            9 |              0 |                         | 
+          24 |             0 |           10 |              0 |                         | 
+          24 |             0 |           11 |              0 |                         | 
+          24 |             0 |           12 |              0 |                         | 
+          24 |             0 |           13 |              0 |                         | 
+          25 |             0 |            1 |              0 |                         | 
+          25 |             0 |            2 |              0 |                         | 
+          25 |             0 |            3 |              0 |                         | 
+          25 |             0 |            4 |              0 |                         | 
+          25 |             0 |            5 |              0 |                         | 
+          25 |             0 |            6 |              0 |                         | 
+          25 |             0 |            7 |              0 |                         | 
+          25 |             0 |            8 |              0 |                         | 
+          25 |             0 |            9 |              0 |                         | 
+          25 |             0 |           10 |              0 |                         | 
+          25 |             0 |           11 |              0 |                         | 
+          25 |             0 |           12 |              0 |                         | 
+          25 |             0 |           13 |              0 |                         | 
+          26 |             0 |            1 |              0 |                         | 
+          26 |             0 |            2 |              0 |                         | 
+          26 |             0 |            3 |              0 |                         | 
+          26 |             0 |            4 |              0 |                         | 
+          26 |             0 |            5 |              0 |                         | 
+          26 |             0 |            6 |              0 |                         | 
+          26 |             0 |            7 |              0 |                         | 
+          26 |             0 |            8 |              0 |                         | 
+          26 |             0 |            9 |              0 |                         | 
+          26 |             0 |           10 |              0 |                         | 
+          26 |             0 |           11 |              0 |                         | 
+          26 |             0 |           12 |              0 |                         | 
+          26 |             0 |           13 |              0 |                         | 
+          27 |             0 |            1 |              0 |                         | 
+          27 |             0 |            2 |              0 |                         | 
+          27 |             0 |            3 |              0 |                         | 
+          27 |             0 |            4 |              0 |                         | 
+          27 |             0 |            5 |              0 |                         | 
+          27 |             0 |            6 |              0 |                         | 
+          27 |             0 |            7 |              0 |                         | 
+          27 |             0 |            8 |              0 |                         | 
+          27 |             0 |            9 |              0 |                         | 
+          27 |             0 |           10 |              0 |                         | 
+          27 |             0 |           11 |              0 |                         | 
+          27 |             0 |           12 |              0 |                         | 
+          27 |             0 |           13 |              0 |                         | 
+          28 |             0 |            1 |              0 |                         | 
+          28 |             0 |            2 |              0 |                         | 
+          28 |             0 |            3 |              0 |                         | 
+          28 |             0 |            4 |              0 |                         | 
+          28 |             0 |            5 |              0 |                         | 
+          28 |             0 |            6 |              0 |                         | 
+          28 |             0 |            7 |              0 |                         | 
+          28 |             0 |            8 |              0 |                         | 
+          28 |             0 |            9 |              0 |                         | 
+          28 |             0 |           10 |              0 |                         | 
+          28 |             0 |           11 |              0 |                         | 
+          28 |             0 |           12 |              0 |                         | 
+          28 |             0 |           13 |              0 |                         | 
+          29 |             0 |            1 |              0 |                         | 
+          29 |             0 |            2 |              0 |                         | 
+          29 |             0 |            3 |              0 |                         | 
+          29 |             0 |            4 |              0 |                         | 
+          29 |             0 |            5 |              0 |                         | 
+          29 |             0 |            6 |              0 |                         | 
+          29 |             0 |            7 |              0 |                         | 
+          29 |             0 |            8 |              0 |                         | 
+          29 |             0 |            9 |              0 |                         | 
+          29 |             0 |           10 |              0 |                         | 
+          29 |             0 |           11 |              0 |                         | 
+          29 |             0 |           12 |              0 |                         | 
+          29 |             0 |           13 |              0 |                         | 
+          30 |             0 |            1 |              0 |                         | 
+          30 |             0 |            2 |              0 |                         | 
+          30 |             0 |            3 |              0 |                         | 
+          30 |             0 |            4 |              0 |                         | 
+          30 |             0 |            5 |              0 |                         | 
+          30 |             0 |            6 |              0 |                         | 
+          30 |             0 |            7 |              0 |                         | 
+          30 |             0 |            8 |              0 |                         | 
+          30 |             0 |            9 |              0 |                         | 
+          30 |             0 |           10 |              0 |                         | 
+          30 |             0 |           11 |              0 |                         | 
+          30 |             0 |           12 |              0 |                         | 
+          30 |             0 |           13 |              0 |                         | 
+          31 |             0 |            1 |              0 |                         | 
+          31 |             0 |            2 |              0 |                         | 
+          31 |             0 |            3 |              0 |                         | 
+          31 |             0 |            4 |              0 |                         | 
+          31 |             0 |            5 |              0 |                         | 
+          31 |             0 |            6 |              0 |                         | 
+          31 |             0 |            7 |              0 |                         | 
+          31 |             0 |            8 |              0 |                         | 
+          31 |             0 |            9 |              0 |                         | 
+          31 |             0 |           10 |              0 |                         | 
+          31 |             0 |           11 |              0 |                         | 
+          31 |             0 |           12 |              0 |                         | 
+          31 |             0 |           13 |              0 |                         | 
+          32 |             0 |            1 |              0 |                         | 
+          32 |             0 |            2 |              0 |                         | 
+          32 |             0 |            3 |              0 |                         | 
+          32 |             0 |            4 |              0 |                         | 
+          32 |             0 |            5 |              0 |                         | 
+          32 |             0 |            6 |              0 |                         | 
+          32 |             0 |            7 |              0 |                         | 
+          32 |             0 |            8 |              0 |                         | 
+          32 |             0 |            9 |              0 |                         | 
+          32 |             0 |           10 |              0 |                         | 
+          32 |             0 |           11 |              0 |                         | 
+          32 |             0 |           12 |              0 |                         | 
+          32 |             0 |           13 |              0 |                         | 
+          33 |             0 |            1 |              0 |                         | 
+          33 |             0 |            2 |              0 |                         | 
+          33 |             0 |            3 |              0 |                         | 
+          33 |             0 |            4 |              0 |                         | 
+          33 |             0 |            5 |              0 |                         | 
+          33 |             0 |            6 |              0 |                         | 
+          33 |             0 |            7 |              0 |                         | 
+          33 |             0 |            8 |              0 |                         | 
+          33 |             0 |            9 |              0 |                         | 
+          33 |             0 |           10 |              0 |                         | 
+          33 |             0 |           11 |              0 |                         | 
+          33 |             0 |           12 |              0 |                         | 
+          33 |             0 |           13 |              0 |                         | 
+          34 |             0 |            1 |              0 |                         | 
+          34 |             0 |            2 |              0 |                         | 
+          34 |             0 |            3 |              0 |                         | 
+          34 |             0 |            4 |              0 |                         | 
+          34 |             0 |            5 |              0 |                         | 
+          34 |             0 |            6 |              0 |                         | 
+          34 |             0 |            7 |              0 |                         | 
+          34 |             0 |            8 |              0 |                         | 
+          34 |             0 |            9 |              0 |                         | 
+          34 |             0 |           10 |              0 |                         | 
+          34 |             0 |           11 |              0 |                         | 
+          34 |             0 |           12 |              0 |                         | 
+          34 |             0 |           13 |              0 |                         | 
+          35 |             0 |            1 |              0 |                         | 
+          35 |             0 |            2 |              0 |                         | 
+          35 |             0 |            3 |              0 |                         | 
+          35 |             0 |            4 |              0 |                         | 
+          35 |             0 |            5 |              0 |                         | 
+          35 |             0 |            6 |              0 |                         | 
+          35 |             0 |            7 |              0 |                         | 
+          35 |             0 |            8 |              0 |                         | 
+          35 |             0 |            9 |              0 |                         | 
+          35 |             0 |           10 |              0 |                         | 
+          35 |             0 |           11 |              0 |                         | 
+          35 |             0 |           12 |              0 |                         | 
+          35 |             0 |           13 |              0 |                         | 
+          36 |             0 |            1 |              0 |                         | 
+          36 |             0 |            2 |              0 |                         | 
+          36 |             0 |            3 |              0 |                         | 
+          36 |             0 |            4 |              0 |                         | 
+          36 |             0 |            5 |              0 |                         | 
+          36 |             0 |            6 |              0 |                         | 
+          36 |             0 |            7 |              0 |                         | 
+          36 |             0 |            8 |              0 |                         | 
+          36 |             0 |            9 |              0 |                         | 
+          36 |             0 |           10 |              0 |                         | 
+          36 |             0 |           11 |              0 |                         | 
+          36 |             0 |           12 |              0 |                         | 
+          36 |             0 |           13 |              0 |                         | 
+             |               |           14 |              0 | unmatched outer         | 
+(469 rows)
+
+rollback to settings;
+rollback;
 -- Hash join reuses the HOT status bit to indicate match status. This can only
 -- be guaranteed to produce correct results if all the hash join tuple match
 -- bits are reset before reuse. This is done upon loading them into the
@@ -1127,6 +3079,1971 @@ WHERE
 (1 row)
 
 ROLLBACK;
+-- Serial Adaptive Hash Join
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8090));
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+ANALYZE probeside, hashside_wide;
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
+------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
++----------------------------------------------------------------
+ Hash Left Join (actual rows=215 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash | btrim | id | hash |                 btrim                  
++------+-------+----+------+----------------------------------------
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    1 |       |  1 |    1 | 
+    3 |       |  3 |    3 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+    6 |       |  6 |    6 | 
+      |       |  1 |    1 | unmatched inner tuple in first stripe
+      |       |  1 |    1 | unmatched inner tuple in last stripe
+      |       |  1 |    1 | unmatched inner tuple in middle stripe
+(214 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
++----------------------------------------------------------------
+ Hash Right Join (actual rows=214 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash |                 btrim                  
++------+-----------------------+----+------+----------------------------------------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+      |                       |  1 |    1 | unmatched inner tuple in first stripe
+      |                       |  1 |    1 | unmatched inner tuple in last stripe
+      |                       |  1 |    1 | unmatched inner tuple in middle stripe
+(218 rows)
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+                           QUERY PLAN                           
++----------------------------------------------------------------
+ Hash Full Join (actual rows=218 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
++-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
++----------------------------------------------------------------
+ Hash Semi Join (actual rows=12 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash | btrim 
++------+-------
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    1 | 
+    3 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+    6 | 
+(12 rows)
+
++-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+                           QUERY PLAN                           
++----------------------------------------------------------------
+ Hash Anti Join (actual rows=4 loops=1)
+   Hash Cond: (probeside.a = hashside_wide.a)
+   ->  Seq Scan on probeside (actual rows=16 loops=1)
+   ->  Hash (actual rows=42 loops=1)
+         Buckets: 8 (originally 8)  Batches: 32 (originally 8)
+         Batch: 1  Stripes: 3
+         Batch: 6  Stripes: 3
+         ->  Seq Scan on hashside_wide (actual rows=42 loops=1)
+(8 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+ hash |         btrim         
++------+-----------------------
+    1 | unmatched outer tuple
+    2 | 
+    5 | 
+    6 | unmatched outer tuple
+(4 rows)
+
++-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+                                  QUERY PLAN                                   
++-------------------------------------------------------------------------------
+ Gather (actual rows=215 loops=1)
+   Workers Planned: 1
+   Workers Launched: 1
+   ->  Parallel Hash Left Join (actual rows=108 loops=2)
+         Hash Cond: (probeside.a = hashside_wide.a)
+         ->  Parallel Seq Scan on probeside (actual rows=16 loops=1)
+         ->  Parallel Hash (actual rows=21 loops=2)
+               Buckets: 8 (originally 8)  Batches: 128 (originally 8)
+               Batch: 1  Stripes: 3
+               Batch: 6  Stripes: 3
+               ->  Parallel Seq Scan on hashside_wide (actual rows=42 loops=1)
+(11 rows)
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+ hash |         btrim         | id | hash | btrim 
++------+-----------------------+----+------+-------
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 |                       |  1 |    1 | 
+    1 | unmatched outer tuple |    |      | 
+    2 |                       |    |      | 
+    3 |                       |  3 |    3 | 
+    5 |                       |    |      | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 |                       |  6 |    6 | 
+    6 | unmatched outer tuple |    |      | 
+(215 rows)
+
+rollback to settings;
++-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(id int generated always as identity, a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0(a) SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0(a) SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+CREATE TABLE hashside_wide_batch0(id int generated always as identity, a stub);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+SELECT
+       hashside_wide_batch0.id as hashside_id, 
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id, 
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value 
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+ hashside_id | hashside_hash | probeside_id | probeside_hash | probeside_trimmed_value | hashside_trimmed_value 
++-------------+---------------+--------------+----------------+-------------------------+------------------------
+           1 |             0 |            1 |              0 |                         | 
+           1 |             0 |            2 |              0 |                         | 
+           1 |             0 |            3 |              0 |                         | 
+           1 |             0 |            4 |              0 |                         | 
+           1 |             0 |            5 |              0 |                         | 
+           1 |             0 |            6 |              0 |                         | 
+           1 |             0 |            7 |              0 |                         | 
+           1 |             0 |            8 |              0 |                         | 
+           1 |             0 |            9 |              0 |                         | 
+           1 |             0 |           10 |              0 |                         | 
+           1 |             0 |           11 |              0 |                         | 
+           1 |             0 |           12 |              0 |                         | 
+           1 |             0 |           13 |              0 |                         | 
+           2 |             0 |            1 |              0 |                         | 
+           2 |             0 |            2 |              0 |                         | 
+           2 |             0 |            3 |              0 |                         | 
+           2 |             0 |            4 |              0 |                         | 
+           2 |             0 |            5 |              0 |                         | 
+           2 |             0 |            6 |              0 |                         | 
+           2 |             0 |            7 |              0 |                         | 
+           2 |             0 |            8 |              0 |                         | 
+           2 |             0 |            9 |              0 |                         | 
+           2 |             0 |           10 |              0 |                         | 
+           2 |             0 |           11 |              0 |                         | 
+           2 |             0 |           12 |              0 |                         | 
+           2 |             0 |           13 |              0 |                         | 
+           3 |             0 |            1 |              0 |                         | 
+           3 |             0 |            2 |              0 |                         | 
+           3 |             0 |            3 |              0 |                         | 
+           3 |             0 |            4 |              0 |                         | 
+           3 |             0 |            5 |              0 |                         | 
+           3 |             0 |            6 |              0 |                         | 
+           3 |             0 |            7 |              0 |                         | 
+           3 |             0 |            8 |              0 |                         | 
+           3 |             0 |            9 |              0 |                         | 
+           3 |             0 |           10 |              0 |                         | 
+           3 |             0 |           11 |              0 |                         | 
+           3 |             0 |           12 |              0 |                         | 
+           3 |             0 |           13 |              0 |                         | 
+           4 |             0 |            1 |              0 |                         | 
+           4 |             0 |            2 |              0 |                         | 
+           4 |             0 |            3 |              0 |                         | 
+           4 |             0 |            4 |              0 |                         | 
+           4 |             0 |            5 |              0 |                         | 
+           4 |             0 |            6 |              0 |                         | 
+           4 |             0 |            7 |              0 |                         | 
+           4 |             0 |            8 |              0 |                         | 
+           4 |             0 |            9 |              0 |                         | 
+           4 |             0 |           10 |              0 |                         | 
+           4 |             0 |           11 |              0 |                         | 
+           4 |             0 |           12 |              0 |                         | 
+           4 |             0 |           13 |              0 |                         | 
+           5 |             0 |            1 |              0 |                         | 
+           5 |             0 |            2 |              0 |                         | 
+           5 |             0 |            3 |              0 |                         | 
+           5 |             0 |            4 |              0 |                         | 
+           5 |             0 |            5 |              0 |                         | 
+           5 |             0 |            6 |              0 |                         | 
+           5 |             0 |            7 |              0 |                         | 
+           5 |             0 |            8 |              0 |                         | 
+           5 |             0 |            9 |              0 |                         | 
+           5 |             0 |           10 |              0 |                         | 
+           5 |             0 |           11 |              0 |                         | 
+           5 |             0 |           12 |              0 |                         | 
+           5 |             0 |           13 |              0 |                         | 
+           6 |             0 |            1 |              0 |                         | 
+           6 |             0 |            2 |              0 |                         | 
+           6 |             0 |            3 |              0 |                         | 
+           6 |             0 |            4 |              0 |                         | 
+           6 |             0 |            5 |              0 |                         | 
+           6 |             0 |            6 |              0 |                         | 
+           6 |             0 |            7 |              0 |                         | 
+           6 |             0 |            8 |              0 |                         | 
+           6 |             0 |            9 |              0 |                         | 
+           6 |             0 |           10 |              0 |                         | 
+           6 |             0 |           11 |              0 |                         | 
+           6 |             0 |           12 |              0 |                         | 
+           6 |             0 |           13 |              0 |                         | 
+           7 |             0 |            1 |              0 |                         | 
+           7 |             0 |            2 |              0 |                         | 
+           7 |             0 |            3 |              0 |                         | 
+           7 |             0 |            4 |              0 |                         | 
+           7 |             0 |            5 |              0 |                         | 
+           7 |             0 |            6 |              0 |                         | 
+           7 |             0 |            7 |              0 |                         | 
+           7 |             0 |            8 |              0 |                         | 
+           7 |             0 |            9 |              0 |                         | 
+           7 |             0 |           10 |              0 |                         | 
+           7 |             0 |           11 |              0 |                         | 
+           7 |             0 |           12 |              0 |                         | 
+           7 |             0 |           13 |              0 |                         | 
+           8 |             0 |            1 |              0 |                         | 
+           8 |             0 |            2 |              0 |                         | 
+           8 |             0 |            3 |              0 |                         | 
+           8 |             0 |            4 |              0 |                         | 
+           8 |             0 |            5 |              0 |                         | 
+           8 |             0 |            6 |              0 |                         | 
+           8 |             0 |            7 |              0 |                         | 
+           8 |             0 |            8 |              0 |                         | 
+           8 |             0 |            9 |              0 |                         | 
+           8 |             0 |           10 |              0 |                         | 
+           8 |             0 |           11 |              0 |                         | 
+           8 |             0 |           12 |              0 |                         | 
+           8 |             0 |           13 |              0 |                         | 
+           9 |             0 |            1 |              0 |                         | 
+           9 |             0 |            2 |              0 |                         | 
+           9 |             0 |            3 |              0 |                         | 
+           9 |             0 |            4 |              0 |                         | 
+           9 |             0 |            5 |              0 |                         | 
+           9 |             0 |            6 |              0 |                         | 
+           9 |             0 |            7 |              0 |                         | 
+           9 |             0 |            8 |              0 |                         | 
+           9 |             0 |            9 |              0 |                         | 
+           9 |             0 |           10 |              0 |                         | 
+           9 |             0 |           11 |              0 |                         | 
+           9 |             0 |           12 |              0 |                         | 
+           9 |             0 |           13 |              0 |                         | 
+          10 |             0 |            1 |              0 |                         | 
+          10 |             0 |            2 |              0 |                         | 
+          10 |             0 |            3 |              0 |                         | 
+          10 |             0 |            4 |              0 |                         | 
+          10 |             0 |            5 |              0 |                         | 
+          10 |             0 |            6 |              0 |                         | 
+          10 |             0 |            7 |              0 |                         | 
+          10 |             0 |            8 |              0 |                         | 
+          10 |             0 |            9 |              0 |                         | 
+          10 |             0 |           10 |              0 |                         | 
+          10 |             0 |           11 |              0 |                         | 
+          10 |             0 |           12 |              0 |                         | 
+          10 |             0 |           13 |              0 |                         | 
+          11 |             0 |            1 |              0 |                         | 
+          11 |             0 |            2 |              0 |                         | 
+          11 |             0 |            3 |              0 |                         | 
+          11 |             0 |            4 |              0 |                         | 
+          11 |             0 |            5 |              0 |                         | 
+          11 |             0 |            6 |              0 |                         | 
+          11 |             0 |            7 |              0 |                         | 
+          11 |             0 |            8 |              0 |                         | 
+          11 |             0 |            9 |              0 |                         | 
+          11 |             0 |           10 |              0 |                         | 
+          11 |             0 |           11 |              0 |                         | 
+          11 |             0 |           12 |              0 |                         | 
+          11 |             0 |           13 |              0 |                         | 
+          12 |             0 |            1 |              0 |                         | 
+          12 |             0 |            2 |              0 |                         | 
+          12 |             0 |            3 |              0 |                         | 
+          12 |             0 |            4 |              0 |                         | 
+          12 |             0 |            5 |              0 |                         | 
+          12 |             0 |            6 |              0 |                         | 
+          12 |             0 |            7 |              0 |                         | 
+          12 |             0 |            8 |              0 |                         | 
+          12 |             0 |            9 |              0 |                         | 
+          12 |             0 |           10 |              0 |                         | 
+          12 |             0 |           11 |              0 |                         | 
+          12 |             0 |           12 |              0 |                         | 
+          12 |             0 |           13 |              0 |                         | 
+          13 |             0 |            1 |              0 |                         | 
+          13 |             0 |            2 |              0 |                         | 
+          13 |             0 |            3 |              0 |                         | 
+          13 |             0 |            4 |              0 |                         | 
+          13 |             0 |            5 |              0 |                         | 
+          13 |             0 |            6 |              0 |                         | 
+          13 |             0 |            7 |              0 |                         | 
+          13 |             0 |            8 |              0 |                         | 
+          13 |             0 |            9 |              0 |                         | 
+          13 |             0 |           10 |              0 |                         | 
+          13 |             0 |           11 |              0 |                         | 
+          13 |             0 |           12 |              0 |                         | 
+          13 |             0 |           13 |              0 |                         | 
+          14 |             0 |            1 |              0 |                         | 
+          14 |             0 |            2 |              0 |                         | 
+          14 |             0 |            3 |              0 |                         | 
+          14 |             0 |            4 |              0 |                         | 
+          14 |             0 |            5 |              0 |                         | 
+          14 |             0 |            6 |              0 |                         | 
+          14 |             0 |            7 |              0 |                         | 
+          14 |             0 |            8 |              0 |                         | 
+          14 |             0 |            9 |              0 |                         | 
+          14 |             0 |           10 |              0 |                         | 
+          14 |             0 |           11 |              0 |                         | 
+          14 |             0 |           12 |              0 |                         | 
+          14 |             0 |           13 |              0 |                         | 
+          15 |             0 |            1 |              0 |                         | 
+          15 |             0 |            2 |              0 |                         | 
+          15 |             0 |            3 |              0 |                         | 
+          15 |             0 |            4 |              0 |                         | 
+          15 |             0 |            5 |              0 |                         | 
+          15 |             0 |            6 |              0 |                         | 
+          15 |             0 |            7 |              0 |                         | 
+          15 |             0 |            8 |              0 |                         | 
+          15 |             0 |            9 |              0 |                         | 
+          15 |             0 |           10 |              0 |                         | 
+          15 |             0 |           11 |              0 |                         | 
+          15 |             0 |           12 |              0 |                         | 
+          15 |             0 |           13 |              0 |                         | 
+          16 |             0 |            1 |              0 |                         | 
+          16 |             0 |            2 |              0 |                         | 
+          16 |             0 |            3 |              0 |                         | 
+          16 |             0 |            4 |              0 |                         | 
+          16 |             0 |            5 |              0 |                         | 
+          16 |             0 |            6 |              0 |                         | 
+          16 |             0 |            7 |              0 |                         | 
+          16 |             0 |            8 |              0 |                         | 
+          16 |             0 |            9 |              0 |                         | 
+          16 |             0 |           10 |              0 |                         | 
+          16 |             0 |           11 |              0 |                         | 
+          16 |             0 |           12 |              0 |                         | 
+          16 |             0 |           13 |              0 |                         | 
+          17 |             0 |            1 |              0 |                         | 
+          17 |             0 |            2 |              0 |                         | 
+          17 |             0 |            3 |              0 |                         | 
+          17 |             0 |            4 |              0 |                         | 
+          17 |             0 |            5 |              0 |                         | 
+          17 |             0 |            6 |              0 |                         | 
+          17 |             0 |            7 |              0 |                         | 
+          17 |             0 |            8 |              0 |                         | 
+          17 |             0 |            9 |              0 |                         | 
+          17 |             0 |           10 |              0 |                         | 
+          17 |             0 |           11 |              0 |                         | 
+          17 |             0 |           12 |              0 |                         | 
+          17 |             0 |           13 |              0 |                         | 
+          18 |             0 |            1 |              0 |                         | 
+          18 |             0 |            2 |              0 |                         | 
+          18 |             0 |            3 |              0 |                         | 
+          18 |             0 |            4 |              0 |                         | 
+          18 |             0 |            5 |              0 |                         | 
+          18 |             0 |            6 |              0 |                         | 
+          18 |             0 |            7 |              0 |                         | 
+          18 |             0 |            8 |              0 |                         | 
+          18 |             0 |            9 |              0 |                         | 
+          18 |             0 |           10 |              0 |                         | 
+          18 |             0 |           11 |              0 |                         | 
+          18 |             0 |           12 |              0 |                         | 
+          18 |             0 |           13 |              0 |                         | 
+          19 |             0 |            1 |              0 |                         | 
+          19 |             0 |            2 |              0 |                         | 
+          19 |             0 |            3 |              0 |                         | 
+          19 |             0 |            4 |              0 |                         | 
+          19 |             0 |            5 |              0 |                         | 
+          19 |             0 |            6 |              0 |                         | 
+          19 |             0 |            7 |              0 |                         | 
+          19 |             0 |            8 |              0 |                         | 
+          19 |             0 |            9 |              0 |                         | 
+          19 |             0 |           10 |              0 |                         | 
+          19 |             0 |           11 |              0 |                         | 
+          19 |             0 |           12 |              0 |                         | 
+          19 |             0 |           13 |              0 |                         | 
+          20 |             0 |            1 |              0 |                         | 
+          20 |             0 |            2 |              0 |                         | 
+          20 |             0 |            3 |              0 |                         | 
+          20 |             0 |            4 |              0 |                         | 
+          20 |             0 |            5 |              0 |                         | 
+          20 |             0 |            6 |              0 |                         | 
+          20 |             0 |            7 |              0 |                         | 
+          20 |             0 |            8 |              0 |                         | 
+          20 |             0 |            9 |              0 |                         | 
+          20 |             0 |           10 |              0 |                         | 
+          20 |             0 |           11 |              0 |                         | 
+          20 |             0 |           12 |              0 |                         | 
+          20 |             0 |           13 |              0 |                         | 
+          21 |             0 |            1 |              0 |                         | 
+          21 |             0 |            2 |              0 |                         | 
+          21 |             0 |            3 |              0 |                         | 
+          21 |             0 |            4 |              0 |                         | 
+          21 |             0 |            5 |              0 |                         | 
+          21 |             0 |            6 |              0 |                         | 
+          21 |             0 |            7 |              0 |                         | 
+          21 |             0 |            8 |              0 |                         | 
+          21 |             0 |            9 |              0 |                         | 
+          21 |             0 |           10 |              0 |                         | 
+          21 |             0 |           11 |              0 |                         | 
+          21 |             0 |           12 |              0 |                         | 
+          21 |             0 |           13 |              0 |                         | 
+          22 |             0 |            1 |              0 |                         | 
+          22 |             0 |            2 |              0 |                         | 
+          22 |             0 |            3 |              0 |                         | 
+          22 |             0 |            4 |              0 |                         | 
+          22 |             0 |            5 |              0 |                         | 
+          22 |             0 |            6 |              0 |                         | 
+          22 |             0 |            7 |              0 |                         | 
+          22 |             0 |            8 |              0 |                         | 
+          22 |             0 |            9 |              0 |                         | 
+          22 |             0 |           10 |              0 |                         | 
+          22 |             0 |           11 |              0 |                         | 
+          22 |             0 |           12 |              0 |                         | 
+          22 |             0 |           13 |              0 |                         | 
+          23 |             0 |            1 |              0 |                         | 
+          23 |             0 |            2 |              0 |                         | 
+          23 |             0 |            3 |              0 |                         | 
+          23 |             0 |            4 |              0 |                         | 
+          23 |             0 |            5 |              0 |                         | 
+          23 |             0 |            6 |              0 |                         | 
+          23 |             0 |            7 |              0 |                         | 
+          23 |             0 |            8 |              0 |                         | 
+          23 |             0 |            9 |              0 |                         | 
+          23 |             0 |           10 |              0 |                         | 
+          23 |             0 |           11 |              0 |                         | 
+          23 |             0 |           12 |              0 |                         | 
+          23 |             0 |           13 |              0 |                         | 
+          24 |             0 |            1 |              0 |                         | 
+          24 |             0 |            2 |              0 |                         | 
+          24 |             0 |            3 |              0 |                         | 
+          24 |             0 |            4 |              0 |                         | 
+          24 |             0 |            5 |              0 |                         | 
+          24 |             0 |            6 |              0 |                         | 
+          24 |             0 |            7 |              0 |                         | 
+          24 |             0 |            8 |              0 |                         | 
+          24 |             0 |            9 |              0 |                         | 
+          24 |             0 |           10 |              0 |                         | 
+          24 |             0 |           11 |              0 |                         | 
+          24 |             0 |           12 |              0 |                         | 
+          24 |             0 |           13 |              0 |                         | 
+          25 |             0 |            1 |              0 |                         | 
+          25 |             0 |            2 |              0 |                         | 
+          25 |             0 |            3 |              0 |                         | 
+          25 |             0 |            4 |              0 |                         | 
+          25 |             0 |            5 |              0 |                         | 
+          25 |             0 |            6 |              0 |                         | 
+          25 |             0 |            7 |              0 |                         | 
+          25 |             0 |            8 |              0 |                         | 
+          25 |             0 |            9 |              0 |                         | 
+          25 |             0 |           10 |              0 |                         | 
+          25 |             0 |           11 |              0 |                         | 
+          25 |             0 |           12 |              0 |                         | 
+          25 |             0 |           13 |              0 |                         | 
+          26 |             0 |            1 |              0 |                         | 
+          26 |             0 |            2 |              0 |                         | 
+          26 |             0 |            3 |              0 |                         | 
+          26 |             0 |            4 |              0 |                         | 
+          26 |             0 |            5 |              0 |                         | 
+          26 |             0 |            6 |              0 |                         | 
+          26 |             0 |            7 |              0 |                         | 
+          26 |             0 |            8 |              0 |                         | 
+          26 |             0 |            9 |              0 |                         | 
+          26 |             0 |           10 |              0 |                         | 
+          26 |             0 |           11 |              0 |                         | 
+          26 |             0 |           12 |              0 |                         | 
+          26 |             0 |           13 |              0 |                         | 
+          27 |             0 |            1 |              0 |                         | 
+          27 |             0 |            2 |              0 |                         | 
+          27 |             0 |            3 |              0 |                         | 
+          27 |             0 |            4 |              0 |                         | 
+          27 |             0 |            5 |              0 |                         | 
+          27 |             0 |            6 |              0 |                         | 
+          27 |             0 |            7 |              0 |                         | 
+          27 |             0 |            8 |              0 |                         | 
+          27 |             0 |            9 |              0 |                         | 
+          27 |             0 |           10 |              0 |                         | 
+          27 |             0 |           11 |              0 |                         | 
+          27 |             0 |           12 |              0 |                         | 
+          27 |             0 |           13 |              0 |                         | 
+             |               |           14 |              0 | unmatched outer         | 
+(352 rows)
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+savepoint settings;
+set max_parallel_workers_per_gather = 1;
+set enable_parallel_hash = on;
+set work_mem = '64kB';
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a);
+                                      QUERY PLAN                                      
++--------------------------------------------------------------------------------------
+ Gather (actual rows=469 loops=1)
+   Workers Planned: 1
+   Workers Launched: 1
+   ->  Parallel Hash Left Join (actual rows=234 loops=2)
+         Hash Cond: (probeside_batch0.a = hashside_wide_batch0.a)
+         ->  Parallel Seq Scan on probeside_batch0 (actual rows=14 loops=1)
+         ->  Parallel Hash (actual rows=18 loops=2)
+               Buckets: 8 (originally 8)  Batches: 16 (originally 8)
+               Batch: 0  Stripes: 5
+               ->  Parallel Seq Scan on hashside_wide_batch0 (actual rows=36 loops=1)
+(10 rows)
+
+SELECT
+       hashside_wide_batch0.id as hashside_id, 
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id, 
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value 
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+ hashside_id | hashside_hash | probeside_id | probeside_hash | probeside_trimmed_value | hashside_trimmed_value 
++-------------+---------------+--------------+----------------+-------------------------+------------------------
+           1 |             0 |            1 |              0 |                         | 
+           1 |             0 |            2 |              0 |                         | 
+           1 |             0 |            3 |              0 |                         | 
+           1 |             0 |            4 |              0 |                         | 
+           1 |             0 |            5 |              0 |                         | 
+           1 |             0 |            6 |              0 |                         | 
+           1 |             0 |            7 |              0 |                         | 
+           1 |             0 |            8 |              0 |                         | 
+           1 |             0 |            9 |              0 |                         | 
+           1 |             0 |           10 |              0 |                         | 
+           1 |             0 |           11 |              0 |                         | 
+           1 |             0 |           12 |              0 |                         | 
+           1 |             0 |           13 |              0 |                         | 
+           2 |             0 |            1 |              0 |                         | 
+           2 |             0 |            2 |              0 |                         | 
+           2 |             0 |            3 |              0 |                         | 
+           2 |             0 |            4 |              0 |                         | 
+           2 |             0 |            5 |              0 |                         | 
+           2 |             0 |            6 |              0 |                         | 
+           2 |             0 |            7 |              0 |                         | 
+           2 |             0 |            8 |              0 |                         | 
+           2 |             0 |            9 |              0 |                         | 
+           2 |             0 |           10 |              0 |                         | 
+           2 |             0 |           11 |              0 |                         | 
+           2 |             0 |           12 |              0 |                         | 
+           2 |             0 |           13 |              0 |                         | 
+           3 |             0 |            1 |              0 |                         | 
+           3 |             0 |            2 |              0 |                         | 
+           3 |             0 |            3 |              0 |                         | 
+           3 |             0 |            4 |              0 |                         | 
+           3 |             0 |            5 |              0 |                         | 
+           3 |             0 |            6 |              0 |                         | 
+           3 |             0 |            7 |              0 |                         | 
+           3 |             0 |            8 |              0 |                         | 
+           3 |             0 |            9 |              0 |                         | 
+           3 |             0 |           10 |              0 |                         | 
+           3 |             0 |           11 |              0 |                         | 
+           3 |             0 |           12 |              0 |                         | 
+           3 |             0 |           13 |              0 |                         | 
+           4 |             0 |            1 |              0 |                         | 
+           4 |             0 |            2 |              0 |                         | 
+           4 |             0 |            3 |              0 |                         | 
+           4 |             0 |            4 |              0 |                         | 
+           4 |             0 |            5 |              0 |                         | 
+           4 |             0 |            6 |              0 |                         | 
+           4 |             0 |            7 |              0 |                         | 
+           4 |             0 |            8 |              0 |                         | 
+           4 |             0 |            9 |              0 |                         | 
+           4 |             0 |           10 |              0 |                         | 
+           4 |             0 |           11 |              0 |                         | 
+           4 |             0 |           12 |              0 |                         | 
+           4 |             0 |           13 |              0 |                         | 
+           5 |             0 |            1 |              0 |                         | 
+           5 |             0 |            2 |              0 |                         | 
+           5 |             0 |            3 |              0 |                         | 
+           5 |             0 |            4 |              0 |                         | 
+           5 |             0 |            5 |              0 |                         | 
+           5 |             0 |            6 |              0 |                         | 
+           5 |             0 |            7 |              0 |                         | 
+           5 |             0 |            8 |              0 |                         | 
+           5 |             0 |            9 |              0 |                         | 
+           5 |             0 |           10 |              0 |                         | 
+           5 |             0 |           11 |              0 |                         | 
+           5 |             0 |           12 |              0 |                         | 
+           5 |             0 |           13 |              0 |                         | 
+           6 |             0 |            1 |              0 |                         | 
+           6 |             0 |            2 |              0 |                         | 
+           6 |             0 |            3 |              0 |                         | 
+           6 |             0 |            4 |              0 |                         | 
+           6 |             0 |            5 |              0 |                         | 
+           6 |             0 |            6 |              0 |                         | 
+           6 |             0 |            7 |              0 |                         | 
+           6 |             0 |            8 |              0 |                         | 
+           6 |             0 |            9 |              0 |                         | 
+           6 |             0 |           10 |              0 |                         | 
+           6 |             0 |           11 |              0 |                         | 
+           6 |             0 |           12 |              0 |                         | 
+           6 |             0 |           13 |              0 |                         | 
+           7 |             0 |            1 |              0 |                         | 
+           7 |             0 |            2 |              0 |                         | 
+           7 |             0 |            3 |              0 |                         | 
+           7 |             0 |            4 |              0 |                         | 
+           7 |             0 |            5 |              0 |                         | 
+           7 |             0 |            6 |              0 |                         | 
+           7 |             0 |            7 |              0 |                         | 
+           7 |             0 |            8 |              0 |                         | 
+           7 |             0 |            9 |              0 |                         | 
+           7 |             0 |           10 |              0 |                         | 
+           7 |             0 |           11 |              0 |                         | 
+           7 |             0 |           12 |              0 |                         | 
+           7 |             0 |           13 |              0 |                         | 
+           8 |             0 |            1 |              0 |                         | 
+           8 |             0 |            2 |              0 |                         | 
+           8 |             0 |            3 |              0 |                         | 
+           8 |             0 |            4 |              0 |                         | 
+           8 |             0 |            5 |              0 |                         | 
+           8 |             0 |            6 |              0 |                         | 
+           8 |             0 |            7 |              0 |                         | 
+           8 |             0 |            8 |              0 |                         | 
+           8 |             0 |            9 |              0 |                         | 
+           8 |             0 |           10 |              0 |                         | 
+           8 |             0 |           11 |              0 |                         | 
+           8 |             0 |           12 |              0 |                         | 
+           8 |             0 |           13 |              0 |                         | 
+           9 |             0 |            1 |              0 |                         | 
+           9 |             0 |            2 |              0 |                         | 
+           9 |             0 |            3 |              0 |                         | 
+           9 |             0 |            4 |              0 |                         | 
+           9 |             0 |            5 |              0 |                         | 
+           9 |             0 |            6 |              0 |                         | 
+           9 |             0 |            7 |              0 |                         | 
+           9 |             0 |            8 |              0 |                         | 
+           9 |             0 |            9 |              0 |                         | 
+           9 |             0 |           10 |              0 |                         | 
+           9 |             0 |           11 |              0 |                         | 
+           9 |             0 |           12 |              0 |                         | 
+           9 |             0 |           13 |              0 |                         | 
+          10 |             0 |            1 |              0 |                         | 
+          10 |             0 |            2 |              0 |                         | 
+          10 |             0 |            3 |              0 |                         | 
+          10 |             0 |            4 |              0 |                         | 
+          10 |             0 |            5 |              0 |                         | 
+          10 |             0 |            6 |              0 |                         | 
+          10 |             0 |            7 |              0 |                         | 
+          10 |             0 |            8 |              0 |                         | 
+          10 |             0 |            9 |              0 |                         | 
+          10 |             0 |           10 |              0 |                         | 
+          10 |             0 |           11 |              0 |                         | 
+          10 |             0 |           12 |              0 |                         | 
+          10 |             0 |           13 |              0 |                         | 
+          11 |             0 |            1 |              0 |                         | 
+          11 |             0 |            2 |              0 |                         | 
+          11 |             0 |            3 |              0 |                         | 
+          11 |             0 |            4 |              0 |                         | 
+          11 |             0 |            5 |              0 |                         | 
+          11 |             0 |            6 |              0 |                         | 
+          11 |             0 |            7 |              0 |                         | 
+          11 |             0 |            8 |              0 |                         | 
+          11 |             0 |            9 |              0 |                         | 
+          11 |             0 |           10 |              0 |                         | 
+          11 |             0 |           11 |              0 |                         | 
+          11 |             0 |           12 |              0 |                         | 
+          11 |             0 |           13 |              0 |                         | 
+          12 |             0 |            1 |              0 |                         | 
+          12 |             0 |            2 |              0 |                         | 
+          12 |             0 |            3 |              0 |                         | 
+          12 |             0 |            4 |              0 |                         | 
+          12 |             0 |            5 |              0 |                         | 
+          12 |             0 |            6 |              0 |                         | 
+          12 |             0 |            7 |              0 |                         | 
+          12 |             0 |            8 |              0 |                         | 
+          12 |             0 |            9 |              0 |                         | 
+          12 |             0 |           10 |              0 |                         | 
+          12 |             0 |           11 |              0 |                         | 
+          12 |             0 |           12 |              0 |                         | 
+          12 |             0 |           13 |              0 |                         | 
+          13 |             0 |            1 |              0 |                         | 
+          13 |             0 |            2 |              0 |                         | 
+          13 |             0 |            3 |              0 |                         | 
+          13 |             0 |            4 |              0 |                         | 
+          13 |             0 |            5 |              0 |                         | 
+          13 |             0 |            6 |              0 |                         | 
+          13 |             0 |            7 |              0 |                         | 
+          13 |             0 |            8 |              0 |                         | 
+          13 |             0 |            9 |              0 |                         | 
+          13 |             0 |           10 |              0 |                         | 
+          13 |             0 |           11 |              0 |                         | 
+          13 |             0 |           12 |              0 |                         | 
+          13 |             0 |           13 |              0 |                         | 
+          14 |             0 |            1 |              0 |                         | 
+          14 |             0 |            2 |              0 |                         | 
+          14 |             0 |            3 |              0 |                         | 
+          14 |             0 |            4 |              0 |                         | 
+          14 |             0 |            5 |              0 |                         | 
+          14 |             0 |            6 |              0 |                         | 
+          14 |             0 |            7 |              0 |                         | 
+          14 |             0 |            8 |              0 |                         | 
+          14 |             0 |            9 |              0 |                         | 
+          14 |             0 |           10 |              0 |                         | 
+          14 |             0 |           11 |              0 |                         | 
+          14 |             0 |           12 |              0 |                         | 
+          14 |             0 |           13 |              0 |                         | 
+          15 |             0 |            1 |              0 |                         | 
+          15 |             0 |            2 |              0 |                         | 
+          15 |             0 |            3 |              0 |                         | 
+          15 |             0 |            4 |              0 |                         | 
+          15 |             0 |            5 |              0 |                         | 
+          15 |             0 |            6 |              0 |                         | 
+          15 |             0 |            7 |              0 |                         | 
+          15 |             0 |            8 |              0 |                         | 
+          15 |             0 |            9 |              0 |                         | 
+          15 |             0 |           10 |              0 |                         | 
+          15 |             0 |           11 |              0 |                         | 
+          15 |             0 |           12 |              0 |                         | 
+          15 |             0 |           13 |              0 |                         | 
+          16 |             0 |            1 |              0 |                         | 
+          16 |             0 |            2 |              0 |                         | 
+          16 |             0 |            3 |              0 |                         | 
+          16 |             0 |            4 |              0 |                         | 
+          16 |             0 |            5 |              0 |                         | 
+          16 |             0 |            6 |              0 |                         | 
+          16 |             0 |            7 |              0 |                         | 
+          16 |             0 |            8 |              0 |                         | 
+          16 |             0 |            9 |              0 |                         | 
+          16 |             0 |           10 |              0 |                         | 
+          16 |             0 |           11 |              0 |                         | 
+          16 |             0 |           12 |              0 |                         | 
+          16 |             0 |           13 |              0 |                         | 
+          17 |             0 |            1 |              0 |                         | 
+          17 |             0 |            2 |              0 |                         | 
+          17 |             0 |            3 |              0 |                         | 
+          17 |             0 |            4 |              0 |                         | 
+          17 |             0 |            5 |              0 |                         | 
+          17 |             0 |            6 |              0 |                         | 
+          17 |             0 |            7 |              0 |                         | 
+          17 |             0 |            8 |              0 |                         | 
+          17 |             0 |            9 |              0 |                         | 
+          17 |             0 |           10 |              0 |                         | 
+          17 |             0 |           11 |              0 |                         | 
+          17 |             0 |           12 |              0 |                         | 
+          17 |             0 |           13 |              0 |                         | 
+          18 |             0 |            1 |              0 |                         | 
+          18 |             0 |            2 |              0 |                         | 
+          18 |             0 |            3 |              0 |                         | 
+          18 |             0 |            4 |              0 |                         | 
+          18 |             0 |            5 |              0 |                         | 
+          18 |             0 |            6 |              0 |                         | 
+          18 |             0 |            7 |              0 |                         | 
+          18 |             0 |            8 |              0 |                         | 
+          18 |             0 |            9 |              0 |                         | 
+          18 |             0 |           10 |              0 |                         | 
+          18 |             0 |           11 |              0 |                         | 
+          18 |             0 |           12 |              0 |                         | 
+          18 |             0 |           13 |              0 |                         | 
+          19 |             0 |            1 |              0 |                         | 
+          19 |             0 |            2 |              0 |                         | 
+          19 |             0 |            3 |              0 |                         | 
+          19 |             0 |            4 |              0 |                         | 
+          19 |             0 |            5 |              0 |                         | 
+          19 |             0 |            6 |              0 |                         | 
+          19 |             0 |            7 |              0 |                         | 
+          19 |             0 |            8 |              0 |                         | 
+          19 |             0 |            9 |              0 |                         | 
+          19 |             0 |           10 |              0 |                         | 
+          19 |             0 |           11 |              0 |                         | 
+          19 |             0 |           12 |              0 |                         | 
+          19 |             0 |           13 |              0 |                         | 
+          20 |             0 |            1 |              0 |                         | 
+          20 |             0 |            2 |              0 |                         | 
+          20 |             0 |            3 |              0 |                         | 
+          20 |             0 |            4 |              0 |                         | 
+          20 |             0 |            5 |              0 |                         | 
+          20 |             0 |            6 |              0 |                         | 
+          20 |             0 |            7 |              0 |                         | 
+          20 |             0 |            8 |              0 |                         | 
+          20 |             0 |            9 |              0 |                         | 
+          20 |             0 |           10 |              0 |                         | 
+          20 |             0 |           11 |              0 |                         | 
+          20 |             0 |           12 |              0 |                         | 
+          20 |             0 |           13 |              0 |                         | 
+          21 |             0 |            1 |              0 |                         | 
+          21 |             0 |            2 |              0 |                         | 
+          21 |             0 |            3 |              0 |                         | 
+          21 |             0 |            4 |              0 |                         | 
+          21 |             0 |            5 |              0 |                         | 
+          21 |             0 |            6 |              0 |                         | 
+          21 |             0 |            7 |              0 |                         | 
+          21 |             0 |            8 |              0 |                         | 
+          21 |             0 |            9 |              0 |                         | 
+          21 |             0 |           10 |              0 |                         | 
+          21 |             0 |           11 |              0 |                         | 
+          21 |             0 |           12 |              0 |                         | 
+          21 |             0 |           13 |              0 |                         | 
+          22 |             0 |            1 |              0 |                         | 
+          22 |             0 |            2 |              0 |                         | 
+          22 |             0 |            3 |              0 |                         | 
+          22 |             0 |            4 |              0 |                         | 
+          22 |             0 |            5 |              0 |                         | 
+          22 |             0 |            6 |              0 |                         | 
+          22 |             0 |            7 |              0 |                         | 
+          22 |             0 |            8 |              0 |                         | 
+          22 |             0 |            9 |              0 |                         | 
+          22 |             0 |           10 |              0 |                         | 
+          22 |             0 |           11 |              0 |                         | 
+          22 |             0 |           12 |              0 |                         | 
+          22 |             0 |           13 |              0 |                         | 
+          23 |             0 |            1 |              0 |                         | 
+          23 |             0 |            2 |              0 |                         | 
+          23 |             0 |            3 |              0 |                         | 
+          23 |             0 |            4 |              0 |                         | 
+          23 |             0 |            5 |              0 |                         | 
+          23 |             0 |            6 |              0 |                         | 
+          23 |             0 |            7 |              0 |                         | 
+          23 |             0 |            8 |              0 |                         | 
+          23 |             0 |            9 |              0 |                         | 
+          23 |             0 |           10 |              0 |                         | 
+          23 |             0 |           11 |              0 |                         | 
+          23 |             0 |           12 |              0 |                         | 
+          23 |             0 |           13 |              0 |                         | 
+          24 |             0 |            1 |              0 |                         | 
+          24 |             0 |            2 |              0 |                         | 
+          24 |             0 |            3 |              0 |                         | 
+          24 |             0 |            4 |              0 |                         | 
+          24 |             0 |            5 |              0 |                         | 
+          24 |             0 |            6 |              0 |                         | 
+          24 |             0 |            7 |              0 |                         | 
+          24 |             0 |            8 |              0 |                         | 
+          24 |             0 |            9 |              0 |                         | 
+          24 |             0 |           10 |              0 |                         | 
+          24 |             0 |           11 |              0 |                         | 
+          24 |             0 |           12 |              0 |                         | 
+          24 |             0 |           13 |              0 |                         | 
+          25 |             0 |            1 |              0 |                         | 
+          25 |             0 |            2 |              0 |                         | 
+          25 |             0 |            3 |              0 |                         | 
+          25 |             0 |            4 |              0 |                         | 
+          25 |             0 |            5 |              0 |                         | 
+          25 |             0 |            6 |              0 |                         | 
+          25 |             0 |            7 |              0 |                         | 
+          25 |             0 |            8 |              0 |                         | 
+          25 |             0 |            9 |              0 |                         | 
+          25 |             0 |           10 |              0 |                         | 
+          25 |             0 |           11 |              0 |                         | 
+          25 |             0 |           12 |              0 |                         | 
+          25 |             0 |           13 |              0 |                         | 
+          26 |             0 |            1 |              0 |                         | 
+          26 |             0 |            2 |              0 |                         | 
+          26 |             0 |            3 |              0 |                         | 
+          26 |             0 |            4 |              0 |                         | 
+          26 |             0 |            5 |              0 |                         | 
+          26 |             0 |            6 |              0 |                         | 
+          26 |             0 |            7 |              0 |                         | 
+          26 |             0 |            8 |              0 |                         | 
+          26 |             0 |            9 |              0 |                         | 
+          26 |             0 |           10 |              0 |                         | 
+          26 |             0 |           11 |              0 |                         | 
+          26 |             0 |           12 |              0 |                         | 
+          26 |             0 |           13 |              0 |                         | 
+          27 |             0 |            1 |              0 |                         | 
+          27 |             0 |            2 |              0 |                         | 
+          27 |             0 |            3 |              0 |                         | 
+          27 |             0 |            4 |              0 |                         | 
+          27 |             0 |            5 |              0 |                         | 
+          27 |             0 |            6 |              0 |                         | 
+          27 |             0 |            7 |              0 |                         | 
+          27 |             0 |            8 |              0 |                         | 
+          27 |             0 |            9 |              0 |                         | 
+          27 |             0 |           10 |              0 |                         | 
+          27 |             0 |           11 |              0 |                         | 
+          27 |             0 |           12 |              0 |                         | 
+          27 |             0 |           13 |              0 |                         | 
+          28 |             0 |            1 |              0 |                         | 
+          28 |             0 |            2 |              0 |                         | 
+          28 |             0 |            3 |              0 |                         | 
+          28 |             0 |            4 |              0 |                         | 
+          28 |             0 |            5 |              0 |                         | 
+          28 |             0 |            6 |              0 |                         | 
+          28 |             0 |            7 |              0 |                         | 
+          28 |             0 |            8 |              0 |                         | 
+          28 |             0 |            9 |              0 |                         | 
+          28 |             0 |           10 |              0 |                         | 
+          28 |             0 |           11 |              0 |                         | 
+          28 |             0 |           12 |              0 |                         | 
+          28 |             0 |           13 |              0 |                         | 
+          29 |             0 |            1 |              0 |                         | 
+          29 |             0 |            2 |              0 |                         | 
+          29 |             0 |            3 |              0 |                         | 
+          29 |             0 |            4 |              0 |                         | 
+          29 |             0 |            5 |              0 |                         | 
+          29 |             0 |            6 |              0 |                         | 
+          29 |             0 |            7 |              0 |                         | 
+          29 |             0 |            8 |              0 |                         | 
+          29 |             0 |            9 |              0 |                         | 
+          29 |             0 |           10 |              0 |                         | 
+          29 |             0 |           11 |              0 |                         | 
+          29 |             0 |           12 |              0 |                         | 
+          29 |             0 |           13 |              0 |                         | 
+          30 |             0 |            1 |              0 |                         | 
+          30 |             0 |            2 |              0 |                         | 
+          30 |             0 |            3 |              0 |                         | 
+          30 |             0 |            4 |              0 |                         | 
+          30 |             0 |            5 |              0 |                         | 
+          30 |             0 |            6 |              0 |                         | 
+          30 |             0 |            7 |              0 |                         | 
+          30 |             0 |            8 |              0 |                         | 
+          30 |             0 |            9 |              0 |                         | 
+          30 |             0 |           10 |              0 |                         | 
+          30 |             0 |           11 |              0 |                         | 
+          30 |             0 |           12 |              0 |                         | 
+          30 |             0 |           13 |              0 |                         | 
+          31 |             0 |            1 |              0 |                         | 
+          31 |             0 |            2 |              0 |                         | 
+          31 |             0 |            3 |              0 |                         | 
+          31 |             0 |            4 |              0 |                         | 
+          31 |             0 |            5 |              0 |                         | 
+          31 |             0 |            6 |              0 |                         | 
+          31 |             0 |            7 |              0 |                         | 
+          31 |             0 |            8 |              0 |                         | 
+          31 |             0 |            9 |              0 |                         | 
+          31 |             0 |           10 |              0 |                         | 
+          31 |             0 |           11 |              0 |                         | 
+          31 |             0 |           12 |              0 |                         | 
+          31 |             0 |           13 |              0 |                         | 
+          32 |             0 |            1 |              0 |                         | 
+          32 |             0 |            2 |              0 |                         | 
+          32 |             0 |            3 |              0 |                         | 
+          32 |             0 |            4 |              0 |                         | 
+          32 |             0 |            5 |              0 |                         | 
+          32 |             0 |            6 |              0 |                         | 
+          32 |             0 |            7 |              0 |                         | 
+          32 |             0 |            8 |              0 |                         | 
+          32 |             0 |            9 |              0 |                         | 
+          32 |             0 |           10 |              0 |                         | 
+          32 |             0 |           11 |              0 |                         | 
+          32 |             0 |           12 |              0 |                         | 
+          32 |             0 |           13 |              0 |                         | 
+          33 |             0 |            1 |              0 |                         | 
+          33 |             0 |            2 |              0 |                         | 
+          33 |             0 |            3 |              0 |                         | 
+          33 |             0 |            4 |              0 |                         | 
+          33 |             0 |            5 |              0 |                         | 
+          33 |             0 |            6 |              0 |                         | 
+          33 |             0 |            7 |              0 |                         | 
+          33 |             0 |            8 |              0 |                         | 
+          33 |             0 |            9 |              0 |                         | 
+          33 |             0 |           10 |              0 |                         | 
+          33 |             0 |           11 |              0 |                         | 
+          33 |             0 |           12 |              0 |                         | 
+          33 |             0 |           13 |              0 |                         | 
+          34 |             0 |            1 |              0 |                         | 
+          34 |             0 |            2 |              0 |                         | 
+          34 |             0 |            3 |              0 |                         | 
+          34 |             0 |            4 |              0 |                         | 
+          34 |             0 |            5 |              0 |                         | 
+          34 |             0 |            6 |              0 |                         | 
+          34 |             0 |            7 |              0 |                         | 
+          34 |             0 |            8 |              0 |                         | 
+          34 |             0 |            9 |              0 |                         | 
+          34 |             0 |           10 |              0 |                         | 
+          34 |             0 |           11 |              0 |                         | 
+          34 |             0 |           12 |              0 |                         | 
+          34 |             0 |           13 |              0 |                         | 
+          35 |             0 |            1 |              0 |                         | 
+          35 |             0 |            2 |              0 |                         | 
+          35 |             0 |            3 |              0 |                         | 
+          35 |             0 |            4 |              0 |                         | 
+          35 |             0 |            5 |              0 |                         | 
+          35 |             0 |            6 |              0 |                         | 
+          35 |             0 |            7 |              0 |                         | 
+          35 |             0 |            8 |              0 |                         | 
+          35 |             0 |            9 |              0 |                         | 
+          35 |             0 |           10 |              0 |                         | 
+          35 |             0 |           11 |              0 |                         | 
+          35 |             0 |           12 |              0 |                         | 
+          35 |             0 |           13 |              0 |                         | 
+          36 |             0 |            1 |              0 |                         | 
+          36 |             0 |            2 |              0 |                         | 
+          36 |             0 |            3 |              0 |                         | 
+          36 |             0 |            4 |              0 |                         | 
+          36 |             0 |            5 |              0 |                         | 
+          36 |             0 |            6 |              0 |                         | 
+          36 |             0 |            7 |              0 |                         | 
+          36 |             0 |            8 |              0 |                         | 
+          36 |             0 |            9 |              0 |                         | 
+          36 |             0 |           10 |              0 |                         | 
+          36 |             0 |           11 |              0 |                         | 
+          36 |             0 |           12 |              0 |                         | 
+          36 |             0 |           13 |              0 |                         | 
+             |               |           14 |              0 | unmatched outer         | 
+(469 rows)
+
+rollback to settings;
+rollback;
 -- Verify that we behave sanely when the inner hash keys contain parameters
 -- (that is, outer or lateral references).  This situation has to defeat
 -- re-use of the inner hash table across rescans.
@@ -1136,7 +5053,7 @@ explain (costs off)
 select i8.q2, ss.* from
 int8_tbl i8,
 lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
-         on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+         on t1.fivethous = i4.f1i8.q2 order by 1,2) ss;
                         QUERY PLAN                         
 -----------------------------------------------------------
  Nested Loop
@@ -1144,7 +5061,7 @@ lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
    ->  Sort
          Sort Key: t1.fivethous, i4.f1
          ->  Hash Join
-               Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+               Hash Cond: (t1.fivethous = (i4.f1  i8.q2))
                ->  Seq Scan on tenk1 t1
                ->  Hash
                      ->  Seq Scan on int4_tbl i4
@@ -1153,7 +5070,7 @@ lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
 select i8.q2, ss.* from
 int8_tbl i8,
 lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
-         on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+         on t1.fivethous = i4.f1i8.q2 order by 1,2) ss;
  q2  | fivethous | f1 
 -----+-----------+----
  456 |       456 |  0
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index 6b0688ab0a6..8cd4fec4e32 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -488,22 +488,26 @@ rollback to settings;
 
 -- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
 -- sts_puttuple oversized tuple cases because it's multi-batch)
-savepoint settings;
-set max_parallel_workers_per_gather = 2;
-set enable_parallel_hash = on;
-set work_mem = '128kB';
-set hash_mem_multiplier = 1.0;
-explain (costs off)
-  select length(max(s.t))
-  from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-select length(max(s.t))
-from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-select final > 1 as multibatch
-  from hash_join_batches(
-$$
-  select length(max(s.t))
-  from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
-$$);
+-- savepoint settings;
+-- set max_parallel_workers_per_gather = 2;
+-- set enable_parallel_hash = on;
+-- TODO: throw an error when this happens: cannot set work_mem lower than the side of a single tuple
+-- TODO: ensure that oversize tuple code is still exercised (should be with some of the stub stuff below)
+-- TODO: commented this out since it would crash otherwise
+-- this test is no longer multi-batch, so, perhaps, it should be removed
+-- set work_mem = '128kB';
+-- explain (costs off)
+--   select length(max(s.t))
+--   from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- select length(max(s.t))
+-- from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- select final > 1 as multibatch
+--   from hash_join_batches(
+-- $$
+--   select length(max(s.t))
+--   from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+-- $$);
+-- rollback to settings;
 rollback to settings;
 
 
@@ -605,6 +609,184 @@ WHERE
 
 ROLLBACK;
 
+-- Serial Adaptive Hash Join
+
+BEGIN;
+CREATE TYPE stub AS (hash INTEGER, value CHAR(8090));
+
+CREATE FUNCTION stub_hash(item stub)
+RETURNS INTEGER AS $$
+DECLARE
+  batch_size INTEGER;
+BEGIN
+  batch_size := 4;
+  RETURN item.hash << (batch_size - 1);
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE FUNCTION stub_eq(item1 stub, item2 stub)
+RETURNS BOOLEAN AS $$
+BEGIN
+  RETURN item1.hash = item2.hash AND item1.value = item2.value;
+END; $$ LANGUAGE plpgsql IMMUTABLE LEAKPROOF STRICT PARALLEL SAFE;
+
+CREATE OPERATOR = (
+  FUNCTION = stub_eq,
+  LEFTARG = stub,
+  RIGHTARG = stub,
+  COMMUTATOR = =,
+  HASHES, MERGES
+);
+
+CREATE OPERATOR CLASS stub_hash_ops
+DEFAULT FOR TYPE stub USING hash AS
+  OPERATOR 1 =(stub, stub),
+  FUNCTION 1 stub_hash(stub);
+
+CREATE TABLE probeside(a stub);
+ALTER TABLE probeside ALTER COLUMN a SET STORAGE PLAIN;
+-- non-fallback batch with unmatched outer tuple
+INSERT INTO probeside SELECT '(2, "")' FROM generate_series(1, 1);
+-- fallback batch unmatched outer tuple (in first stripe maybe)
+INSERT INTO probeside SELECT '(1, "unmatched outer tuple")' FROM generate_series(1, 1);
+-- fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(1, "")' FROM generate_series(1, 5);
+-- fallback batch unmatched outer tuple (in last stripe maybe)
+-- When numbatches=4, hash 5 maps to batch 1, but after numbatches doubles to
+-- 8 batches hash 5 maps to batch 5.
+INSERT INTO probeside SELECT '(5, "")' FROM generate_series(1, 1);
+-- non-fallback batch matched outer tuple
+INSERT INTO probeside SELECT '(3, "")' FROM generate_series(1, 1);
+-- batch with 3 stripes where non-first/non-last stripe contains unmatched outer tuple
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 5);
+INSERT INTO probeside SELECT '(6, "unmatched outer tuple")' FROM generate_series(1, 1);
+INSERT INTO probeside SELECT '(6, "")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide(a stub, id int);
+ALTER TABLE hashside_wide ALTER COLUMN a SET STORAGE PLAIN;
+-- falls back with an unmatched inner tuple that is in fist, middle, and last
+-- stripe
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in first stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in middle stripe")', 1 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(1, "")', 1 FROM generate_series(1, 9);
+INSERT INTO hashside_wide SELECT '(1, "unmatched inner tuple in last stripe")', 1 FROM generate_series(1, 1);
+
+-- doesn't fall back -- matched tuple
+INSERT INTO hashside_wide SELECT '(3, "")', 3 FROM generate_series(1, 1);
+INSERT INTO hashside_wide SELECT '(6, "")', 6 FROM generate_series(1, 20);
+
+ANALYZE probeside, hashside_wide;
+
+SET enable_nestloop TO off;
+SET enable_mergejoin TO off;
+SET work_mem = 64;
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+RIGHT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+FULL OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+FULL OUTER JOIN hashside_wide USING (a);
+
+-- semi-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- anti-join testcase
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off)
+SELECT probeside.* FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value)
+FROM probeside WHERE NOT EXISTS (SELECT * FROM hashside_wide WHERE probeside.a=a) ORDER BY 1, 2;
+
+-- parallel LOJ test case with two batches falling back
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_parallel_hash = on;
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a);
+
+SELECT (probeside.a).hash, TRIM((probeside.a).value), hashside_wide.id, (hashside_wide.a).hash, TRIM((hashside_wide.a).value)
+FROM probeside
+LEFT OUTER JOIN hashside_wide USING (a)
+ORDER BY 1, 2, 3, 4, 5;
+rollback to settings;
+
+-- Test spill of batch 0 gives correct results.
+CREATE TABLE probeside_batch0(id int generated always as identity, a stub);
+ALTER TABLE probeside_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO probeside_batch0(a) SELECT '(0, "")' FROM generate_series(1, 13);
+INSERT INTO probeside_batch0(a) SELECT '(0, "unmatched outer")' FROM generate_series(1, 1);
+
+CREATE TABLE hashside_wide_batch0(id int generated always as identity, a stub);
+ALTER TABLE hashside_wide_batch0 ALTER COLUMN a SET STORAGE PLAIN;
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+ANALYZE probeside_batch0, hashside_wide_batch0;
+
+SELECT
+       hashside_wide_batch0.id as hashside_id,
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id,
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+
+savepoint settings;
+set max_parallel_workers_per_gather = 1;
+set enable_parallel_hash = on;
+set work_mem = '64kB';
+
+INSERT INTO hashside_wide_batch0(a) SELECT '(0, "")' FROM generate_series(1, 9);
+
+EXPLAIN (ANALYZE, summary off, timing off, costs off, usage off) SELECT * FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a);
+
+SELECT
+       hashside_wide_batch0.id as hashside_id,
+       (hashside_wide_batch0.a).hash as hashside_hash,
+        probeside_batch0.id as probeside_id,
+       (probeside_batch0.a).hash as probeside_hash,
+        TRIM((probeside_batch0.a).value) as probeside_trimmed_value,
+        TRIM((hashside_wide_batch0.a).value) as hashside_trimmed_value
+FROM probeside_batch0
+LEFT OUTER JOIN hashside_wide_batch0 USING (a)
+ORDER BY 1, 2, 3, 4, 5, 6;
+rollback to settings;
+
+rollback;
+
 -- Verify that we behave sanely when the inner hash keys contain parameters
 -- (that is, outer or lateral references).  This situation has to defeat
 -- re-use of the inner hash table across rescans.