lockup in parallel hash join on dikkop (freebsd 14.0-current)

Started by Tomas Vondra · about 3 years ago · 40 messages · hackers
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
did not report any results for a couple of days, and it seems it got into
an infinite loop in REL_11_STABLE while building the hash table in a parallel
hashjoin, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which involves sending signals etc.

Anyway, in 'ps ax' I saw this:

94545 - Ss 0:03.39 postgres: buildfarm regression [local] SELECT
94627 - Is 0:00.03 postgres: parallel worker for PID 94545
94628 - Is 0:00.02 postgres: parallel worker for PID 94545

and the backend was stuck waiting on this query:

select final > 1 as multibatch
from hash_join_batches(
$$
select count(*) from join_foo
left join (select b1.id, b1.t from join_bar b1 join join_bar
b2 using (id)) ss
on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
$$);

This started on 2023-01-20 23:23:18.125, and the next log (after I did
the gdb stuff), is from 2023-01-26 20:05:16.751. Quite a bit of time.

It seems all three processes are doing WaitEventSetWait, either through
a ConditionVariable, or WaitLatch. But I don't have any good idea of
what might have broken - and as it got "unstuck" I can't investigate
more. But I see there's nodeHash and parallelism, and I recall there's a
lot of gotchas due to how the backends cooperate when building the hash
table, etc. Thomas, any idea what might be wrong?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

94628.bt.txt (text/plain)
94627.bt.txt (text/plain)
94545.bt.txt (text/plain)
query.log (text/x-log)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#1)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
did not report any results for a couple of days, and it seems it got into
an infinite loop in REL_11_STABLE while building the hash table in a parallel
hashjoin, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which involves sending signals etc.

That reminds me of cases that I saw several times on my now-deceased
animal florican:

/messages/by-id/2245838.1645902425@sss.pgh.pa.us

There's clearly something rotten somewhere in there, but whether
it's our bug or FreeBSD's isn't clear.

regards, tom lane

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#2)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On Fri, Jan 27, 2023 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
did not report any results for a couple of days, and it seems it got into
an infinite loop in REL_11_STABLE while building the hash table in a parallel
hashjoin, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which involves sending signals etc.

That reminds me of cases that I saw several times on my now-deceased
animal florican:

/messages/by-id/2245838.1645902425@sss.pgh.pa.us

There's clearly something rotten somewhere in there, but whether
it's our bug or FreeBSD's isn't clear.

And if it's ours, it's possibly in latch code and not anything higher
(I mean, not in condition variables, barriers, or parallel hash join)
because I saw a similar hang in the shm_mq stuff which uses the latch
API directly. Note that 13 switched to kqueue but still used the
self-pipe, and 14 switched to a signal event, and this hasn't been
reported in those releases or later, which makes the poll() code path
a key suspect.
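
For anyone who hasn't looked at the pre-13 code: the "self-pipe trick" turns
an asynchronous signal into a pollable event by having the signal handler
write a byte to a pipe that poll() is watching. Here is a minimal standalone
illustration of the idea (invented names, not PostgreSQL's actual latch
code):

```c
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

static int self_pipe[2];

/* Signal handler: writing a byte to the pipe is async-signal-safe and
 * makes the read end readable, which wakes any poll() watching it. */
static void latch_sigusr1_handler(int signo)
{
    char b = 0;

    (void) signo;
    (void) write(self_pipe[1], &b, 1);
}

/* Run the whole round trip: install handler, raise SIGUSR1, then poll.
 * Returns 1 if poll() reported the read end readable, 0 otherwise. */
int run_selfpipe_demo(void)
{
    struct pollfd pfd;

    if (pipe(self_pipe) != 0)
        return 0;
    fcntl(self_pipe[0], F_SETFL, O_NONBLOCK);
    fcntl(self_pipe[1], F_SETFL, O_NONBLOCK);
    signal(SIGUSR1, latch_sigusr1_handler);

    raise(SIGUSR1);                 /* handler runs and writes one byte */

    pfd.fd = self_pipe[0];
    pfd.events = POLLIN;
    if (poll(&pfd, 1, 1000) != 1)   /* byte is in the pipe, so no wait */
        return 0;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

If poll() ever failed to return while such a byte was sitting in the pipe,
that would be a kernel problem rather than a problem with this scheme.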

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#3)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On Fri, Jan 27, 2023 at 9:57 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Jan 27, 2023 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
did not report any results for a couple of days, and it seems it got into
an infinite loop in REL_11_STABLE while building the hash table in a parallel
hashjoin, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which involves sending signals etc.

That reminds me of cases that I saw several times on my now-deceased
animal florican:

/messages/by-id/2245838.1645902425@sss.pgh.pa.us

There's clearly something rotten somewhere in there, but whether
it's our bug or FreeBSD's isn't clear.

And if it's ours, it's possibly in latch code and not anything higher
(I mean, not in condition variables, barriers, or parallel hash join)
because I saw a similar hang in the shm_mq stuff which uses the latch
API directly. Note that 13 switched to kqueue but still used the
self-pipe, and 14 switched to a signal event, and this hasn't been
reported in those releases or later, which makes the poll() code path
a key suspect.

Also, 14 changed the flag/memory barrier dance (maybe_sleeping), but
13 did it the same way as 11 + 12. So between 12 and 13 we have just
the poll -> kqueue change.

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#4)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
check loops, on the same FBSD 13.1 machine as elver which has failed
like this once before, I haven't been able to reproduce this on
REL_12_STABLE. Not really sure how to chase this, but if you see this
situation again, I'd be interested to see the output of fstat -p PID
(shows bytes in pipes) and procstat -j PID (shows pending signals) for
all PIDs involved (before connecting a debugger or doing anything else
that might make it return with EINTR, after which we know it continues
happily because it then sees latch->is_set next time around the loop).
If poll() is not returning when there are bytes ready to read from the
self-pipe, which fstat can show, I think that'd indicate a kernel bug.
If procstat -j shows signals pending but it's somehow still blocked in
the syscall, that would also suggest a kernel bug. Otherwise, it might
indicate a compiler or postgres bug, but I don't have any particular
theories.

#6 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#5)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

Hi,

On 2023-01-27 22:23:58 +1300, Thomas Munro wrote:

After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
check loops, on the same FBSD 13.1 machine as elver which has failed
like this once before, I haven't been able to reproduce this on
REL_12_STABLE.

Did you use the same compiler / compilation flags as when elver hit it?
Clearly Tomas' case was with at least some optimizations enabled.

Were it not for the fact that you hit this on elver (amd64), I'd find it
interesting that we see the failure on an arm host, which has a weaker
memory ordering model than x86.

IIUC elver previously hit this on 12?

Greetings,

Andres Freund

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#6)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

Andres Freund <andres@anarazel.de> writes:

Were it not for the fact that you hit this on elver (amd64), I'd find it
interesting that we see the failure on an arm host, which has a weaker
memory ordering model than x86.

I also saw it on florican, which is/was an i386 machine using clang and
pretty standard build options other than
'CFLAGS' => '-msse2 -O2',
so I think this isn't too much about machine architecture or compiler
flags.

Machine speed might matter though. elver is a good deal faster than
florican was, and dikkop is slower yet. I gather Thomas has seen this
only once on elver, but I saw it maybe a dozen times over a couple of
years on florican, and now dikkop has hit it after not so many runs.

regards, tom lane

#8 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#7)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

Hi,

On 2023-01-27 23:18:39 -0500, Tom Lane wrote:

I also saw it on florican, which is/was an i386 machine using clang and
pretty standard build options other than
'CFLAGS' => '-msse2 -O2',
so I think this isn't too much about machine architecture or compiler
flags.

Ah. Florican dropped off the BF status page and I was too lazy to look
deeper. You have a penchant for odd architectures, so it didn't seem too crazy
:)

Machine speed might matter though. elver is a good deal faster than
florican was, and dikkop is slower yet. I gather Thomas has seen this
only once on elver, but I saw it maybe a dozen times over a couple of
years on florican, and now dikkop has hit it after not so many runs.

Re-reading the old thread, it is interesting that you tried hard to reproduce
it outside of the BF, without success:
/messages/by-id/2398828.1646000688@sss.pgh.pa.us

Such problems are quite annoying. Last time I hit such a case was
/messages/by-id/20220325052654.3xpbmntatyofau2w@alap3.anarazel.de
but I can't see anything like that being the issue here.

Greetings,

Andres Freund

#9 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#6)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On Sat, Jan 28, 2023 at 4:42 PM Andres Freund <andres@anarazel.de> wrote:

Did you use the same compiler / compilation flags as when elver hit it?
Clearly Tomas' case was with at least some optimizations enabled.

I did use the same compiler version and optimisation level, clang
llvmorg-13.0.0-0-gd7b669b3a303 at -O2.

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#8)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On 1/28/23 05:53, Andres Freund wrote:

Hi,

On 2023-01-27 23:18:39 -0500, Tom Lane wrote:

I also saw it on florican, which is/was an i386 machine using clang and
pretty standard build options other than
'CFLAGS' => '-msse2 -O2',
so I think this isn't too much about machine architecture or compiler
flags.

Ah. Florican dropped off the BF status page and I was too lazy to look
deeper. You have a penchant for odd architectures, so it didn't seem too crazy
:)

Machine speed might matter though. elver is a good deal faster than
florican was, and dikkop is slower yet. I gather Thomas has seen this
only once on elver, but I saw it maybe a dozen times over a couple of
years on florican, and now dikkop has hit it after not so many runs.

Re-reading the old thread, it is interesting that you tried hard to reproduce
it outside of the BF, without success:
/messages/by-id/2398828.1646000688@sss.pgh.pa.us

Such problems are quite annoying. Last time I hit such a case was
/messages/by-id/20220325052654.3xpbmntatyofau2w@alap3.anarazel.de
but I can't see anything like that being the issue here.

FWIW I'll wait for dikkop to finish the current buildfarm run (it's
currently chewing on HEAD) and then will try to do runs of the 'joins'
test in a loop. That's where dikkop got stuck before.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#10)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On 1/28/23 13:05, Tomas Vondra wrote:

FWIW I'll wait for dikkop to finish the current buildfarm run (it's
currently chewing on HEAD) and then will try to do runs of the 'joins'
test in a loop. That's where dikkop got stuck before.

So I did that - same configure options as the buildfarm client, and a
'make check' (with only tests up to the 'join' suite, because that's
where it got stuck before). And it took only ~15 runs (~1h) to hit this
again on dikkop.

As before, there are three processes - leader + 2 workers, but the query
is different - this time it's this one:

-- A couple of other hash join tests unrelated to work_mem management.
-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
savepoint settings;
set local max_parallel_workers_per_gather = 2;
set local work_mem = '4MB';
set local parallel_leader_participation = off;
select * from hash_join_batches(
$$
select count(*) from simple r join simple s using (id);
$$);

I managed to collect the fstat/procstat stuff Thomas asked for, and the
backtraces - attached. I still have the core files, in case we look at
something. As before, running gcore on the second worker (29081) gets
this unstuck - it sends some signal that apparently wakes it up.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

bt.29046.log (text/x-log)
bt.29080.log (text/x-log)
bt.29081.log (text/x-log)
fstat.29046.log (text/x-log)
fstat.29080.log (text/x-log)
fstat.29081.log (text/x-log)
procstat.29046.log (text/x-log)
procstat.29080.log (text/x-log)
procstat.29081.log (text/x-log)
ps-ax.log (text/x-log)
#12 Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#11)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So I did that - same configure options as the buildfarm client, and a
'make check' (with only tests up to the 'join' suite, because that's
where it got stuck before). And it took only ~15 runs (~1h) to hit this
again on dikkop.

That's good news.

I managed to collect the fstat/procstat stuff Thomas asked for, and the
backtraces - attached. I still have the core files, in case we look at
something. As before, running gcore on the second worker (29081) gets
this unstuck - it sends some signal that apparently wakes it up.

Thanks! As expected, no bytes in the pipe for any of those processes.
Unfortunately I gave the wrong procstat command, it should be -i, not
-j. Does "procstat -i /path/to/core | grep USR1" show P (pending) for
that stuck process? Silly question really, I don't really expect
poll() to be misbehaving in such a basic way.

I was talking to Andres on IM about this yesterday and he pointed out
a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
tell the signal handler to write to the self-pipe) and then reads
latch->is_set with neither compiler nor memory barrier, which doesn't
seem right because we might see a value of latch->is_set from before
"waiting" was true, and yet the signal handler might also have run
while "waiting" was false so the self-pipe doesn't save us, despite
the length of the comment about that. Can you reproduce it with this
change?

--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
                 * ordering, so that we cannot miss seeing is_set if a notification
                 * has already been queued.
                 */
+               pg_memory_barrier();
                if (set->latch && set->latch->is_set)
                {
                        occurred_events->fd = PGINVALID_SOCKET;
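
To make the suspected reordering concrete, here is a tiny standalone model
of the two sides of the handshake, with invented names and C11 fences
standing in for pg_memory_barrier(). Without the fences, the waiter's store
of "waiting" may be reordered after its load of "is_set" (and symmetrically
on the setter side), so both sides can conclude they have nothing to do:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Standalone model, not PostgreSQL code.  Both flags start false. */
static atomic_bool is_set;
static atomic_bool waiting;

/* Setter side, as in SetLatch(): publish is_set, then check whether the
 * peer is waiting and therefore needs an explicit wakeup (signal). */
bool setter_needs_wakeup(void)
{
    atomic_store_explicit(&is_set, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* barrier after is_set = true */
    return atomic_load_explicit(&waiting, memory_order_relaxed);
}

/* Waiter side, as in the recheck in WaitEventSetWait(): publish waiting,
 * then re-read is_set; if it's already set, we must not sleep. */
bool waiter_sees_set(void)
{
    atomic_store_explicit(&waiting, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* the proposed barrier */
    return atomic_load_explicit(&is_set, memory_order_relaxed);
}
```

With both fences in place, at least one side must observe the other: the
lost-wakeup interleaving in which setter_needs_wakeup() and
waiter_sees_set() both return false becomes impossible.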
#13 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#12)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On 1/29/23 18:26, Thomas Munro wrote:

On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So I did that - same configure options as the buildfarm client, and a
'make check' (with only tests up to the 'join' suite, because that's
where it got stuck before). And it took only ~15 runs (~1h) to hit this
again on dikkop.

That's good news.

I managed to collect the fstat/procstat stuff Thomas asked for, and the
backtraces - attached. I still have the core files, in case we look at
something. As before, running gcore on the second worker (29081) gets
this unstuck - it sends some signal that apparently wakes it up.

Thanks! As expected, no bytes in the pipe for any of those processes.
Unfortunately I gave the wrong procstat command, it should be -i, not
-j. Does "procstat -i /path/to/core | grep USR1" show P (pending) for
that stuck process? Silly question really, I don't really expect
poll() to be misbehaving in such a basic way.

It shows "--C" for all three processes, which should mean "will be caught".

I was talking to Andres on IM about this yesterday and he pointed out
a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
tell the signal handler to write to the self-pipe) and then reads
latch->is_set with neither compiler nor memory barrier, which doesn't
seem right because we might see a value of latch->is_set from before
"waiting" was true, and yet the signal handler might also have run
while "waiting" was false so the self-pipe doesn't save us, despite
the length of the comment about that. Can you reproduce it with this
change?

Will do, but I'll wait for another lockup to see how frequent it
actually is. I'm now at ~90 runs total, and it didn't happen again yet.
So hitting it after 15 runs might have been a bit of luck.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#12)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

Hi,

On 2023-01-30 06:26:02 +1300, Thomas Munro wrote:

On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So I did that - same configure options as the buildfarm client, and a
'make check' (with only tests up to the 'join' suite, because that's
where it got stuck before). And it took only ~15 runs (~1h) to hit this
again on dikkop.

That's good news.

Indeed.

As annoying as it is, it might be worth reproing it once or twice more, just
to have a feeling for how long we need to run to have confidence in a fix.

I was talking to Andres on IM about this yesterday and he pointed out
a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
tell the signal handler to write to the self-pipe) and then reads
latch->is_set with neither compiler nor memory barrier, which doesn't
seem right because we might see a value of latch->is_set from before
"waiting" was true, and yet the signal handler might also have run
while "waiting" was false so the self-pipe doesn't save us, despite
the length of the comment about that. Can you reproduce it with this
change?

--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
* ordering, so that we cannot miss seeing is_set if a notificat
ion
* has already been queued.
*/
+               pg_memory_barrier();
if (set->latch && set->latch->is_set)
{
occurred_events->fd = PGINVALID_SOCKET;

I think we need a barrier in SetLatch(), after is_set = true. We have that in
some of the newer branches (due to the maybe_sleeping logic), but not in the
older branches.

Greetings,

Andres Freund

#15 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#13)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

Hi,

On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:

Will do, but I'll wait for another lockup to see how frequent it
actually is. I'm now at ~90 runs total, and it didn't happen again yet.
So hitting it after 15 runs might have been a bit of luck.

Was there a difference in how much load there was on the machine between
"reproduced in 15 runs" and "not reproed in 90"? If indeed lack of barriers
is related to the issue, an increase in context switches could substantially
change the behaviour (in both directions). More intra-process context
switches can act as "probabilistic barriers", because a context switch is
effectively a barrier. At the same time it can make it more likely that
the relatively narrow window in WaitEventSetWait() is hit, or lead to
larger delays
processing signals.

Greetings,

Andres Freund

#16 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#15)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On 1/29/23 18:53, Andres Freund wrote:

Hi,

On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:

Will do, but I'll wait for another lockup to see how frequent it
actually is. I'm now at ~90 runs total, and it didn't happen again yet.
So hitting it after 15 runs might have been a bit of luck.

Was there a difference in how much load there was on the machine between
"reproduced in 15 runs" and "not reproed in 90"? If indeed lack of barriers
is related to the issue, an increase in context switches could substantially
change the behaviour (in both directions). More intra-process context
switches can act as "probabilistic barriers", because a context switch is
effectively a barrier. At the same time it can make it more likely that
the relatively narrow window in WaitEventSetWait() is hit, or lead to
larger delays
processing signals.

No. The only thing the machine is doing is

while /usr/bin/true; do
make check
done

I can't reduce the workload further, because the "join" test is in a
separate parallel group (I cut down parallel_schedule). I could make the
machine busier, of course.

However, the other lockup I saw was when using serial_schedule, so I
guess lower concurrency makes it more likely.

But who knows ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17 Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#16)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On Mon, Jan 30, 2023 at 7:08 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

However, the other lockup I saw was when using serial_schedule, so I
guess lower concurrency makes it more likely.

FWIW "psql db -f src/test/regress/sql/join_hash.sql | cat" also works
(I mean, it's self-contained and doesn't need anything else from make
check; pipe to cat just disables the pager); that's how I've been
trying (and failing) to reproduce this on various computers. I also
did a lot of "make -C src/test/modules/test_shm_mq installcheck" loops,
at the same time, because that's where my animal hung.

#18 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#12)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:

out-of-order hazard

I've been trying to understand how that could happen, but my CPU-fu is
weak. Let me try to write an argument for why it can't happen, so
that later I can look back at how stupid and naive I was. We have A then
B, and if the CPU sees no dependency and decides to execute B before A
(pipelined), shouldn't an interrupt either wait for the whole
schemozzle to commit first (if not in a hurry), or nuke it, handle the
IPI and restart, or something? After an hour of reviewing random
slides from classes on out-of-order execution and reorder buffers and
the like, I think the term for making sure that interrupts run with
the illusion of in-order execution maintained is called "precise
interrupts", and it is expected in all modern architectures, after the
early OoO pioneers lost their minds trying to program without it. I
guess generally you want that because it would otherwise run your
interrupt handler in a completely uncertain environment, and
specifically in this case it would reach our signal handler which
reads A's output (waiting) and writes to B's input (is_set), so the
order B, IPI, A surely shouldn't be allowed?

As for compiler barriers, I see that elver's compiler isn't reordering the code.

Maybe it's a much dumber sort of a concurrency problem: stale cache
line due to missing barrier, but... commit db0f6cad488 made us also
set our own latch (a second time) when someone sets our latch in
releases 9.something to 13. Which should mean that we're guaranteed
to see is_set = true in the scenario described, because we'll clobber
it ourselves if we have to, for good measure.

If our secondary SetLatch() sees it's already set and decides not to
set it, then it's possible that the code we interrupted was about to
run ResetLatch(), but any code doing that must next check its expected
exit condition (or it has a common-or-garden latch protocol bug, as
has been discovered from time to time in the tree...).
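
The protocol referred to here is the usual latch wait loop; a simplified
sketch (the Latch type and helpers are stand-ins, not PostgreSQL's API)
shows why an absorbed SetLatch() is harmless as long as the exit condition
is re-checked after every ResetLatch():

```c
#include <stdbool.h>

/* Stand-in types and helpers, not PostgreSQL code. */
typedef struct { volatile bool is_set; } Latch;

void set_latch(Latch *l)   { l->is_set = true; }
void reset_latch(Latch *l) { l->is_set = false; }

/* Global latch and a trivially-true condition for the example below. */
Latch demo_latch;
bool work_done(void) { return true; }

/* Returns how many wakeup/reset cycles were needed.  The crucial rule:
 * after reset_latch(), go around and re-check the condition before
 * sleeping again, so a state change whose SetLatch() was absorbed by the
 * reset is still noticed. */
int wait_for_condition(Latch *latch, bool (*condition)(void))
{
    int cycles = 0;

    for (;;)
    {
        if (condition())
            break;              /* exit condition checked every iteration */
        while (!latch->is_set)  /* stand-in for WaitLatch() blocking */
            ;
        reset_latch(latch);     /* reset BEFORE re-checking the condition */
        cycles++;
    }
    return cycles;
}
```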

/me wanders away with a renewed fear of computers and the vast
complexities they hide

#19 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#18)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

Hi,

On 2023-01-30 15:22:34 +1300, Thomas Munro wrote:

On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:

out-of-order hazard

I've been trying to understand how that could happen, but my CPU-fu is
weak. Let me try to write an argument for why it can't happen, so
that later I can look back at how stupid and naive I was. We have A
B, and if the CPU sees no dependency and decides to execute B A
(pipelined), shouldn't an interrupt either wait for the whole
schemozzle to commit first (if not in a hurry), or nuke it, handle the
IPI and restart, or something?

In a core local view, yes, I think so. But I don't think that's how it can
work on multi-core, and even more so, multi-socket machines. Imagine how it'd
influence latency if every interrupt on any CPU would prevent all out-of-order
execution on any CPU.

After an hour of reviewing random
slides from classes on out-of-order execution and reorder buffers and
the like, I think the term for making sure that interrupts run with
the illusion of in-order execution maintained is called "precise
interrupts", and it is expected in all modern architectures, after the
early OoO pioneers lost their minds trying to program without it. I
guess generally you want that because it would otherwise run your
interrupt handler in a completely uncertain environment, and
specifically in this case it would reach our signal handler which
reads A's output (waiting) and writes to B's input (is_set), so the
order B, IPI, A surely shouldn't be allowed?

Userspace signals aren't delivered synchronously during hardware interrupts
afaik - and I don't think they even possibly could be (after all the process
possibly isn't scheduled).

I think what you're talking about with precise interrupts above is purely
about the single-core view, and mostly about hardware interrupts for faults
etc. The CPU will unwind state from speculatively executed code etc on
interrupt, sure - but I think that's separate from guaranteeing that you can't
have stale cache contents *due to work by another CPU*.

I'm not even sure that userspace signals are generally delivered via an
immediate hardware interrupt, or whether they're processed at the next
scheduler tick. After all, we know that multiple signals are coalesced, which
certainly isn't compatible with synchronous execution. But it could be that
that just happens when the target of a signal is not currently scheduled.

Maybe it's a much dumber sort of a concurrency problem: stale cache
line due to missing barrier, but... commit db0f6cad488 made us also
set our own latch (a second time) when someone sets our latch in
releases 9.something to 13.

But this part does indeed put a crimp on some potential theories.

TBH, I'd be in favor of just adding the barriers for good measure, even if we
don't know if it's a live bug today - it seems incredibly fragile.

Greetings,

Andres Freund

#20 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#19)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

On Mon, Jan 30, 2023 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-01-30 15:22:34 +1300, Thomas Munro wrote:

On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:

out-of-order hazard

I've been trying to understand how that could happen, but my CPU-fu is
weak. Let me try to write an argument for why it can't happen, so
that later I can look back at how stupid and naive I was. We have A then
B, and if the CPU sees no dependency and decides to execute B before A
(pipelined), shouldn't an interrupt either wait for the whole
schemozzle to commit first (if not in a hurry), or nuke it, handle the
IPI and restart, or something?

In a core local view, yes, I think so. But I don't think that's how it can
work on multi-core, and even more so, multi-socket machines. Imagine how it'd
influence latency if every interrupt on any CPU would prevent all out-of-order
execution on any CPU.

Good. Yeah, I was talking only about a single thread/core.

After an hour of reviewing random
slides from classes on out-of-order execution and reorder buffers and
the like, I think the term for making sure that interrupts run with
the illusion of in-order execution maintained is called "precise
interrupts", and it is expected in all modern architectures, after the
early OoO pioneers lost their minds trying to program without it. I
guess generally you want that because it would otherwise run your
interrupt handler in a completely uncertain environment, and
specifically in this case it would reach our signal handler which
reads A's output (waiting) and writes to B's input (is_set), so the
order B, IPI, A surely shouldn't be allowed?

Userspace signals aren't delivered synchronously during hardware interrupts
afaik - and I don't think they even possibly could be (after all the process
possibly isn't scheduled).

Yeah, they're not synchronous and the target might not even be
running. BUT if a suitable thread is running then AFAICT an IPI is
delivered to that sucker to get it running the handler ASAP, at least
on the three OSes I looked at. (See breadcrumbs below).

I think what you're talking about with precise interrupts above is purely
about the single-core view, and mostly about hardware interrupts for faults
etc. The CPU will unwind state from speculatively executed code etc on
interrupt, sure - but I think that's separate from guaranteeing that you can't
have stale cache contents *due to work by another CPU*.

Yeah. I get the cache problem, a separate issue that does indeed look
pretty dodgy. I guess I wrote my email out-of-order: at the end I
speculated that cache coherency probably can't explain this failure at
least in THAT bit of the source, because of that funky extra
self-SetLatch(). I just got spooked by the mention of out-of-order
execution and I wanted to chase it down and straighten out my
understanding.

I'm not even sure that userspace signals are generally delivered via an
immediate hardware interrupt, or whether they're processed at the next
scheduler tick. After all, we know that multiple signals are coalesced, which
certainly isn't compatible with synchronous execution. But it could be that
that just happens when the target of a signal is not currently scheduled.

FreeBSD: By default, they are when possible, eg if the process is
currently running a suitable thread. You can set sysctl
kern.smp.forward_signal_enabled=0 to turn that off, and then it works
more like the way you imagined (checking for pending signals at
various arbitrary times, not sure). See tdsigwakeup() ->
forward_signal() -> ipi_cpu().

Linux: Well it certainly smells approximately similar. See
signal_wake_up_state() -> kick_process() -> smp_send_reschedule() ->
smp_cross_call() -> __ipi_send_mask(). The comment for kick_process()
explains that it's using the scheduler IPI to get signals handled
ASAP.

Darwin: ... -> cpu_signal() -> something that talks about IPIs

Coalescing is happening not only at the pending signal level (an
invention of the OS), and then for the inter-processor wakeups there
is also interrupt coalescing. It's latches all the way down.

#21 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#16)
#22 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#21)
#23 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#22)
#24 Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#23)
#25 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#24)
#26 Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#25)
#27 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#26)
#28 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#27)
#29 Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#28)
#30 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#29)
#31 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#30)
#32 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Lakhin (#31)
#33 Alexander Lakhin
exclusion@gmail.com
In reply to: Tomas Vondra (#32)
#34 Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Lakhin (#31)
#35 Alexander Lakhin
exclusion@gmail.com
In reply to: Robert Haas (#34)
#36 Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#35)
#37 Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#35)
#38 Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#37)
#39 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#38)
#40 Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#39)