LISTEN/NOTIFY testing woes
I ran into a couple of issues while trying to devise a regression test
illustrating the LISTEN-in-serializable-transaction issue Mark Dilger
reported. The first one is that an isolation test in which we expect
to see a cross-process NOTIFY immediately after a COMMIT turns out to
be not very stable: on my machine, it works as long as you're just
running the isolation tests by themselves, but it usually falls over
if I'm running check-world with any amount of parallelism. The reason
for this seems to be that incoming notifies are only checked for when
we're about to wait for client input. At that point we've already
sent the ReadyForQuery ('Z') protocol message, which will cause libpq
to stand down from looking for more input and return a null from
PQgetResult(). Depending on timing, the following Notify protocol
messages might arrive quickly enough that isolationtester.c sees them
before it goes off to do something else, but that's not very reliable.
In the case of self-notifies, postgres.c ensures that those get
transmitted to the frontend *before* ReadyForQuery, and this is what
makes self-notify cases stable enough to survive buildfarm testing.
I'm a bit surprised, now that I've seen this effect, that the existing
cross-session notify tests in async-notify.spec haven't given us
problems in the buildfarm. (Maybe, now that I just pushed those into
the back branches, we'll start to see some failures?) Anyway, what
I propose to do about this is patch 0001 attached, which tweaks
postgres.c to ensure that any cross-session notifies that arrived
during the just-finished transaction are also guaranteed to be sent
to the client before, not after, ReadyForQuery.
Another thing that I discovered while testing this is that as of HEAD,
you can't run "make installcheck" for the isolation tests more than
once without restarting the server. If you do, you get a test result
mismatch because the async-notify test's first invocation of
pg_notification_queue_usage() returns a positive value. Which is
entirely unsurprising, because the previous iteration ensured that
it would, and we've done nothing to make the queue tail advance since
then.
This seems both undesirable for our own testing, and rather bogus
from users' standpoints as well. However, I think a simple fix is
available: just make the SQL pg_notification_queue_usage() function
advance the queue tail before measuring, as in 0002 below. This will
restore the behavior of that function to what it was before 51004c717,
and it doesn't seem like it'd cost any performance in any plausible
use-cases.
0002 is only needed in HEAD, but I'll have to back-patch 0001 as
far as 9.6, to support a test case for the problem Mark discovered
and to ensure that back-patching b10f40bf0 doesn't cause any issues.
BTW, the fix and test case for Mark's issue look like 0003. Without
the 0001 patch, it's unstable exactly when the "listener2: NOTIFY "c1"
with payload "" from notifier" message comes out. But modulo that
issue, this test case reliably shows the assertion failure in the
back branches.
regards, tom lane
Attachments:
0001-stabilize-notify-report-timing.patchtext/x-diff; charset=us-ascii; name=0001-stabilize-notify-report-timing.patchDownload+11-0
0002-stabilize-queue-usage-results.patchtext/x-diff; charset=us-ascii; name=0002-stabilize-queue-usage-results.patchDownload+3-0
0003-fix-serializable-listen.patchtext/x-diff; charset=us-ascii; name=0003-fix-serializable-listen.patchDownload+61-14
On 11/23/19 5:01 PM, Tom Lane wrote:
I ran into a couple of issues while trying to devise a regression test
illustrating the LISTEN-in-serializable-transaction issue Mark Dilger
reported. The first one is that an isolation test in which we expect
to see a cross-process NOTIFY immediately after a COMMIT turns out to
be not very stable: on my machine, it works as long as you're just
running the isolation tests by themselves, but it usually falls over
if I'm running check-world with any amount of parallelism. The reason
for this seems to be that incoming notifies are only checked for when
we're about to wait for client input. At that point we've already
sent the ReadyForQuery ('Z') protocol message, which will cause libpq
to stand down from looking for more input and return a null from
PQgetResult(). Depending on timing, the following Notify protocol
messages might arrive quickly enough that isolationtester.c sees them
before it goes off to do something else, but that's not very reliable.
Thanks for working on this, Tom.
I have finished reading and applying your three patches and have moved
on to testing them. I hope to finish the review soon.
--
Mark Dilger
Hoi Tom,
On Sun, 24 Nov 2019 at 02:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems both undesirable for our own testing, and rather bogus
from users' standpoints as well. However, I think a simple fix is
available: just make the SQL pg_notification_queue_usage() function
advance the queue tail before measuring, as in 0002 below. This will
restore the behavior of that function to what it was before 51004c717,
and it doesn't seem like it'd cost any performance in any plausible
use-cases.
This was one of those open points in the previous patches where it
wasn't quite clear what the correct behaviour should be. This fixes
the issue, but the question in my mind is: do we want to document this
fact and can users rely on this behaviour? If we go with the argument
that the delay in cleaning up should be entirely invisible, then I
guess this patch is the correct one that makes the made changes
invisible. Arguably not doing this means we'd have to document the
values are possibly out of date.
So I think this patch does the right thing.
Have a nice day,
--
Martijn van Oosterhout <kleptog@gmail.com> http://svana.org/kleptog/
Martijn van Oosterhout <kleptog@gmail.com> writes:
On Sun, 24 Nov 2019 at 02:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems both undesirable for our own testing, and rather bogus
from users' standpoints as well. However, I think a simple fix is
available: just make the SQL pg_notification_queue_usage() function
advance the queue tail before measuring, as in 0002 below. This will
restore the behavior of that function to what it was before 51004c717,
and it doesn't seem like it'd cost any performance in any plausible
use-cases.
This was one of those open points in the previous patches where it
wasn't quite clear what the correct behaviour should be. This fixes
the issue, but the question in my mind is: do we want to document this
fact and can users rely on this behaviour? If we go with the argument
that the delay in cleaning up should be entirely invisible, then I
guess this patch is the correct one that makes the made changes
invisible. Arguably not doing this means we'd have to document the
values are possibly out of date.
So I think this patch does the right thing.
Thanks for looking! In the light of morning, there's one other
thing bothering me about this patch: it means that the function has
side-effects, even though those effects are at the implementation
level and shouldn't be user-visible. We do already have it marked
"volatile", so that's OK, but I notice that it's not parallel
restricted. The isolation test still passes when I set
force_parallel_mode = regress, so apparently it works to run this
code in a parallel worker, but that seems pretty scary to me;
certainly nothing in async.c was written with that in mind.
I think we'd be well advised to adjust pg_proc.dat to mark
pg_notification_queue_usage() as parallel-restricted, so that
it only executes in the main session process. It's hard to
see any use-case for parallelizing it that would justify even
a small chance of new bugs.
regards, tom lane
On 11/23/19 8:50 PM, Mark Dilger wrote:
On 11/23/19 5:01 PM, Tom Lane wrote:
I ran into a couple of issues while trying to devise a regression test
illustrating the LISTEN-in-serializable-transaction issue Mark Dilger
reported.� The first one is that an isolation test in which we expect
to see a cross-process NOTIFY immediately after a COMMIT turns out to
be not very stable: on my machine, it works as long as you're just
running the isolation tests by themselves, but it usually falls over
if I'm running check-world with any amount of parallelism.� The reason
for this seems to be that incoming notifies are only checked for when
we're about to wait for client input.� At that point we've already
sent the ReadyForQuery ('Z') protocol message, which will cause libpq
to stand down from looking for more input and return a null from
PQgetResult().� Depending on timing, the following Notify protocol
messages might arrive quickly enough that isolationtester.c sees them
before it goes off to do something else, but that's not very reliable.Thanks for working on this, Tom.
I have finished reading and applying your three patches and have moved
on to testing them.� I hope to finish the review soon.
After applying all three patches, the stress test that originally
uncovered the assert in predicate.c no longer triggers any asserts.
`check-world` passes. The code and comments look good.
Your patches are ready for commit.
--
Mark Dilger
Mark Dilger <hornschnorter@gmail.com> writes:
On 11/23/19 8:50 PM, Mark Dilger wrote:
I have finished reading and applying your three patches and have moved
on to testing them. I hope to finish the review soon.
After applying all three patches, the stress test that originally
uncovered the assert in predicate.c no longer triggers any asserts.
`check-world` passes. The code and comments look good.
Thanks for reviewing!
After sleeping on it, I'm not really happy with what I did in
PrepareTransaction (that is, invent a separate PrePrepare_Notify
function). The idea was to keep that looking parallel to what
CommitTransaction does, and preserve infrastructure against the
day that somebody gets motivated to allow LISTEN or NOTIFY in
a prepared transaction. But on second thought, what would surely
happen when that feature gets added is just that AtPrepare_Notify
would serialize the pending LISTEN/NOTIFY actions into the 2PC
state file. There wouldn't be any need for PrePrepare_Notify,
so there's no point in introducing that. I'll just move the
comment saying that nothing has to happen at that point for NOTIFY.
regards, tom lane
On 11/24/19 10:39 AM, Tom Lane wrote:
Mark Dilger <hornschnorter@gmail.com> writes:
On 11/23/19 8:50 PM, Mark Dilger wrote:
I have finished reading and applying your three patches and have moved
on to testing them. I hope to finish the review soon.After applying all three patches, the stress test that originally
uncovered the assert in predicate.c no longer triggers any asserts.
`check-world` passes. The code and comments look good.Thanks for reviewing!
After sleeping on it, I'm not really happy with what I did in
PrepareTransaction (that is, invent a separate PrePrepare_Notify
function). The idea was to keep that looking parallel to what
CommitTransaction does, and preserve infrastructure against the
day that somebody gets motivated to allow LISTEN or NOTIFY in
a prepared transaction. But on second thought, what would surely
happen when that feature gets added is just that AtPrepare_Notify
would serialize the pending LISTEN/NOTIFY actions into the 2PC
state file. There wouldn't be any need for PrePrepare_Notify,
so there's no point in introducing that. I'll just move the
comment saying that nothing has to happen at that point for NOTIFY.
I looked at that. I thought it was an interesting decision to
factor out that error to its own function while leaving a
similar error inline just a little below in xact.c:
if ((MyXactFlags & XACT_FLAGS_ACCESSEDTEMPNAMESPACE))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE a transaction that has operated
on temporary objects")));
I assumed you had factored it out in anticipation of supporting notify
here in the future. If you want to backtrack that decision and leave it
inline, you would still keep the test rather than just a comment, right?
It sounds like you intend to let AtPrepare_Notify catch the problem
later, but since that's just an Assert and not an ereport(ERROR), that
provides less error checking for non-assert builds.
--
Mark Dilger
Mark Dilger <hornschnorter@gmail.com> writes:
On 11/24/19 10:39 AM, Tom Lane wrote:
After sleeping on it, I'm not really happy with what I did in
PrepareTransaction (that is, invent a separate PrePrepare_Notify
function). The idea was to keep that looking parallel to what
CommitTransaction does, and preserve infrastructure against the
day that somebody gets motivated to allow LISTEN or NOTIFY in
a prepared transaction. But on second thought, what would surely
happen when that feature gets added is just that AtPrepare_Notify
would serialize the pending LISTEN/NOTIFY actions into the 2PC
state file. There wouldn't be any need for PrePrepare_Notify,
so there's no point in introducing that. I'll just move the
comment saying that nothing has to happen at that point for NOTIFY.
I assumed you had factored it out in anticipation of supporting notify
here in the future. If you want to backtrack that decision and leave it
inline, you would still keep the test rather than just a comment, right?
No, there wouldn't be any error condition; that's just needed because the
feature isn't implemented yet. So I'll leave that alone; the only thing
that needs to happen now in the PREPARE code path is to adjust the one
comment.
regards, tom lane
On 11/24/19 11:04 AM, Tom Lane wrote:
Mark Dilger <hornschnorter@gmail.com> writes:
On 11/24/19 10:39 AM, Tom Lane wrote:
After sleeping on it, I'm not really happy with what I did in
PrepareTransaction (that is, invent a separate PrePrepare_Notify
function). The idea was to keep that looking parallel to what
CommitTransaction does, and preserve infrastructure against the
day that somebody gets motivated to allow LISTEN or NOTIFY in
a prepared transaction. But on second thought, what would surely
happen when that feature gets added is just that AtPrepare_Notify
would serialize the pending LISTEN/NOTIFY actions into the 2PC
state file. There wouldn't be any need for PrePrepare_Notify,
so there's no point in introducing that. I'll just move the
comment saying that nothing has to happen at that point for NOTIFY.I assumed you had factored it out in anticipation of supporting notify
here in the future. If you want to backtrack that decision and leave it
inline, you would still keep the test rather than just a comment, right?No, there wouldn't be any error condition; that's just needed because the
feature isn't implemented yet. So I'll leave that alone; the only thing
that needs to happen now in the PREPARE code path is to adjust the one
comment.
Ok.
--
Mark Dilger