NOTIFY does not work as expected
PostgreSQL 9.6.9, 10.4 (broken):
A: listen test;
A: select pg_sleep(5);
1
2
B: notify test, 'test1';
3
4
5
A: done
6
7
8
9
B: notify test, 'test2';
A:
* notification received:
server process id: 2837
channel: test
payload: test1
* notification received:
server process id: 2837
channel: test
payload: test2
PostgreSQL 9.6.2 and earlier (works as expected)
A: listen test;
A: select pg_sleep(5);
1
2
B: notify test, 'test1';
3
4
5
A: done
A:
* notification received:
server process id: 2837
channel: test
payload: test1
6
7
8
9
B: notify test, 'test2';
A:
* notification received:
server process id: 2837
channel: test
payload: test2
----------------
* This is not a client-side problem - I've verified it via Wireshark.
** I caught the bug on Postgres Pro Std builds, but since those builds are
mostly identical to vanilla PostgreSQL, the bug affects both.
On Mon, Jul 2, 2018 at 4:33 PM, Andrey <parihaaraka@gmail.com> wrote:
I don't think this is a bug. I don't see that the docs promise one
behavior over the other, so it is really a dealer's choice. Also, I can't
reliably reproduce the reported 9.6.2 behavior on my own 9.6.2 server.
Cheers,
Jeff
man: "... if a listening session receives a notification signal while it is
within a transaction, the notification event will not be delivered to its
connected client until _just after_ the transaction is completed (either
committed or aborted)."
Expected behavior is for the notification to be delivered as soon as
pg_sleep finishes. As it stands, a client may hold an open connection
(idle, listening on the socket) for a long time, then close it without the
notification ever being delivered (if the backend was busy when it was
notified).
All previously known delivery caveats concerned queue overflow, merged
notifications, and transaction handling. If we must not rely on delivery
at all, then NOTIFY makes no sense.
On Tue, Jul 3, 2018 at 3:37, Jeff Janes <jeff.janes@gmail.com> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
On Mon, Jul 2, 2018 at 4:33 PM, Andrey <parihaaraka@gmail.com> wrote:
[ delayed receipt of notifications ]
I don't think this is a bug. I don't see that the docs promise one
behavior over the other, so it is really a dealer's choice. Also, I can't
reliably reproduce the reported 9.6.2 behavior on my own 9.6.2 server.
FWIW, it looks like a bug to me. I don't have time to investigate
further right now, though. It's also not at all clear whether the
issue is in the server or libpq or psql ...
regards, tom lane
On Tue, Jul 3, 2018 at 1:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
On Mon, Jul 2, 2018 at 4:33 PM, Andrey <parihaaraka@gmail.com> wrote:
[ delayed receipt of notifications ]
I don't think this is a bug. I don't see that the docs promise one
behavior over the other, so it is really a dealer's choice. Also, I can't
reliably reproduce the reported 9.6.2 behavior on my own 9.6.2 server.
FWIW, it looks like a bug to me. I don't have time to investigate
further right now, though. It's also not at all clear whether the
issue is in the server or libpq or psql ...
In my hands, it bisects down to this:
commit 4f85fde8eb860f263384fffdca660e16e77c7f76
Author: Andres Freund <andres@anarazel.de>
Date: Tue Feb 3 22:25:20 2015 +0100
Introduce and use infrastructure for interrupt processing during client
reads.
But that was committed in 9.5.dev, not between 9.6.2 and 9.6.9.
It is on the server side. This is testing with psql, and it doesn't seem
to matter which version of it. Maybe there is something between 9.6.2 and
9.6.9 that shows up with another client or workload.
Cheers,
Jeff
On Tue, Jul 3, 2018 at 12:30 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Further diagnosis here is that in the "working" case the client receives a
single packet from the server containing both the pg_sleep response, and
async response, in that order, and the client processes both of them. In
the "broken" case, the client receives a single packet from the server
containing the pg_sleep response, and processes it, and then blocks on user
input. The async response is immediately available in the next packet if
the client would ask for it, but the client doesn't do so.
If I am diagnosing the right problem, this still doesn't seem like a bug to
me.
Andrey, what is it that you saw in Wireshark?
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
Further diagnosis here is that in the "working" case the client receives a
single packet from the server containing both the pg_sleep response, and
async response, in that order, and the client processes both of them. In
the "broken" case, the client receives a single packet from the server
containing the pg_sleep response, and processes it, and then blocks on user
input. The async response is immediately available in the next packet if
the client would ask for it, but the client doesn't do so.
This suggests that 4f85fde8e introduced an extra output-flush operation
into the code path, ie it must be flushing the output buffer to the client
after ReadyForQuery and then again after emitting the Notify.
If I am diagnosing the right problem, this still doesn't seem like a bug to
me.
Well, it seems undesirable to me, both because it implies network traffic
inefficiency and because clients don't seem to be expecting it.
We have another recent complaint that seems to be possibly the
same thing, bug #15255.
regards, tom lane
On 2018-07-03 14:27:04 -0400, Tom Lane wrote:
This suggests that 4f85fde8e introduced an extra output-flush operation
into the code path, ie it must be flushing the output buffer to the client
after ReadyForQuery and then again after emitting the Notify.
Hm. There's indeed a
/*
* Must flush the notify messages to ensure frontend gets them promptly.
*/
pq_flush();
in ProcessIncomingNotify(). But that was there before, too. And I don't
see any argument why it'd be a good idea to remove it?
If I am diagnosing the right problem, this still doesn't seem like a bug to
me.
Well, it seems undesirable to me, both because it implies network traffic
inefficiency and because clients don't seem to be expecting it.
A report after ~3 years doesn't strike me as a huge argument for that,
and it doesn't seem crazy to believe it'd hurt some users changing
that. And when would you avoid flushing?
We have another recent complaint that seems to be possibly the
same thing, bug #15255.
That seems more related to the logical replication apply code than
anything?
Greetings,
Andres Freund
On Tue, Jul 3, 2018 at 12:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Further diagnosis here is that in the "working" case the client receives a
single packet from the server containing both the pg_sleep response, and
async response, in that order, and the client processes both of them. In
the "broken" case, the client receives a single packet from the server
containing the pg_sleep response, and processes it, and then blocks on user
input. The async response is immediately available in the next packet if
the client would ask for it, but the client doesn't do so.
It looks like I was wrong here. The 2nd packet with the async message is
not generally sent immediately; the server backend can hold it up until the
next time it either hears from the frontend, or until it gets a SIGUSR1 due
to another incoming NOTIFY.
But I still see this undesired behavior first appearing in 9.5dev at the
mentioned commit, with no change between 9.6.2 and 9.6.9.
Cheers,
Jeff
On Tue, Jul 3, 2018 at 9:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Andrey confirmed to me off list that he was mistaken about it working the
way he expected in 9.6.2, the issue started earlier.
Reading through the comments touched by the commit, it seems obvious what
the bug is. It says "cause the processing to occur just before we next go
idle", but also says "This is called just *after* waiting for a frontend
command", which is too late to be "before we next go idle".
I've attached a simple proof-of-concept patch which
calls ProcessClientReadInterrupt() before the client read is started so it
can process pre-existing interrupt flags, not just when it is interrupted
during the read or when the read is done. (That does seem to render the
name of the function inapt).
I don't know if this is the correct way to fix it, it is just a POC. It
looks like cache invalidation catch-up demands might also be involved in
the delay.
With this patch, the async response arrives right after the pg_sleep
response, but it is still in a second packet. That means it doesn't fix the
apparent regression when viewed with psql, as psql doesn't process the 2nd
packet right away. So I instead tested it with this Perl script:
perl -le 'use DBI;
  my $dbh = DBI->connect("dbi:Pg:port=9999;host=localhost");
  $dbh->do("listen test");
  $dbh->do("select pg_sleep(5)");
  warn "done with sleep";
  while (1) { my $r = $dbh->pg_notifies(); warn join "\t", @$r if $r }'
Without my patch, notifications sent during the pg_sleep will not get
delivered until another notification is sent after the sleep is over. With
the patch, they get delivered as soon as the pg_sleep is done.
Cheers
Jeff
Attachments:
fix_notify.patch (application/octet-stream, +5 −1)
Hi,
On 2018-07-04 08:50:12 -0400, Jeff Janes wrote:
Reading through the comments touched by the commit, it seems obvious what
the bug is. It says "cause the processing to occur just before we next go
idle", but also says "This is called just *after* waiting for a frontend
command", which is too late to be "before we next go idle"
I've not looked at this issue in depth yet. So I might be completely off
base. But I'm confused by your comment - we're doing it *after*,
because we do a non-blocking read. And the latch will notify us
(event.events & WL_LATCH_SET) if there was a read.
Are you arguing that we should also notify in cases where we actually
never become idle? I'm not sure that's particularly meaningful, given
there's no guarantee that that actually works, because we could just
have read multiple commands from the client?
Greetings,
Andres Freund
On Wed, Jul 4, 2018 at 2:30 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-07-04 08:50:12 -0400, Jeff Janes wrote:
Reading through the comments touched by the commit, it seems obvious what
the bug is. It says "cause the processing to occur just before we next go
idle", but also says "This is called just *after* waiting for a frontend
command", which is too late to be "before we next go idle".
I've not looked at this issue in depth yet. So I might be completely off
base. But I'm confused by your comment - we're doing it *after*,
because we do a non-blocking read. And the latch will notify us
(event.events & WL_LATCH_SET) if there was a read.
We get a signal while we are busy. We set the flag and the latch. The
latch wakes us up, but since we are busy (in a transaction, specifically)
we don't do anything. Later, the transaction ends and we are about to go
idle, but no one checks the flag again. We start a read using the latch
mechanism, but the flag notifyInterruptPending, set by a now-long-gone
signal, is not on the list of things the latch wakes us for. It is
technically a non-blocking read, but from the perspective of the pending
notify message it is a blocking read, unless another signal comes along
and rescues us.
Cheers,
Jeff
On Thu, Jul 5, 2018 at 1:11, Jeff Janes <jeff.janes@gmail.com> wrote:
Hello. I beg your pardon, but the problem is still present in 10.5. May we
expect it to be fixed in 11?
Thanks.
Regards,
Andrey L
Andrey <parihaaraka@gmail.com> writes:
Hello. I beg your pardon, but the problem is still in 10.5. May we expect
it to be fixed in 11?
Nope :-(. However, I got around to looking at this problem, and I concur
with Jeff's diagnosis: the code around ProcessClientReadInterrupt is
buggy because it does not account for the possibility that the process
latch was cleared some time ago while unhandled interrupt-pending flags
remain set. There are some other issues too:
1. ProcessClientWriteInterrupt has the same problem.
2. I don't believe the "blocked" vs "not-blocked" distinction one bit.
At best, it creates race-condition-like changes in behavior depending
on exactly when a signal arrives vs when data arrives or is sent.
At worst, I think it creates the same problem it's purporting to solve,
ie failure to respond to ProcDiePending at all. I think the
before/during/after calls to ProcessClientXXXInterrupt should just all
behave the same and always be willing to execute ProcDiePending.
3. We've got bugs on the client side too. The documentation is pretty
clear that libpq users ought to call PQconsumeInput before PQnotifies,
but psql had not read the manual at all. Also, most callers were
calling PQconsumeInput only once and then looping on PQnotifies, which
assumes not-very-safely that we could only see at most one TCP packet
worth of notify messages at a time. That's even less safe now that
we have "payload" strings than it was before. So we ought to adjust
the code and documentation to recommend doing another PQconsumeInput
inside the loop. (Congratulations to dblink for getting this right.)
In short, I think we need something like the attached. With these
patches, psql consistently reports the notification promptly (for
me anyway).
regards, tom lane
On 2018-10-18 18:39:34 -0400, Tom Lane wrote:
Nope :-(. However, I got around to looking at this problem, and I concur
with Jeff's diagnosis: the code around ProcessClientReadInterrupt is
buggy because it does not account for the possibility that the process
latch was cleared some time ago while unhandled interrupt-pending flags
remain set. There are some other issues too:
1. ProcessClientWriteInterrupt has the same problem.
2. I don't believe the "blocked" vs "not-blocked" distinction one bit.
At best, it creates race-condition-like changes in behavior depending
on exactly when a signal arrives vs when data arrives or is sent.
At worst, I think it creates the same problem it's purporting to solve,
ie failure to respond to ProcDiePending at all. I think the
before/during/after calls to ProcessClientXXXInterrupt should just all
behave the same and always be willing to execute ProcDiePending.
That distinction was introduced because people (IIRC you actually) were
worried that we'd be less likely to get error messages out to the
client. Especially when you check unconditionally before actually doing
the write, it's going to be far less likely that we are able to send
something out to the client.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2018-10-18 18:39:34 -0400, Tom Lane wrote:
2. I don't believe the "blocked" vs "not-blocked" distinction one bit.
At best, it creates race-condition-like changes in behavior depending
on exactly when a signal arrives vs when data arrives or is sent.
At worst, I think it creates the same problem it's purporting to solve,
ie failure to respond to ProcDiePending at all. I think the
before/during/after calls to ProcessClientXXXInterrupt should just all
behave the same and always be willing to execute ProcDiePending.
That distinction was introduced because people (IIRC you actually) were
worried that we'd be less likely to get error messages out to the
client. Especially when you check unconditionally before actually doing
the write, it's going to be far less likely that we are able to send
something out to the client.
Far less likely than what? If we got a ProcDie signal we'd more often
than not have handled it long before reaching here. If we hadn't, though,
we could arrive here with ProcDiePending set but the latch clear, in which
case we fail to honor the interrupt until the client does something.
Don't really think that's acceptable :-(. I'm also not seeing why it's
okay to commit ProcDie hara-kiri immediately if the socket is
write-blocked but not otherwise --- the two cases are going to look about
the same from the client side.
regards, tom lane
I wrote:
I spent some more time thinking about this. It's possible to fix the bug
while preserving the current behavior for ProcDiePending if we adjust the
ProcessClientXXXInterrupt code to set the process latch in the cases where
ProcDiePending is true but we're not waiting for the socket to come ready.
See attached variant of 0001.
I am not, however, convinced that this is better rather than just more
complicated.
If we're willing to accept a ProcDie interrupt during secure_read at all,
I don't see why not to do it even if we got some data. We'll accept the
interrupt anyway the next time something happens to do
CHECK_FOR_INTERRUPTS; and it's unlikely that that would not be till after
we'd completed the query, so the net effect is just going to be that we
waste some cycles first.
Likewise, I see little merit in holding off ProcDie during secure_write.
If we could try to postpone the interrupt until a message boundary, so as
to avoid losing protocol sync, there would be value in that --- but this
code is at the wrong level to implement any such behavior, and it's
not trying to. So we still have the situation that the interrupt service
is being postponed without any identifiable gain in predictability of
behavior.
In short, I don't think that the existing logic here does anything useful
to meet the concern I had, and so I wouldn't mind throwing it away.
regards, tom lane
PS: this patch extends the ProcessClientXXXInterrupt API to distinguish
before/during/after calls. As written, there are only two behaviors
so we could've stuck with the bool API, but I thought this was clearer.
Attachments:
0001-server-notify-fix-2.patch (text/x-diff, +70 −43)
On 2018-10-19 13:36:31 -0400, Tom Lane wrote:
Far less likely than what? If we got a ProcDie signal we'd more often
than not have handled it long before reaching here. If we hadn't, though,
we could arrive here with ProcDiePending set but the latch clear, in which
case we fail to honor the interrupt until the client does something.
Don't really think that's acceptable :-(. I'm also not seeing why it's
okay to commit ProcDie hara-kiri immediately if the socket is
write-blocked but not otherwise --- the two cases are going to look about
the same from the client side.
The reason we ended up there is that, before the change that made it
behave like that, a backend trying to write something to a client that
isn't accepting any writes (hung, forgotten, intentional DoS, whatnot)
became unkillable as soon as the network buffers filled up.
But if we always check for procDiePending, even when not blocked, we'd
be able to send out an error message in fewer cases.
There's no problem with being unkillable when writing out data without
blocking, because it's going to take pretty finite time to copy a few
bytes into the kernel buffers.
Obviously it's not perfect to not be able to send a message in cases a
backend is killed while blocked in network, but there's not really a way
around that.
You can pretty easily trigger these cases and observe the difference by
doing something like COPY TO STDOUT of a large table, SIGSTOP the psql,
attach strace to the backend, and then terminate the backend. Without
checking interrupts while blocked the backend doesn't get terminated.
If we're willing to accept a ProcDie interrupt during secure_read at all,
I don't see why not to do it even if we got some data. We'll accept the
interrupt anyway the next time something happens to do
CHECK_FOR_INTERRUPTS; and it's unlikely that that would not be till after
we'd completed the query, so the net effect is just going to be that we
waste some cycles first.
I don't immediately see a problem with changing this for reads.
Likewise, I see little merit in holding off ProcDie during secure_write.
If we could try to postpone the interrupt until a message boundary, so as
to avoid losing protocol sync, there would be value in that --- but this
code is at the wrong level to implement any such behavior, and it's
not trying to. So we still have the situation that the interrupt service
is being postponed without any identifiable gain in predictability of
behavior.
See earlier explanation.
Greetings,
Andres Freund
Hi,
On 2018-10-19 13:45:42 -0700, Andres Freund wrote:
On 2018-10-19 13:36:31 -0400, Tom Lane wrote:
If we're willing to accept a ProcDie interrupt during secure_read at all,
I don't see why not to do it even if we got some data. We'll accept the
interrupt anyway the next time something happens to do
CHECK_FOR_INTERRUPTS; and it's unlikely that that would not be till after
we'd completed the query, so the net effect is just going to be that we
waste some cycles first.
I don't immediately see a problem with changing this for reads.
One argument against changing it, although not a very strong one, is
that processing a proc die even when non-blocking prevents us from
processing commands like a client's X/terminate even if we already have
the necessary input.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2018-10-19 13:45:42 -0700, Andres Freund wrote:
I don't immediately see a problem with changing this for reads.
One argument against changing it, although not a very strong one, is
that processing a proc die even when non-blocking prevents us from
processing commands like a client's X/terminate even if we already have
the necessary input.
I'm pretty skeptical of these arguments, as they depend on assumptions
that there are no CHECK_FOR_INTERRUPTS calls anywhere in the relevant
code paths outside be-secure.c. Even if that's true today, it doesn't
seem like something to depend on.
However, there's definitely merit in the idea that we shouldn't change
the ProcDie behavior if we don't have to in order to fix the NOTIFY
bug --- especially since I'd like to backpatch this. So if you're
happy with the revised patch, I can go with that one.
regards, tom lane