BUG #14830: Missed NOTIFications, PostgreSQL 9.1.24
The following bug has been logged on the website:
Bug reference: 14830
Logged by: Marko Tiikkaja
Email address: marko@joh.to
PostgreSQL version: Unsupported/Unknown
Operating system: Ubuntu 14.04
Description:
Hey,
I understand this is not much information to go on (but the problem is
extremely difficult to reproduce), and that 9.1 is technically out of
support (but I don't think the relevant code has changed significantly,
either), so I fully expect that nobody will be able to figure out what's
wrong based on that. But I thought I'd post anyway.
For the past two days I've been tracking down a bug where it would appear
that some NOTIFications are simply lost. Then a minute later when the
notification is resent by a different transaction, it comes through just
fine. We have a single program connected to the database all the time,
which LISTENs on around 800 channels and delivers the notifications to its
own clients. The problem seems to only start happening, or perhaps get
worse, the longer this application stays connected to the database.
I'm attaching two excerpts from the strace which, if I'm reading this
correctly, would suggest that there's a bug in postgres here. Here's how I
read this:
1) In strace2.txt, the send on line #1 corresponds to 28:3486 in
strace1.txt. I know this because notification payloads on that channel are
unique.
2) In strace2.txt, on line #5 something slightly out of the ordinary
happens. We have around 75 semop calls compared to 5400 semop calls in the
full strace, so no biggie, but perhaps noteworthy. Contention with another
backend, perhaps.
3) The send on line #6 seems to correspond to 28:3600 in strace1.txt.
4) Then here's where the problem seems to occur: the next send, on line
25, corresponds to 28:4458 in strace1.txt.
Within that ~850 bytes that the sending backend seemingly jumped over, we
have multiple notifications on channels we know the backend was listening
on. That's including a notification on channel "workerid48101842", which is
the one our application was desperately missing in this case. PostgreSQL's
logs and the state of the database indicate that at least the transaction
which wrote the "workerid48101842" notification committed, and I have no
reason to believe that any of the other ones near it did not commit.
So.. any ideas? Unfortunately I can't reproduce this in an isolated
environment, and in production this seems to be taking some time before it
builds up into a proper issue.
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Wed, Sep 27, 2017 at 3:29 AM, <marko@joh.to> wrote:
So.. any ideas? Unfortunately I can't reproduce this in an isolated
environment, and in production this seems to be taking some time before it
builds up into a proper issue.
Hm. Could it be some side-effect from 2bbe8a68? This has been
backpatched on all branches and it is part of 9.1.24.
--
Michael
On Tue, Oct 3, 2017 at 5:00 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Wed, Sep 27, 2017 at 3:29 AM, <marko@joh.to> wrote:
So.. any ideas? Unfortunately I can't reproduce this in an isolated
environment, and in production this seems to be taking some time before it
builds up into a proper issue.
Hm. Could it be some side-effect from 2bbe8a68? This has been
backpatched on all branches and it is part of 9.1.24.
Unsure. I can at least reproduce this with only one session ever listening
on anything.
I've produced a test case (https://github.com/johto/notify-test) for this
which matches roughly what the code does in production. The code isn't the
prettiest out there, but basically it has one session LISTENing on a
channel, and 24 sessions sending messages with a prefix in order, so for example:
session 1 sends A_1, A_2, A_3, etc.
session 2 sends B_1, B_2, B_3, ...
and the listener has a map recording what the last received number is for
each prefix, checking that all notifications are received and in the right
order.
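For illustration, the per-prefix ordering check the listener performs can be
sketched like this (a Python sketch of the logic only; the actual harness at
the URL above is separate code, and `check_notification` is a hypothetical
name):

```python
def check_notification(last_seen, payload):
    """Check one NOTIFY payload such as "Q_97882353" against the last
    number seen for its prefix. Returns an error string describing the
    gap when a notification was skipped, else None."""
    prefix, num_str = payload.split("_")
    num = int(num_str)
    prev = last_seen.get(prefix)
    last_seen[prefix] = num
    if prev is not None and num != prev + 1:
        return ("out of order notification %s: %d != %d + 1 (prefix %s)"
                % (payload, num, prev, prefix))
    return None
```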
After running it for a few days I start getting logged messages such as:
out of order notification Q_97882353: 97882353 != 97882349 + 1 (prefix Q)
out of order notification F_97947433: 97947433 != 97947429 + 1 (prefix F)
out of order notification F_97947439: 97947439 != 97947436 + 1 (prefix F)
I did it on both 9.1.24 and 9.6.5 and they both exhibit the same behavior:
it takes days to get into this state, but then notifications are missed all
the time. I currently have both systems in this state, so any idea what to
look at to try and debug this further?
.m
Marko Tiikkaja <marko@joh.to> writes:
After running it for a few days I start getting logged messages such as:
out of order notification Q_97882353: 97882353 != 97882349 + 1 (prefix Q)
out of order notification F_97947433: 97947433 != 97947429 + 1 (prefix F)
out of order notification F_97947439: 97947439 != 97947436 + 1 (prefix F)
I did it on both 9.1.24 and 9.6.5 and they both exhibit the same behavior:
it takes days to get into this state, but then notifications are missed all
the time. I currently have both systems in this state, so any idea what to
look at to try and debug this further?
You might try gdb'ing the recipient and stepping through
asyncQueueProcessPageEntries to see what happens. Are the missing
entries present in the queue but it decides to ignore them for some
reason, or are they just not there?
An interesting black-box test might be to do this with two receiver
processes and see if they miss identical sets of messages. That
would be a different way of triangulating on question number 1,
which is whether the sender or the recipient is at fault.
I wonder whether the long ramp-up time indicates that you have to
wrap around some counter somewhere before things go south. Although
the only obvious candidate is wrapping the pg_notify SLRU queue,
and I'd think that would have happened many times already.
regards, tom lane
So I managed to accidentally kill and/or restart both servers while trying
to install debug symbols, but I'm doing a new run now and I noticed
something interesting: the listening backend's RecentXmin doesn't seem to
ever go forward. By my reading of this code, that would mean trouble for
this piece of code in TransactionIdIsInProgress:
if (TransactionIdPrecedes(xid, RecentXmin))
return false;
Is this not an obvious bug? async.c would conclude the transaction is not
running and did not commit, and would throw away the notification. That
would also explain why it takes days for this to happen (since we need to
run through enough xids to wrap around).
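To illustrate the wraparound hazard: TransactionIdPrecedes compares normal
xids in circular 32-bit space (in the C source it is essentially
`(int32) (id1 - id2) < 0`, with extra special-casing of permanent xids that
is omitted here). A Python rendering shows how a stale RecentXmin flips
after roughly 2^31 xids are consumed:

```python
def transaction_id_precedes(id1, id2):
    # PostgreSQL-style circular comparison: id1 "precedes" id2 when the
    # signed 32-bit difference (id1 - id2) is negative.
    diff = (id1 - id2) & 0xFFFFFFFF
    return diff >= 0x80000000

# Normally a stale RecentXmin is harmless: newer xids don't precede it.
stale_recent_xmin = 1000
assert not transaction_id_precedes(1001, stale_recent_xmin)

# But once ~2^31 xids have been consumed, a brand-new xid appears to
# "precede" the stale RecentXmin, so TransactionIdIsInProgress would
# report the still-running transaction as finished.
wrapped_xid = (stale_recent_xmin + 2**31 + 5) % 2**32
assert transaction_id_precedes(wrapped_xid, stale_recent_xmin)
```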
.m
Marko Tiikkaja <marko@joh.to> writes:
So I managed to accidentally kill and/or restart both servers while trying
to install debug symbols, but I'm doing a new run now and I noticed
something interesting: the listening backend's RecentXmin doesn't seem to
ever go forward. By my reading of this code, that would mean trouble for
this piece of code in TransactionIdIsInProgress:
if (TransactionIdPrecedes(xid, RecentXmin))
return false;
Hmm ... I suppose it's possible that that happens if the listening
backend isn't executing any SQL commands but is just sitting.
While that might describe your test harness, does it describe any
real-world application?
regards, tom lane
Tom Lane wrote:
Marko Tiikkaja <marko@joh.to> writes:
So I managed to accidentally kill and/or restart both servers while trying
to install debug symbols, but I'm doing a new run now and I noticed
something interesting: the listening backend's RecentXmin doesn't seem to
ever go forward. By my reading of this code, that would mean trouble for
this piece of code in TransactionIdIsInProgress:
if (TransactionIdPrecedes(xid, RecentXmin))
return false;
Hmm ... I suppose it's possible that that happens if the listening
backend isn't executing any SQL commands but is just sitting.
While that might describe your test harness, does it describe any
real-world application?
I think it's not totally unreasonable to have processes sitting idle for
long periods of time. One example: a pooler configured to have more
connections than are actually needed most of the time (I'm fairly sure
I've seen this). Would they not recompute RecentXmin if they did a
sinval reset? Also, a listener daemon for which notifications are very
infrequent.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 10, 2017 at 3:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marko Tiikkaja <marko@joh.to> writes:
So I managed to accidentally kill and/or restart both servers while
trying
to install debug symbols, but I'm doing a new run now and I noticed
something interesting: the listening backend's RecentXmin doesn't seem to
ever go forward. By my reading of this code, that would mean trouble for
this piece of code in TransactionIdIsInProgress:
if (TransactionIdPrecedes(xid, RecentXmin))
return false;
Hmm ... I suppose it's possible that that happens if the listening
backend isn't executing any SQL commands but is just sitting.
While that might describe your test harness, does it describe any
real-world application?
Yes. It's a service which only does LISTEN and UNLISTEN and forwards
notifications to its clients.
.m
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Tom Lane wrote:
Hmm ... I suppose it's possible that that happens if the listening
backend isn't executing any SQL commands but is just sitting.
While that might describe your test harness, does it describe any
real-world application?
I think it's not totally unreasonable to have processes sitting idle for
long periods of time.
Sure, but if they're truly idle, they aren't looking at RecentXmin ;-)
Would they not recompute RecentXmin if they did a
sinval reset?
Not sure, but if not, maybe making that code path update it would be
a suitable fix in practice? I'm a bit hesitant to put a GetSnapshotData
call into ProcessIncomingNotify, because of the possibility of adding a
lot of contention on the ProcArray if there's very heavy notify traffic.
regards, tom lane
I wrote:
... I'm a bit hesitant to put a GetSnapshotData
call into ProcessIncomingNotify, because of the possibility of adding a
lot of contention on the ProcArray if there's very heavy notify traffic.
Wait, I have an idea. Let's fetch a snapshot at the top of
ProcessIncomingNotify, and then use the snapshot to decide whether
xids are running, instead of calling TransactionIdIsInProgress.
GetSnapshotData is probably a bit heavier than a single
TransactionIdIsInProgress call, but if we have to process multiple
messages then this should be a net savings in ProcArray traffic.
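The snapshot-based test amounts to asking whether an xid is visible as
in-progress in the snapshot rather than consulting the ProcArray per xid.
Roughly (a simplified Python sketch in the spirit of XidInMVCCSnapshot;
`xid_in_progress` is a hypothetical name, and circular xid arithmetic and
subtransactions are ignored for clarity):

```python
def xid_in_progress(xid, snapshot):
    """snapshot is (xmin, xmax, xip), where xip is the set of in-progress
    xids captured once by a GetSnapshotData-style call."""
    xmin, xmax, xip = snapshot
    if xid < xmin:       # completed before the snapshot was taken
        return False
    if xid >= xmax:      # started after the snapshot was taken
        return True
    return xid in xip    # otherwise, consult the in-progress list
```

The point is that the (relatively expensive, lock-taking) snapshot is taken
once per batch of notify messages, after which each per-message check is a
cheap local lookup.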
regards, tom lane
I wrote:
Wait, I have an idea. Let's fetch a snapshot at the top of
ProcessIncomingNotify, and then use the snapshot to decide whether
xids are running, instead of calling TransactionIdIsInProgress.
Concretely, I suggest the attached patch. I tried Marko's testbed
against this, and it seems indeed a bit faster than before ---
running it for 100000 notifies takes about 19.25 seconds, rather than
19.5 seconds with HEAD. But the testbed is probably about the best
case because it has a bunch of threads sending notifies to one
receiver, so that the receiver is quite likely to have multiple
messages to read at a time. OTOH, in a situation where the NOTIFY
traffic isn't so high, it probably hardly matters anyway.
regards, tom lane
Attachments:
fix-notify-xact-testing.patch (text/x-diff, +36 -28)
I wrote:
Concretely, I suggest the attached patch. I tried Marko's testbed
against this, and it seems indeed a bit faster than before ---
running it for 100000 notifies takes about 19.25 seconds, rather than
19.5 seconds with HEAD. But the testbed is probably about the best
case because it has a bunch of threads sending notifies to one
receiver, so that the receiver is quite likely to have multiple
messages to read at a time. OTOH, in a situation where the NOTIFY
traffic isn't so high, it probably hardly matters anyway.
It occurred to me that it's easy to measure the worst case, ie always
one message per notify interrupt:
$ cat bnch.txt
LISTEN foo\; NOTIFY foo;
$ pgbench -n -c 1 -T 10 -f bnch.txt
On this case it seems that the patch is circa 2% slower than HEAD,
though that's only barely above the noise level in my measurements.
I think that's probably acceptable.
regards, tom lane
I wrote:
It occurred to me that it's easy to measure the worst case, ie always
one message per notify interrupt:
$ cat bnch.txt
LISTEN foo\; NOTIFY foo;
$ pgbench -n -c 1 -T 10 -f bnch.txt
On this case it seems that the patch is circa 2% slower than HEAD,
though that's only barely above the noise level in my measurements.
On looking at the patch again, I realized that it's not really necessary
to check TransactionIdIsCurrentTransactionId in the per-message loop.
Although it is possible to run the message-scanning code in a transaction
that's executed NOTIFY (if it also executed the session's first LISTEN),
at that point we will not yet have pushed the transaction's messages
into the queue. Therefore there is no case where the loop can see
messages queued by its own transaction.
Although TransactionIdIsCurrentTransactionId isn't really all that
expensive, removing it brings the above test case to a statistical
tie: median of 3 60-second runs is 26054 tps in HEAD or 26034 tps
with patch. (The cross-run variation is a couple hundred tps.)
So I'm now satisfied with the attached form of the patch, and
will proceed to make back-ported versions.
regards, tom lane
Attachments:
fix-notify-xact-testing-2.patch (text/x-diff, +42 -34)
On Wed, Oct 11, 2017 at 7:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So I'm now satisfied with the attached form of the patch, and
will proceed to make back-ported versions.
Thank you very much! I'll let the test case run against patched master for
a few days just to make sure we didn't miss anything obvious.
.m