Streaming replication and non-blocking I/O
I find the backend libpq changes related to non-blocking I/O quite
complex. Can we find a simpler solution?
The problem we're trying to solve is that while the walsender backend
sends a lot of WAL records to the client, the client can send a lot of
messages to the backend. If volume of the messages from client to server
exceeds both the input buffer in the server and the output buffer in the
client, the client will block until the server has read some data. But
if the client is blocked, it will not process incoming data from the
server, and eventually the server will block too. And we have a
deadlock. This:
http://florin.bjdean.id.au/docs/omnimark/omni55/docs/html/concept/717.htm
is a pretty good description of the problem.
The first question is: do we really need to be prepared for that? The
XLogRecPtr acknowledgment messages the client sends are very small, and
if the client is mindful about not sending them too often - perhaps max
1 ack per 1 received XLOG message - the receive buffer in the backend
should never fill up in practice.
If that's deemed not good enough, we could modify just internal_flush()
so that it uses secure_poll to wait for the possibility to either read
or write, instead of blocking for just write. Whenever there's incoming
data, read them into PqRecvBuffer for later processing, which keeps the
OS input buffer from filling up. If PqRecvBuffer fills up, it can be
extended, or we can start dropping old XLogRecPtr messages from it.
In any case, we'll need something like pq_wait to check if a message can
be read without blocking, but that's just a small additional function as
opposed to a whole new API for assembling and sending messages without
blocking.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 8, 2009 at 11:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
The first question is: do we really need to be prepared for that? The
XLogRecPtr acknowledgment messages the client sends are very small, and
if the client is mindful about not sending them too often - perhaps max
1 ack per 1 received XLOG message - the receive buffer in the backend
should never fill up in practice.
It's OK to drop that feature.
If that's deemed not good enough, we could modify just internal_flush()
so that it uses secure_poll to wait for the possibility to either read
or write, instead of blocking for just write. Whenever there's incoming
data, read them into PqRecvBuffer for later processing, which keeps the
OS input buffer from filling up. If PqRecvBuffer fills up, it can be
extended, or we can start dropping old XLogRecPtr messages from it.
Extending PqRecvBuffer seems better because XLogRecPtr message
has some types (i.e., we cannot just drop old message without parsing
all messages in the buffer).
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
On Tue, Dec 8, 2009 at 11:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:If that's deemed not good enough, we could modify just internal_flush()
so that it uses secure_poll to wait for the possibility to either read
or write, instead of blocking for just write. Whenever there's incoming
data, read them into PqRecvBuffer for later processing, which keeps the
OS input buffer from filling up. If PqRecvBuffer fills up, it can be
extended, or we can start dropping old XLogRecPtr messages from it.Extending PqRecvBuffer seems better because XLogRecPtr message
has some types (i.e., we cannot just drop old message without parsing
all messages in the buffer).
True. Another idea I had was to introduce a callback that backend libpq
can call when the buffer fills. Walsender would set the callback to
ProcessStreamMsgs().
But if everyone is happy with just relying on the OS buffer to not fill
up, let's just drop it.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 9, 2009 at 3:58 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
True. Another idea I had was to introduce a callback that backend libpq
can call when the buffer fills. Walsender would set the callback to
ProcessStreamMsgs().But if everyone is happy with just relying on the OS buffer to not fill
up, let's just drop it.
The OS buffer is expected to be able to store a large number of
XLogRecPtr messages, because its size is small. So it's also OK
to just drop it.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
On Wed, Dec 9, 2009 at 3:58 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:But if everyone is happy with just relying on the OS buffer to not fill
up, let's just drop it.
The OS buffer is expected to be able to store a large number of
XLogRecPtr messages, because its size is small. So it's also OK
to just drop it.
It certainly seems to be something we could improve later, when and
if evidence emerges that it's a real-world problem. For now,
simple is beautiful.
regards, tom lane
On Thu, Dec 10, 2009 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The OS buffer is expected to be able to store a large number of
XLogRecPtr messages, because its size is small. So it's also OK
to just drop it.It certainly seems to be something we could improve later, when and
if evidence emerges that it's a real-world problem. For now,
simple is beautiful.
I just dropped the backend libpq changes related to non-blocking I/O.
git://git.postgresql.org/git/users/fujii/postgres.git
branch: replication
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
On Thu, Dec 10, 2009 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The OS buffer is expected to be able to store a large number of
XLogRecPtr messages, because its size is small. So it's also OK
to just drop it.It certainly seems to be something we could improve later, when and
if evidence emerges that it's a real-world problem. For now,
simple is beautiful.I just dropped the backend libpq changes related to non-blocking I/O.
git://git.postgresql.org/git/users/fujii/postgres.git
branch: replication
Thanks, much simpler now.
Changing the finish_time argument to pqWaitTimed into timeout_ms changes
the behavior connect_timeout option to PQconnectdb. It should wait for
max connect_timeout seconds in total, but now it is waiting for
connect_timeout seconds at each step in the connection process: opening
a socket, authenticating etc.
Could we change the API of PQgetXLogData to be more like PQgetCopyData?
I'm thinking of removing the timeout argument, and instead looping with
select/poll and PQconsumeInput in the caller. That probably means
introducing a new state analogous to PGASYNC_COPY_IN. I haven't thought
this fully through yet, but it seems like it would be good to have a
consistent API.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Changing the finish_time argument to pqWaitTimed into timeout_ms changes
the behavior connect_timeout option to PQconnectdb. It should wait for
max connect_timeout seconds in total, but now it is waiting for
connect_timeout seconds at each step in the connection process: opening
a socket, authenticating etc.
Refresh my memory as to why this patch is touching any of that code at
all?
regards, tom lane
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Changing the finish_time argument to pqWaitTimed into timeout_ms changes
the behavior connect_timeout option to PQconnectdb. It should wait for
max connect_timeout seconds in total, but now it is waiting for
connect_timeout seconds at each step in the connection process: opening
a socket, authenticating etc.Refresh my memory as to why this patch is touching any of that code at
all?
Walreceiver wants to wait for data to arrive from the master or a
signal. PQgetXLogData(), which is the libpq function to read a piece of
WAL, takes a timeout argument to support that. Walreceiver calls
PQgetXLogData() in an endless loop, checking for a received sighup or
death of postmaster at every iteration.
In the synchronous replication mode, I presume it's also going to listen
for a signal from the startup process, so that it can send a
acknowledgment to the master as soon as a COMMIT record has been
replayed that a backend on the master is waiting for.
To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed
to take a timeout instead of finishing_time argument. Which is a mistake
because it breaks PQconnectdb, and as I said I don't think
PQgetXLogData(9 should have a timeout argument to begin with. Instead,
it should have a boolean 'async' argument to return immediately if
there's no data, and walreceiver main loop should call poll()/select()
to wait. Ie. just like PQgetCopyData() works.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, Dec 13, 2009 at 5:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Walreceiver wants to wait for data to arrive from the master or a
signal. PQgetXLogData(), which is the libpq function to read a piece of
WAL, takes a timeout argument to support that. Walreceiver calls
PQgetXLogData() in an endless loop, checking for a received sighup or
death of postmaster at every iteration.In the synchronous replication mode, I presume it's also going to listen
for a signal from the startup process, so that it can send a
acknowledgment to the master as soon as a COMMIT record has been
replayed that a backend on the master is waiting for.
Right.
To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed
to take a timeout instead of finishing_time argument. Which is a mistake
because it breaks PQconnectdb, and as I said I don't think
PQgetXLogData(9 should have a timeout argument to begin with. Instead,
it should have a boolean 'async' argument to return immediately if
there's no data, and walreceiver main loop should call poll()/select()
to wait. Ie. just like PQgetCopyData() works.
Seems good. I'll revise the code.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
On Sun, Dec 13, 2009 at 5:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:To implement the timeout in PQgetXLogData(), pqWaitTimed() was changed
to take a timeout instead of finishing_time argument. Which is a mistake
because it breaks PQconnectdb, and as I said I don't think
PQgetXLogData(9 should have a timeout argument to begin with. Instead,
it should have a boolean 'async' argument to return immediately if
there's no data, and walreceiver main loop should call poll()/select()
to wait. Ie. just like PQgetCopyData() works.
Seems good. I'll revise the code.
Do we need a new "PQgetXLogData" function at all? Seems like you could
shove the data through the COPY protocol and not have to touch libpq
at all, rather than duplicating a nontrivial amount of code there.
regards, tom lane
On Mon, Dec 14, 2009 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do we need a new "PQgetXLogData" function at all? Seems like you could
shove the data through the COPY protocol and not have to touch libpq
at all, rather than duplicating a nontrivial amount of code there.
Yeah, I also think that all data (the WAL data itself, its LSN and
the flag bits) which the "PQgetXLogData" handles could be shoved
through the COPY protocol. But, outside libpq, it's somewhat messy
to extract the LSN and the flag bits from the data buffer which
"PQgetCopyData" returns, by using ntohs(). So I provided the new
libpq function only for replication. That is, I didn't want to expose
the low layer of network which libpq should handle.
I think that the friendly function would be useful to implement
the standby program (e.g., a stand-alone walreceiver tool) outside
the core.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, Dec 12, 2009 at 5:09 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Could we change the API of PQgetXLogData to be more like PQgetCopyData?
I'm thinking of removing the timeout argument, and instead looping with
select/poll and PQconsumeInput in the caller. That probably means
introducing a new state analogous to PGASYNC_COPY_IN. I haven't thought
this fully through yet, but it seems like it would be good to have a
consistent API.
On a related issue, so far I haven't considered about the way to output
the notice message at all :( In the current SR, it's always written to
stderr by the defaultNoticeProcessor by using fprintf, whether the
log_destination is specified or not. This is bizarre, and would need to
be fixed.
I'm going to set the new function calling ereport as the current notice
processor by using PQsetNoticeProcessor. But the problem is that only the
completed message like "NOTICE: xxx" is passed to such notice processor,
i.e., the error level itself is not passed.
So I wonder which error level should be used to output the notice message.
There are some approaches to address this;
1. Always use a specific level without regard to the actual one
2. Reverse-engineer the level from the complete message
3. Change some libpq functions so as to pass the error level to the notice
processor
But nothing really stands out. Do you have another good idea?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
On Mon, Dec 14, 2009 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do we need a new "PQgetXLogData" function at all? �Seems like you could
shove the data through the COPY protocol and not have to touch libpq
at all, rather than duplicating a nontrivial amount of code there.
Yeah, I also think that all data (the WAL data itself, its LSN and
the flag bits) which the "PQgetXLogData" handles could be shoved
through the COPY protocol. But, outside libpq, it's somewhat messy
to extract the LSN and the flag bits from the data buffer which
"PQgetCopyData" returns, by using ntohs(). So I provided the new
libpq function only for replication. That is, I didn't want to expose
the low layer of network which libpq should handle.
I find that a completely unconvincing division of labor. Who is to say
that the LSN is the only part of the data that needs special treatment?
The very, very large practical problem with this is that if you decide
to change the behavior at any time, the only way to be sure that the WAL
receiver is using the right libpq version is to perform a soname major
version bump. The transformations done by libpq will essentially become
part of its ABI, and not a very visible part at that.
I am going to insist that no such logic be placed in libpq. From a
packager's standpoint that's insanity.
regards, tom lane
Fujii Masao <masao.fujii@gmail.com> writes:
I'm going to set the new function calling ereport as the current notice
processor by using PQsetNoticeProcessor. But the problem is that only the
completed message like "NOTICE: xxx" is passed to such notice processor,
i.e., the error level itself is not passed.
Use PQsetNoticeReceiver. The other one is just there for backwards
compatibility.
regards, tom lane
Tom Lane wrote:
The very, very large practical problem with this is that if you decide
to change the behavior at any time, the only way to be sure that the WAL
receiver is using the right libpq version is to perform a soname major
version bump. The transformations done by libpq will essentially become
part of its ABI, and not a very visible part at that.
Not having to change the libpq API would certainly be a big advantage.
It's going to be a bit more complicated in walsender/walreceiver to work
with the libpq COPY API. We're going to need a WAL sending/receiving
protocol on top of it, defined in terms of rows and columns passed
through the COPY protocol.
One problem is the the standby is supposed to send back acknowledgments
to the master, telling it how far it has received/replayed the WAL. Is
there any way to send information back to the server, while a COPY OUT
is in progress? That's not absolutely necessary with asynchronous
replication, but will be with synchronous.
One idea is to stop/start the COPY between every batch of WAL records
sent, giving the client (= walreceiver) a chance to send messages back.
But that will lead to extra round trips.
BTW, something that's been bothering me a bit with this patch is that we
now have to link the backend with libpq. I don't see an immediate
problem with that, but I'm not a packager. Does anyone see a problem
with that?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
It's going to be a bit more complicated in walsender/walreceiver to work
with the libpq COPY API. We're going to need a WAL sending/receiving
protocol on top of it, defined in terms of rows and columns passed
through the COPY protocol.
AFAIR, libpq knows essentially nothing of the data being passed through
COPY --- it just treats that as a byte stream. I think you can define
any data format you want, it doesn't need to look exactly like a COPY
of a table would. In fact it's probably a lot better if it DOESN'T
look like COPY data once it gets past libpq, so that you can check
that it is WAL and not COPY data.
One problem is the the standby is supposed to send back acknowledgments
to the master, telling it how far it has received/replayed the WAL. Is
there any way to send information back to the server, while a COPY OUT
is in progress? That's not absolutely necessary with asynchronous
replication, but will be with synchronous.
Well, a real COPY would of course not stop to look for incoming
messages, but I don't think that's inherent in the protocol. You
would likely need some libpq adjustments so it didn't throw error
when you tried that, but it would be a small and one-time adjustment.
BTW, something that's been bothering me a bit with this patch is that we
now have to link the backend with libpq. I don't see an immediate
problem with that, but I'm not a packager. Does anyone see a problem
with that?
Yeah, I have a problem with that. What's the backend doing with libpq?
It's not receiving this data, it's sending it.
regards, tom lane
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
BTW, something that's been bothering me a bit with this patch is that we
now have to link the backend with libpq. I don't see an immediate
problem with that, but I'm not a packager. Does anyone see a problem
with that?Yeah, I have a problem with that. What's the backend doing with libpq?
It's not receiving this data, it's sending it.
walreceiver is a postmaster subprocess too.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
Yeah, I have a problem with that. What's the backend doing with libpq?
It's not receiving this data, it's sending it.
walreceiver is a postmaster subprocess too.
Hm. Perhaps it should be a loadable plugin and not hard-linked into the
backend? Compare dblink.
The main concern I have with hard-linking libpq is that it has a lot of
symbol conflicts with the backend --- and at least the ones from
src/port/ aren't easily removed. I foresee problems that will be very
difficult to fix on platforms where we can't filter the set of link
symbols exposed by libpq. Linking a thread-enabled libpq into the
backend could also create problems on some platforms --- it would likely
cause a thread-capable libc to get linked, which is not what we want.
regards, tom lane
On Tue, Dec 15, 2009 at 4:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm. Perhaps it should be a loadable plugin and not hard-linked into the
backend? Compare dblink.
You mean that such plugin is supplied in shared_preload_libraries,
a new process is forked and the shared-memory related to walreceiver
is created by using shmem_startup_hook? Since this approach would
solve the problem discussed previously, ISTM this makes sense.
http://archives.postgresql.org/pgsql-hackers/2009-11/msg00031.php
Some additional code might be required to control the termination
of walreceiver.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center