Keepalive for max_standby_delay
Patch adds a keepalive message to ensure max_standby_delay is useful.
No WAL format changes, no libpq changes. Just an additional message type
for the streaming replication protocol, sent once per main loop in
WALsender. Plus docs.
Comments?
--
Simon Riggs www.2ndQuadrant.com
Attachments:
hs_keepalive.patch (text/x-patch; charset=UTF-8) [+97/-5]
Simon Riggs <simon@2ndQuadrant.com> writes:
Patch adds a keepalive message to ensure max_standby_delay is useful.
The proposed placement of the keepalive-send is about the worst it could
possibly be. It needs to be done right before pq_flush to ensure
minimum transfer delay. Otherwise any attempt to measure clock skew
using the timestamp will be seriously off. Where you've placed it
guarantees maximum delay not minimum.
I'm also extremely dubious that it's a good idea to set
recoveryLastXTime from this. Using both that and the timestamps from
the wal log flies in the face of everything I remember about control
theory. We should be doing only one or only the other, I think.
regards, tom lane
On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Patch adds a keepalive message to ensure max_standby_delay is useful.
The proposed placement of the keepalive-send is about the worst it could
possibly be. It needs to be done right before pq_flush to ensure
minimum transfer delay. Otherwise any attempt to measure clock skew
using the timestamp will be seriously off. Where you've placed it
guarantees maximum delay not minimum.
I don't understand. WalSndKeepAlive() contains a pq_flush() immediately
after the timestamp is set. I did it that way for exactly the same reasons
you've said.
Do you mean you only want to see one pq_flush()? I used two so that the
actual data is never delayed by a keepalive. WAL Sender was going to
sleep anyway, so shouldn't be a problem.
I'm also extremely dubious that it's a good idea to set
recoveryLastXTime from this. Using both that and the timestamps from
the wal log flies in the face of everything I remember about control
theory. We should be doing only one or only the other, I think.
I can change it so that the recoveryLastXTime will not be updated if we
are using the value from the keepalives. So we have one, or the other.
Remember that replication can switch backwards and forwards between
modes, so it seems sensible to have a common timing value whichever mode
we're in.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
I'm also extremely dubious that it's a good idea to set
recoveryLastXTime from this. Using both that and the timestamps from
the wal log flies in the face of everything I remember about control
theory. We should be doing only one or only the other, I think.
I can change it so that the recoveryLastXTime will not be updated if we
are using the value from the keepalives. So we have one, or the other.
Remember that replication can switch backwards and forwards between
modes, so it seems sensible to have a common timing value whichever mode
we're in.
That means that recoveryLastXTime can jump forwards and backwards.
Doesn't feel right to me either. If you want to expose the
keepalive-time to queries, it should be a separate field, something like
lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
read it.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
I'm also extremely dubious that it's a good idea to set
recoveryLastXTime from this. Using both that and the timestamps from
the wal log flies in the face of everything I remember about control
theory. We should be doing only one or only the other, I think.
I can change it so that the recoveryLastXTime will not be updated if we
are using the value from the keepalives. So we have one, or the other.
Remember that replication can switch backwards and forwards between
modes, so it seems sensible to have a common timing value whichever mode
we're in.
That means that recoveryLastXTime can jump forwards and backwards.
That behaviour would be bad, so why not just prevent that from
happening?
Doesn't feel right to me either. If you want to expose the
keepalive-time to queries, it should be a separate field, something like
lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
read it.
That wouldn't be good because then you couldn't easily monitor the
delay? You'd have to run two different functions depending on the state
of replication (for which we would need yet another function). Users
would just wrap that back up into a single function.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
Doesn't feel right to me either. If you want to expose the
keepalive-time to queries, it should be a separate field, something like
lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
read it.
That wouldn't be good because then you couldn't easily monitor the
delay? You'd have to run two different functions depending on the state
of replication (for which we would need yet another function). Users
would just wrap that back up into a single function.
What exactly is the user trying to monitor? If it's "how far behind is
the standby", the difference between pg_current_xlog_insert_location()
in the master and pg_last_xlog_replay_location() in the standby seems
more robust and well-defined to me. It's a measure of XLOG location (ie.
bytes) instead of time, but time is a complicated concept.
Also note that as the patch stands, if you receive a keep-alive from the
master at point X, it doesn't mean that the standby is fully up-to-date.
It's possible that the walsender just finished sending a huge batch of
accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent.
During that time, a lot more WAL has accumulated, yet walsender sends a
keep-alive with the current timestamp.
In general, the purpose of a keep-alive is to keep the connection alive,
but you're trying to accomplish something else too, and I don't fully
understand what it is.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sat, 2010-05-15 at 20:05 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
Doesn't feel right to me either. If you want to expose the
keepalive-time to queries, it should be a separate field, something like
lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
read it.
That wouldn't be good because then you couldn't easily monitor the
delay? You'd have to run two different functions depending on the state
of replication (for which we would need yet another function). Users
would just wrap that back up into a single function.
What exactly is the user trying to monitor? If it's "how far behind is
the standby", the difference between pg_current_xlog_insert_location()
in the master and pg_last_xlog_replay_location() in the standby seems
more robust and well-defined to me. It's a measure of XLOG location (ie.
bytes) instead of time, but time is a complicated concept.
Maybe, but it's meaningful to users and that is the point.
Also note that as the patch stands, if you receive a keep-alive from the
master at point X, it doesn't mean that the standby is fully up-to-date.
It's possible that the walsender just finished sending a huge batch of
accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent.
During that time, a lot more WAL has accumulated, yet walsender sends a
keep-alive with the current timestamp.
Not at all. The timestamp for the keepalive is calculated after the
pq_flush for the main WAL data. So even if it takes 10 years to send the
next blob of WAL data, the timestamp will be current.
However, a point you made in an earlier thread is still true here. It
sounds like it would be best if startup process didn't re-access shared
memory for the next location until it had fully replayed all the WAL up
to the point it last accessed. That way the changing value of the shared
timestamp would have no effect on the calculated value at any point in
time. I will recode using that concept.
In general, the purpose of a keep-alive is to keep the connection alive,
but you're trying to accomplish something else too, and I don't fully
understand what it is.
That surprises me. If nothing else, it's in the title of the thread,
though since you personally added this to the Hot Standby todo more than
6 months ago I'd hope you of all people would understand the purpose.
--
Simon Riggs www.2ndQuadrant.com
On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:
I will recode using that concept.
There's some behaviours that aren't helpful here:
Startup gets new pointer when it runs out of data to replay. That might
or might not include an updated keepalive timestamp, since there's no
exact relationship between chunks sent and chunks received. Startup
might ask for a new chunk when half a chunk has been received, or when
multiple chunks have been received.
WALSender doesn't chunk up what it sends, though libpq does at a lower
level. So there's no way to make a keepalive happen every X bytes
without doing this from within libpq.
WALSender sleeps even when it might have more WAL to send; it doesn't
check, it just unconditionally sleeps. At least WALReceiver loops until
it has no more to receive. I just can't imagine why that's useful
behaviour.
All in all, I think we should be calling this "burst replication", not
streaming replication. That does cause an issue in trying to monitor
what's going on, because there's so much jitter.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
WALSender sleeps even when it might have more WAL to send, it doesn't
check it just unconditionally sleeps. At least WALReceiver loops until
it has no more to receive. I just can't imagine why that's useful
behaviour.
Good catch. That should be fixed.
I also note that walsender doesn't respond to signals, while it's
sending a large batch. That's analogous to the issue that was addressed
recently in the archiver process.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
Simon Riggs wrote:
WALSender sleeps even when it might have more WAL to send, it doesn't
check it just unconditionally sleeps. At least WALReceiver loops until
it has no more to receive. I just can't imagine why that's useful
behaviour.
Good catch. That should be fixed.
I also note that walsender doesn't respond to signals, while it's
sending a large batch. That's analogous to the issue that was addressed
recently in the archiver process.
Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.
Comments, have I screwed up anything?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
dont-sleep-if-wal-to-send.patch (text/x-diff) [+166/-157]
On Sun, 2010-05-16 at 00:05 +0300, Heikki Linnakangas wrote:
Heikki Linnakangas wrote:
Simon Riggs wrote:
WALSender sleeps even when it might have more WAL to send, it doesn't
check it just unconditionally sleeps. At least WALReceiver loops until
it has no more to receive. I just can't imagine why that's useful
behaviour.
Good catch. That should be fixed.
I also note that walsender doesn't respond to signals, while it's
sending a large batch. That's analogous to the issue that was addressed
recently in the archiver process.
Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.
Comments
8MB at a time still seems like a large batch to me.
libpq is going to send it in smaller chunks anyway, so I don't see the
importance of trying to keep the batch too large. It just introduces
delay into the sending process. We should be sending chunks that match
libpq better.
--
Simon Riggs www.2ndQuadrant.com
On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote:
On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:
I will recode using that concept.
Startup gets new pointer when it runs out of data to replay. That might
or might not include an updated keepalive timestamp, since there's no
exact relationship between chunks sent and chunks received. Startup
might ask for a new chunk when half a chunk has been received, or when
multiple chunks have been received.
New version, with some other cleanup of wait processing.
New logic is that when Startup asks for next applychunk of WAL it saves
the lastChunkTimestamp. That is then the base time used by
WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later.
Since multiple receivechunks can arrive from primary before Startup asks
for next applychunk we use the oldest receivechunk timestamp, not the
latest. Doing it this way means the lastChunkTimestamp doesn't change
when new keepalives arrive, so we have a stable and reasonably accurate
recordSendTimestamp for each WAL record.
The keepalive is sent as the first part of a new message, if any. So
partial chunks of data always have an accurate timestamp, even if that
is slightly older as a result. Doesn't make much difference except with
very large chunks.
I think that addresses the points raised on this thread and others.
--
Simon Riggs www.2ndQuadrant.com
Attachments:
hs_keepalive.v2.patch (text/x-patch; charset=UTF-8) [+221/-22]
On Sun, May 16, 2010 at 6:05 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Heikki Linnakangas wrote:
Simon Riggs wrote:
WALSender sleeps even when it might have more WAL to send, it doesn't
check it just unconditionally sleeps. At least WALReceiver loops until
it has no more to receive. I just can't imagine why that's useful
behaviour.
Good catch. That should be fixed.
I also note that walsender doesn't respond to signals, while it's
sending a large batch. That's analogous to the issue that was addressed
recently in the archiver process.
Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.
Comments
Doesn't the main loop also need to respond quickly to the 'X' message
from the standby?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, May 17, 2010 at 1:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote:
On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:
I will recode using that concept.
Startup gets new pointer when it runs out of data to replay. That might
or might not include an updated keepalive timestamp, since there's no
exact relationship between chunks sent and chunks received. Startup
might ask for a new chunk when half a chunk has been received, or when
multiple chunks have been received.
New version, with some other cleanup of wait processing.
New logic is that when Startup asks for next applychunk of WAL it saves
the lastChunkTimestamp. That is then the base time used by
WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later.
Since multiple receivechunks can arrive from primary before Startup asks
for next applychunk we use the oldest receivechunk timestamp, not the
latest. Doing it this way means the lastChunkTimestamp doesn't change
when new keepalives arrive, so we have a stable and reasonably accurate
recordSendTimestamp for each WAL record.
The keepalive is sent as the first part of a new message, if any. So
partial chunks of data always have an accurate timestamp, even if that
is slightly older as a result. Doesn't make much difference except with
very large chunks.
I think that addresses the points raised on this thread and others.
Is it OK that this keepalive message cannot be used by HS in file-based
log-shipping? Even in SR, the startup process cannot use the keepalive
until walreceiver has been started up.
WalSndKeepAlive() always calls initStringInfo(), which seems to cause
a memory leak.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, 2010-05-17 at 11:51 +0900, Fujii Masao wrote:
Is it OK that this keepalive message cannot be used by HS in file-based
log-shipping? Even in SR, the startup process cannot use the keepalive
until walreceiver has been started up.
Yes, I see those things.
We already have archive_timeout to handle the keepalive case in
file-based.
When starting up the delay is high anyway, so accuracy doesn't really
matter - though we do use latestXLogTime in that case.
WalSndKeepAlive() always calls initStringInfo(), which seems to cause
a memory leak.
Thanks.
--
Simon Riggs www.2ndQuadrant.com
On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.
Comments
8MB at a time still seems like a large batch to me.
libpq is going to send it in smaller chunks anyway, so I don't see the
importance of trying to keep the batch too large. It just introduces
delay into the sending process. We should be sending chunks that match
libpq better.
More to the point, the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE
because it will send partial pages.
Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently
stands.
--
Simon Riggs www.2ndQuadrant.com
On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
What exactly is the user trying to monitor? If it's "how far behind is
the standby", the difference between pg_current_xlog_insert_location()
in the master and pg_last_xlog_replay_location() in the standby seems
more robust and well-defined to me. It's a measure of XLOG location (ie.
bytes) instead of time, but time is a complicated concept.
I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 17/05/10 04:40, Simon Riggs wrote:
On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.
Comments
8MB at a time still seems like a large batch to me.
libpq is going to send it in smaller chunks anyway, so I don't see the
importance of trying to keep the batch too large. It just introduces
delay into the sending process. We should be sending chunks that match
libpq better.
More to the point, the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE
because it will send partial pages.
I don't see a failure. We rely on not splitting WAL records across
messages, but we're talking about libpq-level CopyData messages, not TCP
messages.
Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently
stands.
Well, it does affect the size of the read() in walsender, and I'm sure
there's some overhead in setting the ps display and the other small
stuff we do once per message. But you're probably right that we could
easily make MAX_SEND_SIZE much smaller with no noticeable effect on
performance, while making walsender more responsive to signals. I'll
decrease it to, say, 512 kB.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 17/05/10 12:36, Jim Nasby wrote:
On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
What exactly is the user trying to monitor? If it's "how far behind is
the standby", the difference between pg_current_xlog_insert_location()
in the master and pg_last_xlog_replay_location() in the standby seems
more robust and well-defined to me. It's a measure of XLOG location (ie.
bytes) instead of time, but time is a complicated concept.
I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so.
No doubt about that, the problem is that it's hard to provide a reliable
time-based indication.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, 2010-05-18 at 17:06 -0400, Heikki Linnakangas wrote:
On 17/05/10 04:40, Simon Riggs wrote:
On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.
Comments
8MB at a time still seems like a large batch to me.
libpq is going to send it in smaller chunks anyway, so I don't see the
importance of trying to keep the batch too large. It just introduces
delay into the sending process. We should be sending chunks that match
libpq better.
More to the point, the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE
because it will send partial pages.
I don't see a failure. We rely on not splitting WAL records across
messages, but we're talking about libpq-level CopyData messages, not TCP
messages.
OK
Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently
stands.
Well, it does affect the size of the read() in walsender, and I'm sure
there's some overhead in setting the ps display and the other small
stuff we do once per message. But you're probably right that we could
easily make MAX_SEND_SIZE much smaller with no noticeable effect on
performance, while making walsender more responsive to signals. I'll
decrease it to, say, 512 kB.
I'm pretty certain we don't need to set the ps display once per message.
ps doesn't need an update 5 times per second on average.
There's no reason that the buffer size we use for XLogRead() should be
the same as the send buffer, if you're worried about that. My point is
that pq_putmessage contains internal flushes so at the libpq level you
gain nothing by big batches. The read() will be buffered anyway with
readahead so not sure what the issue is. We'll have to do this for sync
rep anyway, so what's the big deal? Just do it now, once. Do we really
want 9.1 code to differ here?
--
Simon Riggs www.2ndQuadrant.com