max_standby_delay considered harmful
I've finally wrapped my head around exactly what the max_standby_delay
code is doing, and I'm not happy with it. The way that code is designed
is that the setting represents a maximum allowed difference between the
standby server's system clock and the commit timestamps it is reading
from the WAL log; whenever this difference exceeds the setting, we'll
kill standby queries in hopes of catching up faster. Now, I can see
the attraction of defining it that way, for certain use-cases.
However, I think it is too fragile and too badly implemented to be
usable in the real world; and it certainly can't be the default
operating mode. There are three really fundamental problems with it:
1. The timestamps we are reading from the log might be historical,
if we are replaying from archive rather than reading a live SR stream.
In the current implementation that means zero grace period for standby
queries. Now if your only interest is catching up as fast as possible,
that could be a sane behavior, but this is clearly not the only possible
interest --- in fact, if that's all you care about, why did you allow
standby queries at all?
2. There could be clock skew between the master and slave servers.
If the master's clock is a minute or so ahead of the slave's, again we
get into a situation where standby queries have zero grace period, even
though killing them won't do a darn thing to permit catchup. If the
master is behind the slave then we have an artificially inflated grace
period, which is going to slow down the slave.
3. There could be significant propagation delay from master to slave,
if the WAL stream is being transmitted with pg_standby or some such.
Again this results in cutting into the standby queries' grace period,
for no defensible reason.
In addition to these fundamental problems there's a fatal implementation
problem: the actual comparison is not to the master's current clock
reading, but to the latest commit, abort, or checkpoint timestamp read
from the WAL. Thus, if the last commit was more than max_standby_delay
seconds ago, zero grace time. Now if the master is really idle then
there aren't going to be any conflicts anyway, but what if it's running
only long-running queries? Or what happens when it was idle for a while
and then starts new queries? Zero grace period, that's what.
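A minimal sketch of the comparison described above may make the zero-grace cases easier to see. It is not the actual recovery code: the function name and the sample numbers are assumptions for illustration, and the only logic taken from the message is "standby clock minus latest WAL timestamp, compared against max_standby_delay".

```python
def remaining_grace(standby_clock: float, last_wal_timestamp: float,
                    max_standby_delay: float) -> float:
    """Seconds a conflicting standby query may keep running (hypothetical model)."""
    # Apparent lag: standby's system clock minus the latest commit, abort,
    # or checkpoint timestamp read from WAL.
    apparent_lag = standby_clock - last_wal_timestamp
    # Once the apparent lag reaches the setting, no grace time is left.
    return max(0.0, max_standby_delay - apparent_lag)


now = 1_000_000.0  # arbitrary standby clock reading, in seconds

# Live stream, last commit one second old: nearly the full grace period.
live = remaining_grace(now, now - 1.0, 30.0)

# Replaying from archive: timestamps are an hour old, so zero grace.
archive = remaining_grace(now, now - 3600.0, 30.0)

# Master had no commits for 45 seconds: zero grace again.
idle = remaining_grace(now, now - 45.0, 30.0)

print(live, archive, idle)
```

In this model the grace period depends entirely on how old the last timestamp looks to the standby, not on how long replay has actually been blocked.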
We could possibly improve matters for the SR case by having walsender
transmit the master's current clock reading every so often (probably
once per activity cycle), outside the WAL stream proper. The receiver
could subtract off its own clock reading in order to measure the skew,
and then we could cancel queries if the de-skewed transmission time
falls too far behind. However this doesn't do anything to fix the cases
where we aren't reading (and caught up to) a live SR broadcast.
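The walsender idea can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes network transit time is small enough to be folded into the measured skew.

```python
def measure_skew(master_clock_sent: float, standby_clock_at_receipt: float) -> float:
    """Estimate master-minus-standby clock skew from one clock message.

    Assumption: transit time is negligible, so it ends up inside the estimate.
    """
    return master_clock_sent - standby_clock_at_receipt


def deskewed_lag(standby_clock: float, wal_timestamp: float, skew: float) -> float:
    """Lag of a WAL record, with the standby clock translated to master time."""
    return (standby_clock + skew) - wal_timestamp


def should_cancel(standby_clock: float, wal_timestamp: float, skew: float,
                  max_standby_delay: float) -> bool:
    """Cancel conflicting queries only if the de-skewed lag exceeds the limit."""
    return deskewed_lag(standby_clock, wal_timestamp, skew) > max_standby_delay


# Master's clock runs 60 seconds ahead of the standby's.
skew = measure_skew(master_clock_sent=1060.0, standby_clock_at_receipt=1000.0)

# A record stamped 1055.0 (master time) is replayed at standby time 1002.0.
naive = 1002.0 - 1055.0                    # meaningless without de-skewing
corrected = deskewed_lag(1002.0, 1055.0, skew)
print(skew, naive, corrected)
```

As the message notes, this only helps while the standby is caught up to a live stream; it says nothing useful about archive replay.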
I'm inclined to think that we should throw away all this logic and just
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock. This is simple,
understandable, and behaves the same whether we're reading live data or
not. Putting in something that tries to maintain a closed-loop maximum
delay between master and slave seems like a topic for future research
rather than a feature we have to have in 9.0. And in any case we'd
still want the plain max delay for non-SR cases, AFAICS, because there's
no sane way to use closed-loop logic in other cases.
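As a rough model of the proposed rule (not PostgreSQL code; the function and its inputs are hypothetical), replay waits for the conflicting queries to finish, but never longer than max_standby_delay, and cancels whatever is still running at that point:

```python
from typing import List, Tuple


def replay_lock_wait(blocker_remaining: List[float],
                     max_standby_delay: float) -> Tuple[float, List[int]]:
    """Return (seconds replay waits, indexes of queries that get cancelled).

    blocker_remaining[i] is how much longer conflicting query i would run
    if left alone. Assumption: all blockers hold the lock from time zero.
    """
    # Queries that would outlive the allowed wait are the ones cancelled.
    cancelled = [i for i, left in enumerate(blocker_remaining)
                 if left > max_standby_delay]
    # Replay waits for the slowest blocker, capped at max_standby_delay.
    wait = min(max(blocker_remaining, default=0.0), max_standby_delay)
    return wait, cancelled


# One short query and one long one, with a 30 second limit.
print(replay_lock_wait([5.0, 45.0], 30.0))

# Both queries finish in time: nothing is cancelled.
print(replay_lock_wait([5.0, 10.0], 30.0))
```

The result depends only on how long replay has been blocked, so it is the same for live streaming and for archive replay.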
Comments?
regards, tom lane
On Mon, 2010-05-03 at 11:37 -0400, Tom Lane wrote:
I've finally wrapped my head around exactly what the max_standby_delay
code is doing, and I'm not happy with it.
Yes, I don't think I'd call it perfect yet.
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock. This is simple,
understandable, and behaves the same whether we're reading live data or
not.
I have no objection, and would welcome, adding another behaviour, since
that just gives us a better chance of having this feature do something
useful.
I'm inclined to think that we should throw away all this logic
HS has been through 2 Alphas with the current behaviour and it will go
through 0 Alphas with the newly proposed behaviour. At this stage of
proceedings, that is extremely dangerous and I don't wish to do that.
The likelihood that we replace it with something worse seems fairly
high/certain: snap decision making never quite considers all angles.
Phrases like "throw away all this logic" don't give me confidence that
people that agree with that perspective would understand what they are
signing up to.
Putting in something that tries to maintain a closed-loop maximum
delay between master and slave seems like a topic for future research
rather than a feature we have to have in 9.0. And in any case we'd
still want the plain max delay for non-SR cases, AFAICS, because there's
no sane way to use closed-loop logic in other cases.
I will be looking for ways to improve this over time.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Mon, 2010-05-03 at 11:37 -0400, Tom Lane wrote:
I've finally wrapped my head around exactly what the max_standby_delay
code is doing, and I'm not happy with it.
Yes, I don't think I'd call it perfect yet.
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock. This is simple,
understandable, and behaves the same whether we're reading live data or
not.
I have no objection, and would welcome, adding another behaviour, since
that just gives us a better chance of having this feature do something
useful.
I'm inclined to think that we should throw away all this logic
HS has been through 2 Alphas with the current behaviour and it will go
through 0 Alphas with the newly proposed behaviour. At this stage of
proceedings, that is extremely dangerous and I don't wish to do that.
The likelihood that we replace it with something worse seems fairly
high/certain: snap decision making never quite considers all angles.
Phrases like "throw away all this logic" don't give me confidence that
people that agree with that perspective would understand what they are
signing up to.
I'm not really sure how much serious testing outside of the small set of
people mostly interested in one or another specific aspect of HS/SR has
been actually done with the alphas to be honest.
I just started testing HS yesterday and I already ran twice into the
general issue tom is complaining about with max_standby_delay...
Stefan
On Mon, 2010-05-03 at 18:54 +0200, Stefan Kaltenbrunner wrote:
I'm not really sure how much serious testing outside of the small set of
people mostly interested in one or another specific aspect of HS/SR has
been actually done with the alphas to be honest.
I just started testing HS yesterday and I already ran twice into the
general issue tom is complaining about with max_standby_delay...
I guarantee that if that proposal goes in, people will complain about
that also. Last minute behaviour changes are bad news. I don't object to
adding something, just don't take anything away. It's not like the code
for it is pages long or anything.
The trade off is HA or queries and two modes make sense for user choice.
--
Simon Riggs www.2ndQuadrant.com
* Simon Riggs (simon@2ndQuadrant.com) wrote:
I guarantee that if that proposal goes in, people will complain about
that also. Last minute behaviour changes are bad news. I don't object to
adding something, just don't take anything away. It's not like the code
for it is pages long or anything.
I have to disagree with this. If it goes into 9.0 this way then we're
signing up to support it for *years*. With something as fragile as the
existing setup (as outlined by Tom), that's probably not a good idea.
We've not signed up to support the existing behaviour at all yet-
alphas aren't a guarantee of what we're going to release.
The trade off is HA or queries and two modes make sense for user choice.
The option isn't being thrown out, it's just being made to depend on
something which is a lot easier to measure while still being very useful
for the trade-off you're talking about. I don't really see a downside
to this, to be honest. Perhaps you could speak to the specific user
experience difference that you think there would be from this change?
+1 from me on Tom's proposal.
Thanks,
Stephen
On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm inclined to think that we should throw away all this logic and just
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock.
What if we somehow get into a situation where the replay process is
waiting for a lock over and over and over again, because it keeps
killing conflicting processes but something restarts them and they
take locks over again? It seems hard to ensure that replay will make
adequate progress with any substantially non-zero value of
max_standby_delay under this definition.
...Robert
On Mon, 2010-05-03 at 13:13 -0400, Stephen Frost wrote:
* Simon Riggs (simon@2ndQuadrant.com) wrote:
I guarantee that if that proposal goes in, people will complain about
that also. Last minute behaviour changes are bad news. I don't object to
adding something, just don't take anything away. It's not like the code
for it is pages long or anything.
I have to disagree with this. If it goes into 9.0 this way then we're
signing up to support it for *years*. With something as fragile as the
existing setup (as outlined by Tom), that's probably not a good idea.
We've not signed up to support the existing behaviour at all yet-
alphas aren't a guarantee of what we're going to release.
That's a great argument, either way. We will have to live with 9.0 for
many years and so that's why I mention having both. Make a choice either
way and we take a risk. Why?
The trade off is HA or queries and two modes make sense for user choice.
The option isn't being thrown out, it's just being made to depend on
something which is a lot easier to measure while still being very useful
for the trade-off you're talking about. I don't really see a downside
to this, to be honest. Perhaps you could speak to the specific user
experience difference that you think there would be from this change?
+1 from me on Tom's proposal.
--
Simon Riggs www.2ndQuadrant.com
On Mon, 2010-05-03 at 13:21 -0400, Robert Haas wrote:
On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm inclined to think that we should throw away all this logic and just
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock.
What if we somehow get into a situation where the replay process is
waiting for a lock over and over and over again, because it keeps
killing conflicting processes but something restarts them and they
take locks over again? It seems hard to ensure that replay will make
adequate progress with any substantially non-zero value of
max_standby_delay under this definition.
That is one argument against, and a reason why just one route is bad.
We already have more than one way, so another option is useful
--
Simon Riggs www.2ndQuadrant.com
On Mon, 2010-05-03 at 13:13 -0400, Stephen Frost wrote:
Perhaps you could speak to the specific user
experience difference that you think there would be from this change?
The difference is really to do with the weight you give to two different
considerations
* avoid query cancellations
* avoid having recovery fall behind, so that failover time is minimised
Some people recognise the trade-offs and are planning multiple standby
servers dedicated to different roles/objectives.
Some people envisage Hot Standby as a platform for running very fast
SELECTs, for which retrying the query is a reasonable possibility and
for whom keeping the standby as up-to-date as possible is an important
consideration from a data freshness perspective. Others view HS as a
weapon against long running queries.
My initial view was that the High Availability goal/role should be the
default or most likely mode of operation. I would say that the current
max_standby_delay favours the HA route since it specifically limits the
amount by which the server can fall behind.
Tom's proposed behaviour (has also been proposed before) favours the
avoid query cancellation route though could lead to huge amounts of lag.
I'm happy to have both options because I know this is a trade-off that
solution engineers want to have control of, not something we as
developers can choose ahead of time.
--
Simon Riggs www.2ndQuadrant.com
* Robert Haas (robertmhaas@gmail.com) wrote:
On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm inclined to think that we should throw away all this logic and just
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock.
What if we somehow get into a situation where the replay process is
waiting for a lock over and over and over again, because it keeps
killing conflicting processes but something restarts them and they
take locks over again? It seems hard to ensure that replay will make
adequate progress with any substantially non-zero value of
max_standby_delay under this definition.
That was my first question too- but I reread what Tom wrote and came to
a different conclusion: If the replay process waits more than
max_standby_delay to acquire a lock, then it will kill off *everything*
it runs into from that point forward, until it's done with whatever is
currently available. At that point, the 'timer' would reset back to
zero.
When/how that timer gets reset was a question I had, but I feel like
"until nothing is available" makes sense and is what I assumed Tom was
thinking.
Thanks,
Stephen
Simon,
* Simon Riggs (simon@2ndQuadrant.com) wrote:
Tom's proposed behaviour (has also been proposed before) favours the
avoid query cancellation route though could lead to huge amounts of lag.
My impression of Tom's suggestion was that it would also be a maximum
amount of delay which would be allowed before killing off queries- not
that it would be able to wait indefinitely until no one is blocking.
Based on that, I don't know that there's really much user-seen difference
between the two, except in 'oddball' situations, where there's a time
skew between the servers, or a large lag, etc, in which case I think
Tom's proposal would be more likely what's 'expected', whereas what you
would get with the existing implementation (zero time delay, or far too
much) would be a 'gotcha'..
Thanks,
Stephen
Simon,
My initial view was that the High Availability goal/role should be the
default or most likely mode of operation. I would say that the current
max_standby_delay favours the HA route since it specifically limits the
amount by which the server can fall behind.
I don't understand how Tom's approach would cause the slave to be
further behind than the current max_standby_delay code, and I can see
ways in which it would result in less delay. So, explain?
The main issue with Tom's list which struck me was that
max_standby_delay was linked to the system clock. HS is going to get
used by a lot of PG users who aren't running time sync on their servers,
or who let it get out of whack without fixing it. I'd thought that the
delay was somehow based on transaction timestamps coming from the
master. Keep in mind that there will be a *lot* of people using this
feature, including ones without competent & available sysadmins.
The lock method appeals to me simply because it would eliminate the
"mass cancel" issues which Greg Smith was reporting every time the timer
runs down. That is, it seems to me that only the oldest queries would
be cancelled and not any new ones. The biggest drawback I can see to
Tom's approach is possible blocking on the slave due to the lock wait
from the recovery process. However, this could be managed with the new
lock-waits GUC, as well as statement timeout.
Overall, I think Tom's proposal gives me what I would prefer, which is
degraded performance on the slave but in ways which users are used to,
rather than a lot of query cancel, which will interfere with user
application porting.
Would the recovery lock show up in pg_locks? That would also be a good
diagnostic tool.
I am happy to test some of this on Amazon or GoGrid, which is what I was
planning on doing anyway.
P.S. can we avoid the "considered harmful" phrase? It carries a lot of
baggage ...
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm inclined to think that we should throw away all this logic and just
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock.
What if we somehow get into a situation where the replay process is
waiting for a lock over and over and over again, because it keeps
killing conflicting processes but something restarts them and they
take locks over again?
They won't be able to take locks "over again", because the lock manager
won't allow requests to pass a pending previous request, except in
very limited circumstances that shouldn't hold here. They'll queue
up behind the replay process's lock request, not in front of it.
(If that isn't the case, it needs to be fixed, quite independently
of this concern.)
regards, tom lane
Based on that, I don't know that there's really much user-seen difference
between the two, except in 'oddball' situations, where there's a time
skew between the servers, or a large lag, etc, in which case I think
Certainly that one particular case can be solved by making the
servers be in time sync a prereq for HS working (in the traditional way).
And by "prereq" I mean a "user beware" documentation warning.
--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
On Mon, 2010-05-03 at 15:32 -0400, Stephen Frost wrote:
Simon,
* Simon Riggs (simon@2ndQuadrant.com) wrote:
Tom's proposed behaviour (has also been proposed before) favours the
avoid query cancellation route though could lead to huge amounts of lag.
My impression of Tom's suggestion was that it would also be a maximum
amount of delay which would be allowed before killing off queries- not
that it would be able to wait indefinitely until no one is blocking.
Based on that, I don't know that there's really much user-seen difference
between the two, except in 'oddball' situations, where there's a time
skew between the servers, or a large lag, etc, in which case I think
Tom's proposal would be more likely what's 'expected', whereas what you
would get with the existing implementation (zero time delay, or far too
much) would be a 'gotcha'..
If recovery waits for max_standby_delay every time something gets in its
way, it should be clear that if many things get in its way it will
progressively fall behind. There is no limit to this and it can always
fall further behind. It does result in fewer cancelled queries and I do
understand many may like that.
That is *significantly* different from how it works now. (Plus: If there
really was no difference, why not leave it as is?)
The bottom line is this is about conflict resolution. There is simply no
way to resolve conflicts without favouring one or other of the
protagonists. Whatever mechanism you come up with that favours one will
disfavour the other. I'm happy to give choices, but I'm not happy to
force just one kind of conflict resolution.
--
Simon Riggs www.2ndQuadrant.com
On Mon, 2010-05-03 at 15:39 -0400, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm inclined to think that we should throw away all this logic and just
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock.
What if we somehow get into a situation where the replay process is
waiting for a lock over and over and over again, because it keeps
killing conflicting processes but something restarts them and they
take locks over again?
They won't be able to take locks "over again", because the lock manager
won't allow requests to pass a pending previous request, except in
very limited circumstances that shouldn't hold here. They'll queue
up behind the replay process's lock request, not in front of it.
(If that isn't the case, it needs to be fixed, quite independently
of this concern.)
Most conflicts aren't lock-manager locks, they are snapshot conflicts,
though clearly different workloads will have different characteristics.
Some conflicts are buffer conflicts and the semantics of buffer cleanup
locks and many other internal locks are that shared locks queue jump
past exclusive lock requests. Not something we should touch, now at
least.
I understand that you aren't impressed by everything about the current
patch but rushed changes may not help either.
--
Simon Riggs www.2ndQuadrant.com
On Mon, May 3, 2010 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm inclined to think that we should throw away all this logic and just
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock.
What if we somehow get into a situation where the replay process is
waiting for a lock over and over and over again, because it keeps
killing conflicting processes but something restarts them and they
take locks over again?
They won't be able to take locks "over again", because the lock manager
won't allow requests to pass a pending previous request, except in
very limited circumstances that shouldn't hold here. They'll queue
up behind the replay process's lock request, not in front of it.
(If that isn't the case, it needs to be fixed, quite independently
of this concern.)
Well, the new backends needn't try to take "the same" locks as the
existing backends - the point is that in the worst case this proposal
means waiting max_standby_delay for EACH replay that requires taking a
lock. And that might be a LONG time.
One idea I had while thinking this over was to bound the maximum
amount of unapplied WAL rather than the absolute amount of time lag.
Now, that's a little fruity, because your WAL volume might fluctuate
considerably, so you wouldn't really know how far the slave was behind
the master chronologically. However, it would avoid all the time skew
issues, and it would also more accurately model the idea of a bound on
recovery time should we need to promote the standby to master, so
maybe it works out to a win. You could still end up stuck
semi-permanently behind, but never by more than N segments.
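The WAL-volume bound can be sketched as a simple byte comparison. This is only an illustration: the function name and the `max_segments` parameter are hypothetical, positions are modelled as plain byte offsets, and the 16 MB figure is the default WAL segment size.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size


def backlog_exceeds_limit(received_pos: int, replayed_pos: int,
                          max_segments: int) -> bool:
    """True when unapplied WAL exceeds max_segments (hypothetical setting).

    received_pos and replayed_pos are byte offsets into the WAL stream:
    how far the standby has received WAL versus how far it has applied it.
    """
    unapplied_bytes = received_pos - replayed_pos
    return unapplied_bytes > max_segments * WAL_SEGMENT_BYTES


# Exactly three segments behind with a limit of three: still tolerated.
print(backlog_exceeds_limit(3 * WAL_SEGMENT_BYTES, 0, 3))

# One byte more than three segments behind: start cancelling queries.
print(backlog_exceeds_limit(3 * WAL_SEGMENT_BYTES + 1, 0, 3))
```

No clock appears anywhere in the check, which is why it sidesteps the time-skew issues, at the cost of saying nothing about how far behind the standby is chronologically.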
Stephen's idea of a mode where we wait up to max_standby_delay for a
lock but then kill everything in our path until we've caught up again
is another possible way of approaching this problem, although it may
lead to "kill storms". Some of that may be inevitable, though: a
bound on WAL lag has the same issue - if the primary is generating WAL
faster than the standby can apply it, the standby will eventually
decide to slaughter everything in its path.
...Robert
Greg, Robert,
Certainly that one particular case can be solved by making the
servers be in time sync a prereq for HS working (in the traditional way).
And by "prereq" I mean a "user beware" documentation warning.
Last I checked, you work with *lots* of web developers and web
companies. I'm sure you can see the issue with the above.
Stephen's idea of a mode where we wait up to max_standby_delay for a
lock but then kill everything in our path until we've caught up again
is another possible way of approaching this problem, although it may
lead to "kill storms".
Personally, I thought that the kill storms were exactly what was wrong
with max_standby_delay. That is, with MSD, no matter *what* your
settings or traffic are, you're going to get query cancel occasionally.
I don't see the issue with Tom's approach from a wait perspective. The
max wait becomes 1.001X max_standby_delay; there's no way I can think of
that replay would wait longer than that. I've yet to see an explanation
why it would be longer.
Simon's assertion that not all operations take a conventional lock is a
much more serious potential flaw.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Simon Riggs wrote:
On Mon, 2010-05-03 at 13:13 -0400, Stephen Frost wrote:
Perhaps you could speak to the specific user
experience difference that you think there would be from this change?
The difference is really to do with the weight you give to two different
considerations
* avoid query cancellations
* avoid having recovery fall behind, so that failover time is minimised
Some people recognise the trade-offs and are planning multiple standby
servers dedicated to different roles/objectives.
I understand Simon's point that the two behaviors have different
benefits. However, I believe few users will be able to understand when
to use which.
As I remember, 9.0 has two behaviors:
o master delays vacuum cleanup
o slave delays WAL application
and in 9.1 we will be adding:
o slave communicates snapshots to master
How would this figure into what we ultimately want in 9.1?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
On Mon, 2010-05-03 at 15:04 -0700, Josh Berkus wrote:
I don't see the issue with Tom's approach from a wait perspective. The
max wait becomes 1.001X max_standby_delay; there's no way I can think of
that replay would wait longer than that. I've yet to see an explanation
why it would be longer.
Yes, the max wait on any *one* blocker will be max_standby_delay. But if
you wait for two blockers, then the total time by which the standby lags
will now be 2*max_standby_delay. Add a third, fourth etc and the standby
lag keeps rising.
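The accumulation argument can be put as a small calculation. This is a sketch under stated assumptions: blockers are met one after another, replay does not catch up in between, and the function name is made up for illustration.

```python
from typing import Iterable


def accumulated_lag(blocker_durations: Iterable[float],
                    max_standby_delay: float) -> float:
    """Total delay added to replay by a series of blockers.

    Each blocker delays replay by its own remaining run time, capped at
    max_standby_delay (after which it is cancelled).
    """
    return sum(min(duration, max_standby_delay) for duration in blocker_durations)


# Three blockers in a row with a 30 second per-wait limit:
# 30 (cancelled) + 10 (finished) + 30 (cancelled) seconds of added lag.
print(accumulated_lag([40.0, 10.0, 50.0], 30.0))

# Two long-running blockers give 2 * max_standby_delay.
print(accumulated_lag([999.0, 999.0], 30.0))
```

Each individual wait is bounded, but the sum has no upper limit as more blockers arrive.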
We need to avoid confusing these two measurables
* standby lag - defined as the total delay from when a WAL record is
written to the time the WAL record is applied. This includes both
transfer time and any delays imposed by Hot Standby.
* standby query delay - defined as the time that recovery will wait for
a query to complete before a cancellation takes place. (We could
complicate this by asking what happens when recovery is blocked twice by
the same query? Would it wait twice, or does it have to track how much
it has waited for each query in total so far?)
Currently max_standby_delay seeks to constrain the standby lag to a
particular value, as a way of providing a bounded time for failover, and
also to constrain the amount of WAL that needs to be stored as the lag
increases. Currently, there is no guaranteed minimum query delay given
to each query.
If every query is guaranteed its requested query delay then the standby
lag will be unbounded. Fewer cancellations, higher lag. Some people do
want this, though it is not currently available. We can do this with two
new GUCs:
* standby_query_delay - USERSET parameter that allows user to specify a
guaranteed query delay, anywhere from 0 to max_standby_query_delay
* max_standby_query_delay - SIGHUP parameter - parameter exists to
provide DBA with a limit on the USERSET standby_query_delay, though I
can see some would say this is optional
Current behaviour is same as global settings of
standby_query_delay = 0
max_standby_query_delay = 0
max_standby_delay = X
So if people want minimal cancellations they would specify
standby_query_delay = Y (e.g. 30)
max_standby_query_delay = Z (e.g. 300)
max_standby_delay = -1
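How the three proposed settings might interact can be sketched as below. None of this exists in the code: clamping an out-of-range request (rather than rejecting it) is an assumption, as is reading max_standby_delay = -1 as "no limit on lag".

```python
def effective_query_delay(standby_query_delay: int,
                          max_standby_query_delay: int) -> int:
    """Per-query guaranteed delay, limited by the DBA's ceiling.

    Assumption: an out-of-range user request is clamped, not rejected.
    """
    return max(0, min(standby_query_delay, max_standby_query_delay))


def lag_is_bounded(max_standby_delay: int) -> bool:
    """Assumption: -1 means standby lag is not constrained at all."""
    return max_standby_delay >= 0


# "Current behaviour" settings: no guaranteed query delay, bounded lag.
print(effective_query_delay(0, 0), lag_is_bounded(30))

# "Minimal cancellations" settings: 30 s guaranteed, lag unbounded.
print(effective_query_delay(30, 300), lag_is_bounded(-1))

# A user asking for more than the ceiling only gets the ceiling.
print(effective_query_delay(900, 300))
```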
--
Simon Riggs www.2ndQuadrant.com