Hot Standby query cancellation and Streaming Replication integration
I'm happy to see we've crossed the point where the worst of the Hot
Standby and Streaming Replication issues are sorted out. A look at the
to-do lists at http://wiki.postgresql.org/wiki/Hot_Standby_TODO and
http://wiki.postgresql.org/wiki/Streaming_Replication shows no Must-fix
items and 5 Serious Issues left for Hot Standby; there are 9 Streaming
Replication items there, which aren't as clearly prioritized yet.
Correct me if I'm wrong here, but those read to me like tweaks and
polishing rather than major architecture issues at this point, so I
believe that code is in the same position as HS: some serious issues,
but no really terrible parts.
The job Simon asked me to start on last week is figuring out which of
the listed HS "Serious Issues" should be promoted to must-fix items,
based on how easy they are to encounter. There are a number of HS
tunables that interact with one another, and depending on your
priorities a few ways you can try to optimize the configuration for
what I expect to be common use cases for this feature. I've written a
blog entry at
http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html
that tries to explain all that background clearly and relate the
implementation details to how I expect DBAs will perceive them. That
was a bit much to also cover here, and it has a broader potential
audience than just this list.
Attached is a tar file with some test case demo scripts that demonstrate
the worst of the problems here IMHO. A README in there outlines how to
set the problem demo up (presuming you've already gotten a HS pair
going). What this does is execute the following sequence continuously
on the master:
UPDATE pgbench_tellers SET tbalance = tbalance + <delta> WHERE tid = <tid>;
(the UPDATE is run several times per iteration, with varying values)
VACUUM pgbench_tellers;
Meanwhile, on the standby, the following long query runs on a few
sessions at once, again looping constantly:
SELECT sum(abalance) FROM pgbench_accounts;
It took a bit of testing to get the database scale and iteration times
to the point where the issue is easy to encounter on my system; I hope
it shows up easily enough for others with the values used. (I have a
similar work-in-progress demo that tries to trigger the b-tree deletion
problem too; I'll follow up on that once the storm of messages about
this topic dies down, as I think this demo is a prerequisite for it
anyway.)
I'm not sure what you might be expecting from the above combination, but
what actually happens is that many of the SELECT statements on the table
*that isn't even being updated* are canceled. You see this in the logs:
LOG: restored log file "0000000100000000000000A5" from archive
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be
removed.
STATEMENT: SELECT sum(abalance) FROM pgbench_accounts;
Basically, every time a WAL segment appears that wipes out a tuple the
SELECT expects to still be visible (the dead row left behind by the
update having been vacuumed away), the query is canceled. This
happens all the time the way I've set this up, and I don't feel like
this is a contrived demo. Having a long-running query on the standby
while things get updated and then periodically autovacuumed on the
primary is going to be extremely common in the sorts of production
systems that I expect to want HS the most.
Now, as explained on the blog entry and in the documentation, there are
all sorts of ways you can work around this issue by tweaking parameters
or doing fun tricks with dblink. You can prioritize any two of keeping
the standby current, letting long-running queries execute on the
standby, and keeping xid advances on the master moving forward as fast
as possible. But you can't get all three at once. The choices
available are really about the best you can do given a system that's
basically the old warm-standby approach improved by adding just Hot
Standby to the mix. Sure, you might make the conflict resolution a bit
smarter or make the UI for setting the parameters more friendly, and
there's already been plenty of argument and patching over all of that.
I don't want to belittle that work because it's been important to make
HS a useful standalone feature, but I feel like that's all triage rather
than looking for the most robust fix possible.
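For concreteness, the two main knobs in that triage are currently the following (the values here are purely illustrative; see the docs for the exact units and defaults):

```
# Standby side: how long recovery waits for conflicting queries before
# canceling them; raising it favors queries at the cost of stalling replay
max_standby_delay = 30             # illustrative value

# Master side: defer cleanup of dead rows by this many transactions,
# trading bloat on the master for fewer conflicts on the standby
vacuum_defer_cleanup_age = 10000   # illustrative value
```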
If you're running a system that also is using Streaming Replication,
there is a much better approach possible. This idea has been floating
around for a while and I am not taking credit for inventing it (too busy
tonight to dig into the archives to figure out exactly when this popped
up initially and who deserves credit for it). I'm just pointing out
that now is the time where it's actually possible to implement. The HS
TODO already includes the following action item, to resolve a serious
issue you can run into (that itself would be great to eliminate):
"Requires keep-alives with timestamps to be added to sync rep feature"
If those keep-alives flowed in both directions, and included both
timestamps *and* xid visibility information, the master could easily be
configured to hold open xid snapshots needed for long running queries on
the standby when that was necessary. I might be missing an
implementation detail here, but from a high level it seems like you
could make the walreceiver on the master publish the information about
where the standby has advanced to as a bit of ProcArray xmin data. Then
the master could only advance past where the standby says it cannot need
visibility behind anymore.
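To sketch the shape of it (pure pseudocode; none of these names are actual backend symbols, except that GetOldestXmin is where the horizon computation already happens):

```
-- Pseudocode sketch only; names invented for illustration:
standby, periodically:
    feedback_xmin := oldest xmin among snapshots held by standby queries
    send keepalive(now(), feedback_xmin) back up the SR connection

master, in the process handling that connection:
    on keepalive(ts, feedback_xmin):
        MyProc->xmin := feedback_xmin   -- publish it in the ProcArray

-- GetOldestXmin() on the master would then automatically fold the
-- standby's horizon into every VACUUM cutoff, so rows the standby
-- still needs are never removed.
```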
This is a much more elegant solution than any of the hacks available so
far. It would turn Hot Standby + Streaming Replication into a system
that stepped out of the way of the worst of the technical limitations of
HS alone. The master would easily and automatically avoid advancing
past where the queries running on the standby needed visibility back to,
essentially the same way cleanup is blocked during a long-running query
on the primary--except with the actual main query work offloaded to the
standby, the idea all along.
I don't know how difficult the keepalive feature was expected to be, and
there's certainly plenty of potential landmines in this whole xid export
idea. How to handle situations where the standby goes away for a while,
such as a network outage, so that it doesn't block the master from ever
cleaning up dead tuples is a concern. I wouldn't expect that to be too
serious of a blocker, given that if the standby isn't talking it
probably isn't running queries you need to worry about canceling
either. Not sure where else this can fall down, and unfortunately I
don't know nearly enough about the SR code to help myself with
implementing this feature. (I think Simon is in a similar
position--it's just not what we've been staring at the last few months).
But I do know that the current Hot Standby implementation is going to be
frustrating to configure correctly for people. If it's possible to make
most of that go away just by doing some final integration between it and
Streaming Replication that just wasn't practical to accomplish until
now, I think it's worth considering how to make that happen before the
final 9.0 release.
I really hope this discussion can stay focused on if and how it's
possible to improve this area, with the goal being to deliver a product
everyone can be proud of with the full feature set that makes this next
release a killer one. The features that have managed to all get into
this release already are fantastic, everyone who contributed should be
proud of that progress, and it's encouraging that the alpha4 date was
nailed. It would be easy to descend into finger-pointing for why
exactly this particular problem is only getting more visibility now, or
into schedule-oriented commentary suggesting it must be ignored because
it's too late to do anything about it. I hope everyone appreciates that
wandering down that path will not help make PostgreSQL 9.0 a better
release.
This issue is so easy to encounter, and looks so bad when it happens,
that I feel it could easily lead to an embarrassing situation for the
community if something isn't done about it before release.
--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
On Fri, Feb 26, 2010 at 8:33 AM, Greg Smith <greg@2ndquadrant.com> wrote:
I'm not sure what you might be expecting from the above combination, but
what actually happens is that many of the SELECT statements on the table
*that isn't even being updated* are canceled. You see this in the logs:
Well, I proposed that the default should be to wait forever when
applying WAL logs that conflict with a query, precisely because I
think the expectation is that things will "just work" and queries will
not fail unpredictably. Perhaps in your test a larger max_standby_delay
would have prevented the cancellations, but then as soon as you try a
query which lasts longer you would have to raise it again. There's no
safe value which will be right for everyone.
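For reference, the "wait forever" policy I'm describing would look like this with the setting as it currently stands, if I recall the semantics correctly:

```
# -1 is meant to wait forever: never cancel standby queries, stall
# WAL replay for as long as they run
max_standby_delay = -1
```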
If you're running a system that also is using Streaming Replication, there
is a much better approach possible.
So I think one of the main advantages of a log shipping system over
the trigger-based systems is precisely that it doesn't require the
master to do anything it wasn't doing already. There's nothing the
slave can do which can interfere with the master's normal operation.
This independence is really a huge feature. It means you can allow
users on the slave that you would never let near the master. The
master can continue running production query traffic while users run
all kinds of crazy queries on the slave and drive it into the ground
and the master will continue on blithely unaware that anything's
changed.
In the model you describe any long-lived queries on the slave cause
tables in the master to bloat with dead records.
I think this model is on the roadmap but it's not appropriate for
everyone and I think one of the benefits of having delayed it is that
it forces us to get the independent model right before throwing in
extra complications. It would be too easy to rely on the slave
feedback as an answer for hard questions about usability if we had it
and just ignore the question of what to do when it's not the right
solution for the user.
--
greg
Greg Smith wrote:
Attached is a tar file with some test case demo scripts that demonstrate
the worst of the problems here IMHO.
Thanks for that! We've been discussing this for ages, so it's nice to
have a concrete example.
I don't want to belittle that work because it's been important to make
HS a useful standalone feature, but I feel like that's all triage rather
than looking for the most robust fix possible.
Ideally the standby would stash away the old pages or tuples somewhere
so that it can still access them even after replaying the WAL records
that remove them from the main storage. I realize that's not going to
happen any time soon because it's hard to do, but that would really be
the most robust fix possible.
I don't know how difficult the keepalive feature was expected to be, and
there's certainly plenty of potential landmines in this whole xid export
idea.
One such landmine is that the keepalives need to flow from client to
server while the WAL records are flowing from server to client. We'll
have to crack that problem for synchronous replication too, but I think
that alone is a big enough problem to make this 9.1 material.
How to handle situations where the standby goes away for a while,
such as a network outage, so that it doesn't block the master from ever
cleaning up dead tuples is a concern.
Yeah, that's another issue that needs to be dealt with. You'd probably
need some kind of a configurable escape valve in the master, to let it
ignore a standby's snapshot once it gets too old.
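Something like a hypothetical master-side setting (the name here is invented, purely for illustration, not an actual parameter):

```
# Hypothetical escape valve: stop honoring a standby's reported
# snapshot once it is more than this many transactions behind
#standby_xmin_age_limit = 1000000   # invented name, illustration only
```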
But I do know that the current Hot Standby implementation is going to be
frustrating to configure correctly for people.
Perhaps others who are not as deep into the code as I am will have a
better view on this, but I seriously don't think that's such a big
issue. I think the max_standby_delay setting is quite intuitive and easy
to explain. Sure, it would be better if there was no tradeoff between
killing queries and stalling recovery, but I don't think it'll be that
hard to understand the tradeoff.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 26/02/10 08:33, Greg Smith wrote:
There are a number of HS
tunables that interact with one another, and depending on your priorities a
few ways you can try to optimize the configuration for what I expect to
be common use cases for this feature.
I've written a blog entry at
http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html
that tries to explain all that background clearly,
It did too. Thanks for the nice summary people can be pointed at.
I'm not sure what you might be expecting from the above combination, but
what actually happens is that many of the SELECT statements on the table
*that isn't even being updated* are canceled. You see this in the logs:
Hmm - this I'd already figured out for myself. It's just occurred to me
that this could well be the case between databases too. Database A gets
vacuumed, B gets its queries kicked off on the standby. Granted lots of
people just have the one main DB, but even so...
LOG: restored log file "0000000100000000000000A5" from archive
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be
removed.
STATEMENT: SELECT sum(abalance) FROM pgbench_accounts;

Basically, every time a WAL segment appears that wipes out a tuple that
SELECT expects should still be visible, because the dead row left behind
by the update has been vacuumed away, the query is canceled. This
happens all the time the way I've set this up, and I don't feel like
this is a contrived demo. Having a long-running query on the standby
while things get updated and then periodically autovacuumed on the
primary is going to be extremely common in the sorts of production
systems I expect want HS the most.
I can see pretty much everyone wanting HS+SR. Thousands of small DBs running
on VMs for a start. Free mostly-live backup? Got to be a winner.
Dumb non-hacker question: why do we cancel all transactions rather than
just those with "ACCESS SHARE" on the vacuumed table in question? Is it
the simple fact that we don't know what table this particular section of
WAL affects, or is it the complexity of tracking all this info?
If you're running a system that also is using Streaming Replication,
there is a much better approach possible.
"Requires keep-alives with timestamps to be added to sync rep feature"
If those keep-alives flowed in both directions, and included both
timestamps *and* xid visibility information, the master could easily be
configured to hold open xid snapshots needed for long running queries on
the standby when that was necessary.
Presumably meaning we need *another* config setting to prevent excessive
bloat on a heavily updated table on the master.
--
Richard Huxton
Archonet Ltd
On 26/02/10 14:10, Heikki Linnakangas wrote:
Ideally the standby would stash away the old pages or tuples somewhere
so that it can still access them even after replaying the WAL records
that remove them from the main storage. I realize that's not going to
happen any time soon because it's hard to do, but that would really be
the most robust fix possible.
Something like snapshotting a filesystem, so updates continue while
you're still looking at a static version.
--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
On 26/02/10 08:33, Greg Smith wrote:
I'm not sure what you might be expecting from the above combination, but
what actually happens is that many of the SELECT statements on the table
*that isn't even being updated* are canceled. You see this in the logs:

Hmm - this I'd already figured out for myself. It's just occurred to me
that this could well be the case between databases too. Database A gets
vacuumed, B gets its queries kicked off on the standby.
No, it's per-database already. Only queries in the same database are
canceled.
Dumb non-hacker question: why do we cancel all transactions rather than
just those with "ACCESS SHARE" on the vacuumed table in question? Is it
the simple fact that we don't know what table this particular section of
WAL affects, or is it the complexity of tracking all this info?
The problem is that even if transaction X doesn't have an (access share)
lock on the vacuumed table at the moment, it might take one in the
future. Simon proposed mechanisms for storing the information about
vacuumed tables in shared memory, so that if X takes the lock later on
it will get canceled at that point, but that's 9.1 material.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 26/02/10 14:45, Heikki Linnakangas wrote:
Richard Huxton wrote:
On 26/02/10 08:33, Greg Smith wrote:
I'm not sure what you might be expecting from the above combination, but
what actually happens is that many of the SELECT statements on the table
*that isn't even being updated* are canceled. You see this in the logs:

Hmm - this I'd already figured out for myself. It's just occurred to me
that this could well be the case between databases too. Database A gets
vacuumed, B gets its queries kicked off on the standby.

No, it's per-database already. Only queries in the same database are
canceled.
That's a relief.
Dumb non-hacker question: why do we cancel all transactions rather than
just those with "ACCESS SHARE" on the vacuumed table in question? Is it
the simple fact that we don't know what table this particular section of
WAL affects, or is it the complexity of tracking all this info?

The problem is that even if transaction X doesn't have an (access share)
lock on the vacuumed table at the moment, it might take one in the
future. Simon proposed mechanisms for storing the information about
vacuumed tables in shared memory, so that if X takes the lock later on
it will get canceled at that point, but that's 9.1 material.
I see - we'd need to age the list of vacuumed tables too, so when the
oldest transactions complete the correct flags get cleared.
Can we not wait to cancel the transaction until *any* new lock is
attempted though? That should protect all the single-statement
long-running transactions that are already underway. Aggregates etc.
--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
Can we not wait to cancel the transaction until *any* new lock is
attempted though? That should protect all the single-statement
long-running transactions that are already underway. Aggregates etc.
Hmm, that's an interesting thought. You'll still need to somehow tell
the victim backend "you have to fail if you try to acquire any more
locks", but a single per-backend flag in the procarray would suffice.
You could also clear the flag whenever you free the last snapshot in the
transaction (ie. between each query in read committed mode).
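To sketch that flag's lifecycle (pseudocode only; the field and function names here are invented, not existing symbols):

```
-- Pseudocode sketch; names invented for illustration:
recovery, on a cleanup conflict:
    for each backend whose snapshot precedes the conflict horizon:
        proc->conflictPending := true   -- mark instead of canceling now

victim backend, on acquiring any new lock:
    if MyProc->conflictPending:
        ERROR: canceling statement due to conflict with recovery

victim backend, when its last snapshot is freed
(i.e. between queries in read committed mode):
    MyProc->conflictPending := false
```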
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Replying to my own post - first sign of madness...
Let's see if I've got the concepts clear here, and hopefully my thinking
it through will help others reading the archives.
There are two queues:
1. Cleanup on the master
2. Replay on the slave
Running write queries on the master adds to both queues.
Running (read-only) queries on the slave prevents you from removing
items from both queues.
There are two interesting measurements of "age"/"size":
1. Oldest item in / length of queue (knowable)
2. How long will it take to clear the queue (estimable at best)
You'd like to know #2 to keep up with your workload. Unfortunately, you
can't know it for certain unless you have control over new incoming
queries (on both master and slave).
You might want four separate GUCs for the two measurements on the two
queues. We currently have two that (sort of) match #1 "Oldest item"
(vacuum_defer_cleanup_age, max_standby_delay).
Delaying replay on a slave has no effect on the master. If a slave falls
too far behind it's responsible for catch-up (via normal WAL archives).
There is no point in delaying cleanup on the master unless it's going to
help one or more slaves. In fact, you don't want to start delaying
cleanup until you have to, otherwise you're wasting your delay time.
This seems to be the case with vacuum_defer_cleanup_age. If I have a
heavily-updated table and I defer vacuuming, then by the time any given
query starts on the slave the deferral window is likely to be half used
up already.
There's also no point in deferring cleanup on the master if the standby
is already waiting on a conflict that will cause its queries to be
cancelled anyway. Not only won't it help, but it might make things worse
since transactions will be cancelled, the conflict will be replayed and
(presumably) queries will be re-submitted only to be cancelled again.
This is what Greg Smith's discussion of the keep-alives was about.
Giving the master enough information to be smarter about cleanup (and
making the conflicts more fine-grained).
The situation with deferring on one or both ends of process just gets
more complicated with multiple slaves. There's all sorts of unpleasant
feedback loops I can envisage there.
For the case of a single slave being used to run long reporting queries
the ideal scenario would be the following. Master starts deferring
vacuum activity just before the query starts. When that times out, the
slave will receive the cleanup info, refuse to replay it and start its
delay. This gives you a total available query time of:
natural time between vacuums + vacuum delay + WAL transfer time +
standby delay
I can think of five useful things we should be doing (and might be
already - don't know).
1. On the master, deduce whether the slave is already waiting on a
query. If so, don't bother delaying cleanup. Clearly you don't want to
be signalling hundreds of times a second though. Does the slave pause
fetching via streaming replication if replay is blocked on a query?
Could we signal "half-way to max-age" or some such?
2. Perhaps simpler than trying to make the master smarter, just allow
SET this_transaction_is_probably_a_long_one=true on the slave. That (a)
clears the queue on the slave and (b) sends the signal to the master
which then starts deferring vacuum.
3. Do a burst of cleanup activity on the master after blocking. This
should concentrate conflicts together when they reach the slave. Perhaps
vacuum_defer_cleanup_age should be vacuum_deferred_queue_size and
measure the amount of work to do, rather than the max age of the oldest
cleanup (if I've understood correctly).
4. Do a burst of replay on the slave after blocking. Perhaps every time
it cancels a transaction it should replay at least half the queued WAL
before letting new transactions start. Or perhaps it replays any vacuum
activity it comes across and then stops. That should sync with #2
assuming the slave doesn't lag the master too much.
5. I've been mixing "defer" and "delay", as do the docs. We should
probably settle on one or the other. I think defer conveys the meaning
more precisely, but what about non-native English speakers?
--
Richard Huxton
Archonet Ltd
On Fri, Feb 26, 2010 at 10:21 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Richard Huxton wrote:
Can we not wait to cancel the transaction until *any* new lock is
attempted though? That should protect all the single-statement
long-running transactions that are already underway. Aggregates etc.

Hmm, that's an interesting thought. You'll still need to somehow tell
the victim backend "you have to fail if you try to acquire any more
locks", but a single per-backend flag in the procarray would suffice.You could also clear the flag whenever you free the last snapshot in the
transaction (ie. between each query in read committed mode).
Wow, that seems like it would help a lot. Although I'm not 100% sure
I follow all the details of how this works.
...Robert
On Fri, Feb 26, 2010 at 4:43 PM, Richard Huxton <dev@archonet.com> wrote:
Let's see if I've got the concepts clear here, and hopefully my thinking it
through will help others reading the archives.

There are two queues:
I don't see two queues. I only see the one queue of operations which
have been executed on the master but not replayed yet on the slave.
Every write operation on the master enqueues an operation to it and
every operation replayed on the slave dequeues from it. Only a subset
of operations create conflicts with concurrent transactions on the
slave, namely vacuums and a few similar operations (HOT pruning and
btree index pruning).
There's no question we need to make sure users have good tools to
monitor this queue and are aware of these tools. You can query each
slave for its currently replayed log position and hopefully you can
find out how long it's been delayed (ie, if it's looking at a log
record and waiting for a conflict to clear how long ago that log
record was generated). You can also find out what the log position is
on the master.
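The raw positions are already exposed as functions, for what it's worth, so a first cut at that monitoring looks something like this (as I understand the current docs):

```sql
-- On the master: the current WAL write position
SELECT pg_current_xlog_location();

-- On the standby: last WAL position received and last position replayed;
-- the gap between these and the master's position is the queue depth
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();
```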
--
greg
Heikki Linnakangas wrote:
How to handle situations where the standby goes away for a while,
such as a network outage, so that it doesn't block the master from ever
cleaning up dead tuples is a concern.

Yeah, that's another issue that needs to be dealt with. You'd probably
need some kind of a configurable escape valve in the master, to let it
ignore a standby's snapshot once it gets too old.

But I do know that the current Hot Standby implementation is going to be
frustrating to configure correctly for people.

Perhaps others who are not as deep into the code as I am will have a
better view on this, but I seriously don't think that's such a big
issue. I think the max_standby_delay setting is quite intuitive and easy
to explain. Sure, it would be better if there was no tradeoff between
killing queries and stalling recovery, but I don't think it'll be that
hard to understand the tradeoff.
Let's look at the five documented cases of query conflict (from our manual):
1 Access Exclusive Locks from primary node, including both explicit
LOCK commands and various DDL actions
2 Dropping tablespaces on the primary while standby queries are
using those tablespaces for temporary work files (work_mem
overflow)
3 Dropping databases on the primary while users are connected to
that database on the standby.
4 The standby waiting longer than max_standby_delay to acquire a
buffer cleanup lock.
5 Early cleanup of data still visible to the current query's
snapshot
We might have a solution to #1 by only cancelling queries that try to
take locks.
#2 and #3 seem like rare occurrences.
#4 can be controlled by max_standby_delay, where a large value only
delays playback during crash recovery --- again, a rare occurrence.
#5 could be handled by using vacuum_defer_cleanup_age on the master.
Why is vacuum_defer_cleanup_age not listed in postgresql.conf?
In summary, I think passing snapshots to the master is not something
possible for 9.0, and ideally we will never need to add that feature.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +
Greg Stark <gsstark@mit.edu> writes:
In the model you describe any long-lived queries on the slave cause
tables in the master to bloat with dead records.
Yup, same as they would do on the master.
I think this model is on the roadmap but it's not appropriate for
everyone and I think one of the benefits of having delayed it is that
it forces us to get the independent model right before throwing in
extra complications. It would be too easy to rely on the slave
feedback as an answer for hard questions about usability if we had it
and just ignore the question of what to do when it's not the right
solution for the user.
I'm going to make an unvarnished assertion here. I believe that the
notion of synchronizing the WAL stream against slave queries is
fundamentally wrong and we will never be able to make it work.
The information needed isn't available in the log stream and can't be
made available without very large additions (and consequent performance
penalties). As we start getting actual beta testing we are going to
uncover all sorts of missed cases that are not going to be fixable
without piling additional ugly kluges on top of the ones Simon has
already crammed into the system. Performance and reliability will both
suffer.
I think that what we are going to have to do before we can ship 9.0
is rip all of that stuff out and replace it with the sort of closed-loop
synchronization Greg Smith is pushing. It will probably be several
months before everyone is forced to accept that, which is why 9.0 is
not going to ship this year.
regards, tom lane
Tom Lane wrote:
I'm going to make an unvarnished assertion here. I believe that the
notion of synchronizing the WAL stream against slave queries is
fundamentally wrong and we will never be able to make it work.
The information needed isn't available in the log stream and can't be
made available without very large additions (and consequent performance
penalties). As we start getting actual beta testing we are going to
uncover all sorts of missed cases that are not going to be fixable
without piling additional ugly kluges on top of the ones Simon has
already crammed into the system. Performance and reliability will both
suffer.

I think that what we are going to have to do before we can ship 9.0
is rip all of that stuff out and replace it with the sort of closed-loop
synchronization Greg Smith is pushing. It will probably be several
months before everyone is forced to accept that, which is why 9.0 is
not going to ship this year.
Wow, can I have some varnish with that. :-O
You are right that we need to go down the road a bit before we know what
we need for 9.0 or 9.1.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +
On 2/26/10 6:57 AM, Richard Huxton wrote:
Can we not wait to cancel the transaction until *any* new lock is
attempted though? That should protect all the single-statement
long-running transactions that are already underway. Aggregates etc.
I like this approach. Is it fragile in some non-obvious way?
--Josh Berkus
On 2/26/10 10:53 AM, Tom Lane wrote:
I think that what we are going to have to do before we can ship 9.0
is rip all of that stuff out and replace it with the sort of closed-loop
synchronization Greg Smith is pushing. It will probably be several
months before everyone is forced to accept that, which is why 9.0 is
not going to ship this year.
I don't think that publishing visibility info back to the master ... and
subsequently burdening the master substantially for each additional
slave ... are what most users want. Certainly for use cases like NTT's,
it is, but not for most of our users.
In fact, I seem to remember specifically discussing the approach of
trying to publish snapshots back to the master, and rejecting it on this
list during the development of SR.
Does anyone know how Oracle solves these issues? Does their structure
(separate rollback log) make it easier for them?
--Josh Berkus
Josh Berkus <josh@agliodbs.com> writes:
On 2/26/10 10:53 AM, Tom Lane wrote:
I think that what we are going to have to do before we can ship 9.0
is rip all of that stuff out and replace it with the sort of closed-loop
synchronization Greg Smith is pushing. It will probably be several
months before everyone is forced to accept that, which is why 9.0 is
not going to ship this year.
I don't think that publishing visibility info back to the master ... and
subsequently burdening the master substantially for each additional
slave ... are what most users want.
I don't see a "substantial additional burden" there. What I would
imagine is needed is that the slave transmits a single number back
--- its current oldest xmin --- and the walsender process publishes
that number as its transaction xmin in its PGPROC entry on the master.
I don't doubt that this approach will have its own gotchas that we
find as we get into it. But it looks soluble. I have no faith in
either the correctness or the usability of the approach currently
being pursued.
regards, tom lane
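To make the trade-off concrete, here is a toy model of the feedback loop Tom describes: the standby reports its oldest running-query xmin, the walsender publishes it as if it were an ordinary backend's xmin, and VACUUM's cutoff honors it. This is purely illustrative Python, not PostgreSQL code; all class and method names here are invented for the sketch.

```python
class Master:
    """Toy model of the master's xmin bookkeeping under standby feedback."""

    def __init__(self):
        self.backend_xmins = {}     # backend id -> oldest xmin of its snapshot
        self.walsender_xmin = None  # xmin advertised on behalf of the standby

    def receive_standby_feedback(self, standby_oldest_xmin):
        # The walsender publishes the standby's single number in its
        # PGPROC slot, just like a local long-running transaction would.
        self.walsender_xmin = standby_oldest_xmin

    def vacuum_horizon(self):
        # VACUUM may only remove rows dead to every visible xmin,
        # including the one published for the standby.
        xmins = list(self.backend_xmins.values())
        if self.walsender_xmin is not None:
            xmins.append(self.walsender_xmin)
        return min(xmins) if xmins else None


m = Master()
m.backend_xmins["local_backend"] = 1500
m.receive_standby_feedback(1200)   # a long report is running on the standby
assert m.vacuum_horizon() == 1200  # no standby cancellations, but the
                                   # master now retains the dead rows
```

The last assertion is exactly the bloat concern raised downthread: the standby's long query holds back the master's vacuum horizon for its entire duration.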
I don't see a "substantial additional burden" there. What I would imagine is needed is that the slave transmits a single number back --- its current oldest xmin --- and the walsender process publishes that number as its transaction xmin in its PGPROC entry on the master.
If the main purpose of the slave is long-running queries, though, this
could cause a lot of bloat on the master. That's a special case, but a
reason why we would want to preserve the stop replication functionality.
I don't doubt that this approach will have its own gotchas that we
find as we get into it. But it looks soluble. I have no faith in
either the correctness or the usability of the approach currently
being pursued.
So, why not start working on it now, instead of arguing about it? It'll
be easy to prove the approach once we have some test code.
Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
On 2/26/10 10:53 AM, Tom Lane wrote:
I think that what we are going to have to do before we can ship 9.0
is rip all of that stuff out and replace it with the sort of closed-loop
synchronization Greg Smith is pushing. It will probably be several
months before everyone is forced to accept that, which is why 9.0 is
not going to ship this year.
I don't think that publishing visibility info back to the master ... and
subsequently burdening the master substantially for each additional
slave ... are what most users want.
I don't see a "substantial additional burden" there. What I would imagine is needed is that the slave transmits a single number back --- its current oldest xmin --- and the walsender process publishes that number as its transaction xmin in its PGPROC entry on the master.
The additional burden comes from the old snapshot effect. It makes it
unusable for offloading reporting queries, for example. In general, it
is a very good architectural property that the master is not affected by
what happens in a standby, and a closed-loop synchronization would break
that.
I don't actually understand how tight synchronization on its own would
solve the problem. What if the connection to the master is lost? Do you
kill all queries in the standby before reconnecting?
One way to think about this is to first consider a simple stop-and-go
system. Clearly the database must be consistent at any point in the WAL
sequence, if recovery was stopped and the database started up. So it is
always safe to pause recovery and run a read-only query against the
database as it is at that point in time (this assumes that the index
"cleanup" operations are not required for consistent query results BTW).
After the read-only transaction is finished, you can continue recovery.
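As a minimal sketch of that stop-and-go scheme (a toy simulation, not PostgreSQL code; the class and its fields are invented for illustration): recovery and read-only queries simply take turns, so while any query runs, replay is paused and the query sees one consistent point in the WAL sequence.

```python
class StopAndGoStandby:
    """Toy standby where WAL replay and read-only queries are exclusive."""

    def __init__(self):
        self.replayed_lsn = 0
        self.active_queries = 0

    def replay(self, wal_records):
        applied = 0
        for _ in wal_records:
            if self.active_queries > 0:
                break                  # pause: a query holds recovery still
            self.replayed_lsn += 1
            applied += 1
        return applied

    def run_query(self, fn):
        self.active_queries += 1       # database frozen at replayed_lsn
        try:
            return fn(self.replayed_lsn)
        finally:
            self.active_queries -= 1   # recovery may continue


s = StopAndGoStandby()
assert s.replay(range(3)) == 3         # no queries: replay proceeds
s.active_queries = 1
assert s.replay(range(3)) == 0         # replay is paused while a query runs
```

The cost is obvious from the model: a long query stalls replay entirely, which is why the relaxed variants below are interesting.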
The next step up is to relax that so that you allow replay of those WAL
records that are known to not cause trouble to the read-only queries.
For example, heap_insert records are very innocent, they only add rows
with a yet-uncommitted xmin.
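That relaxation amounts to classifying WAL record types by whether replaying them can change what a running read-only query sees. A toy classifier, with record names chosen for illustration only:

```python
# Records like heap_insert only add tuples with a not-yet-committed xmin,
# invisible to every existing snapshot, so they are always safe to apply.
SAFE_TO_REPLAY = {"heap_insert"}

# Commit records, vacuum, and index-deletion records can change or remove
# what existing snapshots see, so they must wait (or cancel queries).
CONFLICTING = {"commit", "vacuum", "btree_delete"}


def can_replay(record_type, queries_running):
    if record_type in SAFE_TO_REPLAY:
        return True
    return not queries_running         # conflicting records wait for queries


assert can_replay("heap_insert", queries_running=True)
assert not can_replay("vacuum", queries_running=True)
assert can_replay("commit", queries_running=False)
```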
Things get more complex when you allow the replay of commit records; all
the known-assigned-xids tracking is related to that, so that
transactions that were not yet committed when a snapshot was taken in
the standby are still considered uncommitted by that snapshot even after
the commit record is later replayed. If that feels too fragile, there might
be other methods to achieve that. One I once pondered is to not track
all in-progress transactions in shared memory like we do now, but only
OldestXmin. When a backend wants to take a snapshot in the slave, it
memcpy()s clog from OldestXmin to the latest committed XID, and includes
it in the snapshot. The visibility checks use the copy instead of the
actual clog, so they see the situation as it was when the snapshot was
taken. To keep track of the OldestXmin in the slave, the master can emit
that as a WAL record every now and then; it's ok if it lags behind.
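A sketch of that clog-copying idea (illustrative Python only; the clog is modeled as a dict and the function names are invented): the snapshot freezes a copy of the commit-status bits between OldestXmin and the latest committed xid, and visibility tests consult the copy, so replaying a commit record afterwards cannot change what the snapshot sees.

```python
def take_snapshot(clog, oldest_xmin, latest_committed_xid):
    # The memcpy() of the relevant clog range, done at snapshot time.
    return {
        "oldest_xmin": oldest_xmin,
        "latest_xid": latest_committed_xid,
        "clog_copy": dict(clog),       # frozen commit-status bits
    }


def xid_visible(snapshot, xid):
    if xid < snapshot["oldest_xmin"]:
        return True                    # committed before the horizon
    if xid > snapshot["latest_xid"]:
        return False                   # started after the snapshot
    return snapshot["clog_copy"].get(xid) == "committed"


clog = {100: "committed", 101: "in_progress"}
snap = take_snapshot(clog, oldest_xmin=100, latest_committed_xid=101)
clog[101] = "committed"                # commit record replayed later
assert xid_visible(snap, 101) is False # the frozen copy still wins
assert xid_visible(snap, 100) is True
```

The appeal of this design is that the standby needs only OldestXmin from the master, not a full list of in-progress transactions, and it tolerates that number lagging behind.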
Then there's the WAL record types that remove data that might still be
required by the read-only transactions. This includes vacuum and index
deletion records.
If you really think the current approach is unworkable, I'd suggest that
we fall back to a stop-and-go system, where you either let recovery
progress or allow queries to run, but not both at the same time. But
FWIW I don't think the situation is that grave.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Feb 26, 2010 at 7:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't see a "substantial additional burden" there. What I would imagine is needed is that the slave transmits a single number back --- its current oldest xmin --- and the walsender process publishes that number as its transaction xmin in its PGPROC entry on the master.
And when we want to support cascading slaves?
Or when you want to bring up a new slave and it suddenly starts
advertising a new xmin that's older than the current oldestxmin?
But in any case if I were running a reporting database I would want it
to just stop replaying logs for a few hours while my big batch report
runs, not cause the master to be unable to vacuum any dead records for
hours. That defeats much of the purpose of running the queries on the
slave.
--
greg