Issues with two-server Synch Rep
Simon, Fujii,
What follows are what I see as the major issues with making two-server
synch replication work well. I would like to have you each answer them,
explaining how your patch and your design addresses each issue. I
believe this will go a long way towards helping the majority of the
community understand the options we have from your code, as well as
where help is still needed.
Adding a Synch Standby
-----------------------
What is the procedure for adding a new synchronous standby in your
implementation? That is, how do we go from having a standby server with
an empty PGDATA to having a working synchronous standby?
Snapshot Publication
---------------------
During 9.0 development discussion, one of the things we realized we
needed for synch standby was publication of snapshots back to the master
in order to prevent query cancel on the standby. Without this, the
synch standby is useless for running read queries. Does your patch
implement this? Please describe.
Management
-----------
One of the serious flaws currently in HS/SR is complexity of
administration. Setting up and configuring even a single master and
single standby requires editing up to 6 configuration files in Postgres,
as well as dealing with file permissions. As such, any Synch Rep patch
must work together with attempts to simplify administration. How does
your design do this?
Monitoring
-----------
Synch rep offers severe penalties to availability if a synch standby
gets behind or goes down. What replication-specific monitoring tools
and hooks are available to allow administrators to take action before the
database becomes unavailable?
Degradation
------------
In the event that the synch rep standby falls too far behind or becomes
unavailable, or is deliberately taken offline, what are you envisioning
as the process for the DBA resolving the situation? Is there any
ability to commit "stuck" transactions?
Client Consistency
---------------------
With a standby in "apply" mode, and a master failure at the wrong time,
there is the possibility that the Standby will apply a transaction at
the same time that the master crashes, causing the client to never
receive a commit message. Once the client reconnects to the standby,
how will it know whether its transaction was committed or not?
As a lesser case, a standby in "apply" mode will show the results of
committed transactions *before* they are visible on the master. Is
there any need to handle this? If so, how?
Performance
------------
As with XA, synch rep has the potential to be so slow as to be unusable.
What optimizations do you make in your approach to synch rep to make it
faster than two-phase commit? What other performance optimizations have
you added?
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Thu, Oct 7, 2010 at 2:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
What is the procedure for adding a new synchronous standby in your
implementation? That is, how do we go from having a standby server with
an empty PGDATA to having a working synchronous standby?
I'll take a crack at answering these. I don't think that the
procedure for setting up a standby server is going to change much.
The idea is presumably that you set up an async standby more or less
as you do now and then make whatever configuration changes are
necessary to flip it to synchronous.
During 9.0 development discussion, one of the things we realized we
needed for synch standby was publication of snapshots back to the master
in order to prevent query cancel on the standby. Without this, the
synch standby is useless for running read queries. Does your patch
implement this? Please describe.
This is a completely separate issue from making replication
synchronous. And, really? Useless for running read queries?
One of the serious flaws currently in HS/SR is complexity of
administration. Setting up and configuring even a single master and
single standby requires editing up to 6 configuration files in Postgres,
as well as dealing with file permissions. As such, any Synch Rep patch
must work together with attempts to simplify administration. How does
your design do this?
This is also completely out of scope for sync rep.
Synch rep offers severe penalties to availability if a synch standby
gets behind or goes down. What replication-specific monitoring tools
and hooks are available to allow administrators to take action before the
database becomes unavailable?
I don't think there's much hope of allowing administrators to take
action BEFORE the database becomes unavailable. The point of making
replication synchronous rather than asynchronous is that the slave
can't be behind AT ALL, and if it goes down the primary is immediately
stuck. If the synchronous standby vanishes, the master can recover
if:
1. We turn off synchronous replication, or
2. TCP keepalives or some other mechanism kills the master-slave
connection after a suitable timeout, and we interpret (or configure)
no connected standbys = stop synchronous replication.
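A sketch of how options (1) and (2) might surface as configuration, using illustrative parameter names only (the actual GUCs, if any, depend on which patch gets committed):

```
# postgresql.conf on the master -- every name below is hypothetical
synchronous_replication = on      # wait for the standby's ACK before reporting commit
replication_timeout = 30s         # drop a walsender connection after this much silence
allow_standalone_primary = on     # with no connected standby, degrade to async (option 2)
```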
In the event that the synch rep standby falls too far behind or becomes
unavailable, or is deliberately taken offline, what are you envisioning
as the process for the DBA resolving the situation? Is there any
ability to commit "stuck" transactions?
Again, it can't fall "too far" behind. It can't be behind at all.
Any stuck transactions are necessarily already committed; the commit
just hasn't been acknowledged to the client yet. Presumably, if
synchronous replication is disabled via (1) or (2) above, then any
outstanding committed-but-unacknowledged-to-the-client transactions
should notify the client of the commit and continue on.
With a standby in "apply" mode, and a master failure at the wrong time,
there is the possibility that the Standby will apply a transaction at
the same time that the master crashes, causing the client to never
receive a commit message. Once the client reconnects to the standby,
how will it know whether its transaction was committed or not?
If a client loses the connection after issuing a commit but before
receiving the acknowledgment, it can't know whether the commit
happened or not. This is true regardless of whether there is a
standby and regardless of whether that standby is synchronous.
Clients that care need to implement their own mechanisms for resolving
this difficulty.
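One client-side mechanism (not part of any patch here) is to write an application-generated token in the same transaction as the real work; after reconnecting, the client looks the token up to learn the commit's fate instead of guessing. A minimal Python simulation of the pattern, with a toy class standing in for the server:

```python
import uuid

class Database:
    """Toy stand-in for the server: the commit lands before the ACK goes
    out, so a crash in between leaves the client uncertain."""
    def __init__(self):
        self.committed_tokens = set()

    def commit(self, token, crash_before_ack=False):
        # The server commits first, then acknowledges.
        self.committed_tokens.add(token)
        if crash_before_ack:
            raise ConnectionError("connection lost before COMMIT acknowledgment")
        return "COMMIT"

    def was_committed(self, token):
        # After reconnecting (possibly to a promoted standby), the client
        # resolves the ambiguity by looking the token up.
        return token in self.committed_tokens

db = Database()
token = str(uuid.uuid4())  # written in the same transaction as the real work
try:
    db.commit(token, crash_before_ack=True)
    committed = True
except ConnectionError:
    committed = db.was_committed(token)

print(committed)  # True: the commit happened; only the ACK was lost
```

Since a synchronous standby is guaranteed to hold every acknowledged commit, absence of the token on a promoted standby tells the client the transaction never committed.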
As a lesser case, a standby in "apply" mode will show the results of
committed transactions *before* they are visible on the master. Is
there any need to handle this? If so, how?
It's theoretically impossible for the transaction to become visible
everywhere simultaneously. It's already the case that transactions
become visible to other backends before the backend doing the commit
has received an acknowledgment. Any client relying on any other
behavior is already broken.
As with XA, synch rep has the potential to be so slow as to be unusable.
What optimizations do you make in your approach to synch rep to make it
faster than two-phase commit? What other performance optimizations have
you added?
Sync rep is going to be slow, period. Every implementation currently
on the table has to fsync on the master, and then send the commit xlog
record to the slave and wait for an acknowledgment from the slave.
Allowing those to happen in parallel is going to be Hard. Also, the
interaction with max_standby_delay is going to be a big problem, I
suspect.
As for the specific optimizations in each patch, I believe the major
thing that differs between them is the exact timing of the
acknowledgments; but perhaps I should let the patch authors speak to
that question, if they wish to do so.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,
I'll take a crack at answering these. I don't think that the
procedure for setting up a standby server is going to change much.
The idea is presumably that you set up an async standby more or less
as you do now and then make whatever configuration changes are
necessary to flip it to synchronous.
What is the specific "flip" procedure, though? For one thing, I want to
make sure that it's not necessary to restart the master or the standby
to "flip" it, since that would be a catch-22.
This is a completely separate issue from making replication
synchronous. And, really? Useless for running read queries?
Absolutely. For a synch standby, you can't tolerate any standby delay
at all. This means that anywhere from 1/4 to 3/4 of queries on the
standby would be cancelled on any high-traffic OLTP server. Hence,
"useless".
As such, any Synch Rep patch
must work together with attempts to simplify administration. How does
your design do this?
This is also completely out of scope for sync rep.
It is not, given that I've seen several proposals for synch rep which
would make asynch rep even more complicated than it already is. I'm
taking the stance that any sync rep design which *blocks* making asynch
rep easier to use is fundamentally flawed and can't be accepted.
I don't think there's much hope of allowing administrators to take
action BEFORE the database becomes unavailable.
I'd swear that you were working as a DBA less than a year ago, but I
couldn't tell it from that statement.
There is every bit of value in allowing DBAs to view, and chart,
response times on the standby for ACK. That way they can notice an
increase in response times and take action to improve the standby
*before* it locks up the system.
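As a sketch of the kind of tooling this enables, assume the server merely exposes per-commit ACK round-trip times (the view or hook that would export them is hypothetical); a trend alarm on top of that metric is then trivial:

```python
from collections import deque

class AckLatencyMonitor:
    """Alarm when standby ACK round-trips trend above a threshold,
    giving the DBA a chance to act before commits start stalling."""
    def __init__(self, window=100, threshold_ms=50.0):
        self.samples = deque(maxlen=window)
        self.threshold_ms = threshold_ms

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def moving_average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def alarmed(self):
        return self.moving_average() > self.threshold_ms

mon = AckLatencyMonitor(window=5, threshold_ms=10.0)
for ms in (2.0, 3.0, 2.5, 3.5, 2.0):       # healthy baseline
    mon.record(ms)
print(mon.alarmed())                        # False
for ms in (40.0, 55.0, 60.0, 70.0, 80.0):   # standby degrading
    mon.record(ms)
print(mon.alarmed())                        # True: act before commits stall
```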
Presumably, if
synchronous replication is disabled via (1) or (2) above, then any
outstanding committed-but-unacknowledged-to-the-client transactions
should notify the client of the commit and continue on.
That's what I was asking about. I'm not "presuming" that any pending
patch covers any such eventuality until it's confirmed.
If a client loses the connection after issuing a commit but before
receiving the acknowledgment, it can't know whether the commit
happened or not. This is true regardless of whether there is a
standby and regardless of whether that standby is synchronous.
Clients that care need to implement their own mechanisms for resolving
this difficulty.
That's a handwavy way of saying "go away, don't bother us with such
details". For the client to resolve the situation, then *it* needs to
be able to tell whether or not the transaction was committed. How would
it do this, exactly?
It's theoretically impossible for the transaction to become visible
everywhere simultaneously. It's already the case that transactions
become visible to other backends before the backend doing the commit
has received an acknowledgment. Any client relying on any other
behavior is already broken.
So, your opinion is "it's out of scope to handle this issue" ?
Sync rep is going to be slow, period. Every implementation currently
on the table has to fsync on the master, and then send the commit xlog
record to the slave and wait for an acknowledgment from the slave.
Allowing those to happen in parallel is going to be Hard.
Yes, but it's something we need to address. XA is widely distrusted and
is seen as inadequate for high-traffic OLTP systems precisely because it
is SO slow. If we want to create a synch rep system which people will
want to use, then it has to be faster than XA. If it's not faster than
XA, why bother creating it? We already have 2PC.
Also, the
interaction with max_standby_delay is going to be a big problem, I
suspect.
Interaction? My opinion is that the two are completely incompatible.
You can't have synch rep and also have standby_delay > 0.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Oct 11, 2010 at 2:07 PM, Josh Berkus <josh@agliodbs.com> wrote:
I'll take a crack at answering these. I don't think that the
procedure for setting up a standby server is going to change much.
The idea is presumably that you set up an async standby more or less
as you do now and then make whatever configuration changes are
necessary to flip it to synchronous.
What is the specific "flip" procedure, though? For one thing, I want to
make sure that it's not necessary to restart the master or the standby
to "flip" it, since that would be a catch-22.
Obviously. I presume it'll be something like "update postgresql.conf
or recovery.conf and run pg_ctl reload", but I haven't (yet, anyway)
verified the actual behavior of the patches. If the above isn't
feasible then we have a problem.
This is a completely separate issue from making replication
synchronous. And, really? Useless for running read queries?
Absolutely. For a synch standby, you can't tolerate any standby delay
at all. This means that anywhere from 1/4 to 3/4 of queries on the
standby would be cancelled on any high-traffic OLTP server. Hence,
"useless".
What is your source for those numbers? They could be right, but I
simply don't know.
At any rate, I don't disagree that we have a problem. In fact, I
think we have a whole series of problems. The whole architecture of
replication as it exists in PG is pretty fundamentally limited right
now. Right now, a pruning operation on the master (regardless of
whether it's a HOT prune or vacuum) can happen when there are still
snapshots on the slave that need that data. Our only options are to
either wait for those snapshots to go away, or kill off the
queries/transactions that took them. Adding an XID feedback from the
slave to the master "fixes" the problem by preventing the master from
pruning those tuples until the slave no longer needs them, but at the
expense of bloating the master and all other standbys. That may,
indeed, be better for some use cases, but it's not really all that
good. It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
A further grump about our current architecture is that it doesn't seem
at all clear how to make it work for partial replication. I have to
wonder whether we are going down the wrong path completely and need to
hit the reset button. But neither this nor the pruning problem are
things that we can reasonably expect the sync rep patch to solve, if
we want it to get committed this release cycle.
As such, any Synch Rep patch
must work together with attempts to simplify administration. How does
your design do this?
This is also completely out of scope for sync rep.
It is not, given that I've seen several proposals for synch rep which
would make asynch rep even more complicated than it already is.
I'm not aware of any proposals on the table which would do that.
I'm
taking the stance that any sync rep design which *blocks* making asynch
rep easier to use is fundamentally flawed and can't be accepted.
Do you have some ideas on how to simplify it? How will we know
whether a particular design for sync rep does this?
I don't think there's much hope of allowing administrators to take
action BEFORE the database becomes unavailable.
I'd swear that you were working as a DBA less than a year ago, but I
couldn't tell it from that statement.
Your comment sounded to me like you were asking for a schedule of all
future unplanned outages.
There is every bit of value in allowing DBAs to view, and chart,
response times on the standby for ACK. That way they can notice an
increase in response times and take action to improve the standby
*before* it locks up the system.
Sure, that would be nice to have, and it's a good idea. But I don't
think that's going to be a common failure mode. What I expect to
happen is the standby to hum along with no problem for a long time and
then either kick a disk or suffer a power outage. There's very little
monitoring we can do within PG that will notice either of those things
coming. There might be some external-to-PG monitoring that can be
done, but if there's a massive blackout or a terrorist attack or
somebody trips over the power cord, you're just going to get
surprised.
Presumably, if
synchronous replication is disabled via (1) or (2) above, then any
outstanding committed-but-unacknowledged-to-the-client transactions
should notify the client of the commit and continue on.
That's what I was asking about. I'm not "presuming" that any pending
patch covers any such eventuality until it's confirmed.
Yep, we need to confirm that.
If a client loses the connection after issuing a commit but before
receiving the acknowledgment, it can't know whether the commit
happened or not. This is true regardless of whether there is a
standby and regardless of whether that standby is synchronous.
Clients that care need to implement their own mechanisms for resolving
this difficulty.
That's a handwavy way of saying "go away, don't bother us with such
details". For the client to resolve the situation, then *it* needs to
be able to tell whether or not the transaction was committed. How would
it do this, exactly?
No, it isn't at all. What does your application do NOW if the master
goes down after you've sent a commit and before you get an
acknowledgment back? Does it assume that the transaction is
committed, or does it assume the transaction was aborted by a crash on
the master? Either is possible, right?
It's theoretically impossible for the transaction to become visible
everywhere simultaneously. It's already the case that transactions
become visible to other backends before the backend doing the commit
has received an acknowledgment. Any client relying on any other
behavior is already broken.
So, your opinion is "it's out of scope to handle this issue" ?
What handling of it would you propose? Consider the case where you
just have one server and no standbys. A client connects, does some
work, and says COMMIT. There is some finite amount of time after the
COMMIT happens and before the client gets the acknowledgment back that
the commit has succeeded. During that time, another transaction that
starts up will see the effects of the COMMIT - BEFORE the transaction
itself knows that it is committed. There's not much you can do about
this. You have to do the commit on the server before sending the
response back to the client.
In the sync rep case, you're going to get the same behavior. After
the client has asked for commit and before the commit has been
acknowledged, there's no guarantee whether another transaction that
starts up during that in-between time sees the transaction or not.
The only further anomaly that can happen as a result of sync rep is
that, in apply mode, the transaction's effects will become visible on
the standby before they are visible on the master, so if you fire off
a COMMIT, and then before receiving the acknowledgment start a
transaction on the standby, and then just after that start a
transaction on the master, and then just after that you get back an
acknowledgment that the COMMIT completed, you might have a snapshot on
the master that was taken afterwards chronologically but shows the
effects of fewer committed XIDs - i.e. time has gone backwards.
Unfortunately, short of a global transaction manager, this is an
unsolvable problem, and that's definitely more than is going to happen
for 9.1, I think.
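The interleaving described above can be written out as a toy timeline; nothing here is an implementation, just the ordering of events that produces the anomaly:

```python
# Each event: (time, place, what happened). In "apply" mode the standby
# makes the transaction visible before the master acknowledges it.
events = [
    (1, "client",  "COMMIT sent"),
    (2, "standby", "xlog applied: txn visible on standby"),
    (3, "reader",  "snapshot taken on standby -> sees txn"),
    (4, "reader",  "snapshot taken on master -> does NOT see txn yet"),
    (5, "master",  "commit acknowledged: txn visible on master"),
]
standby_sees = next(t for t, _, w in events if w.endswith("sees txn"))
master_hides = next(t for t, _, w in events if "NOT see" in w)
# The chronologically later master snapshot shows *fewer* committed
# transactions than the earlier standby one: time appears to run backwards.
print(master_hides > standby_sees)  # True
```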
Sync rep is going to be slow, period. Every implementation currently
on the table has to fsync on the master, and then send the commit xlog
record to the slave and wait for an acknowledgment from the slave.
Allowing those to happen in parallel is going to be Hard.
Yes, but it's something we need to address.
I agree, but it's not something we can address in the first patch,
which is hard enough without adding things that make it even harder.
We need to get something simple committed first and then build on it.
XA is widely distrusted and
is seen as inadequate for high-traffic OLTP systems precisely because it
is SO slow. If we want to create a synch rep system which people will
want to use, then it has to be faster than XA. If it's not faster than
XA, why bother creating it? We already have 2PC.
I don't know anything about XA so I can't comment on this.
Also, the
interaction with max_standby_delay is going to be a big problem, I
suspect.
Interaction? My opinion is that the two are completely incompatible.
You can't have synch rep and also have standby_delay > 0.
We seem to be in violent agreement on this point. I was saying the
same thing in a different way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 2010-10-11 at 15:22 -0400, Robert Haas wrote:
This is a completely separate issue from making replication
synchronous. And, really? Useless for running read queries?
Absolutely. For a synch standby, you can't tolerate any standby delay
at all. This means that anywhere from 1/4 to 3/4 of queries on the
standby would be cancelled on any high-traffic OLTP server. Hence,
"useless".
What is your source for those numbers? They could be right, but I
simply don't know.
I was initially taken aback by the word "useless" as well. However, I
had trouble thinking of a use case that isn't better solved by sync rep
without HS, or async rep. I don't have the numbers either though, so
perhaps someone does have a use case.
It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
That doesn't seem just a matter of code, it seems like a major design
conflict. If the master removes a tuple and then re-uses that space for
another tuple, you can't move those changes to the standby unless you
cancel queries dependent on the first tuple.
A further grump about our current architecture is that it doesn't seem
at all clear how to make it work for partial replication. I have to
wonder whether we are going down the wrong path completely and need to
hit the reset button. But neither this nor the pruning problem are
things that we can reasonably expect the sync rep patch to solve, if
we want it to get committed this release cycle.
What we have is physical replication, but you seem to be talking about
logical replication (evidence: Slony solves both the independent cleanup
problem and partial replication).
Both logical and physical replication have a place, and I don't believe
either is the "wrong" path. If we'd like to add logical replication, we
may be better off starting with Slony (or Londiste) and working from
there.
Regards,
Jeff Davis
On Mon, Oct 11, 2010 at 4:01 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Mon, 2010-10-11 at 15:22 -0400, Robert Haas wrote:
This is a completely separate issue from making replication
synchronous. And, really? Useless for running read queries?
Absolutely. For a synch standby, you can't tolerate any standby delay
at all. This means that anywhere from 1/4 to 3/4 of queries on the
standby would be cancelled on any high-traffic OLTP server. Hence,
"useless".
What is your source for those numbers? They could be right, but I
simply don't know.
I was initially taken aback by the word "useless" as well. However, I
had trouble thinking of a use case that isn't better solved by sync rep
without HS, or async rep. I don't have the numbers either though, so
perhaps someone does have a use case.
The main use cases for synchronous replication seem to be (1) high
availability and (2) read scalability. That is, if you have 99%
writes and 1% reads, you can round-robin the reads and do all the
writes on the master. But I think we are quite a way from making (2)
work well enough to get excited about.
It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
That doesn't seem just a matter of code, it seems like a major design
conflict.
Yes. I had the idea of trying to fix this by allowing the standby to
retain old versions of entire pages that got cleaned up on the master,
until the transactions that might want to read the old pages were
gone. But that may be prohibitively difficult, not sure.
Alternatively, you could, as you say, do logical rather than physical
replication.
A further grump about our current architecture is that it doesn't seem
at all clear how to make it work for partial replication. I have to
wonder whether we are going down the wrong path completely and need to
hit the reset button. But neither this nor the pruning problem are
things that we can reasonably expect the sync rep patch to solve, if
we want it to get committed this release cycle.
What we have is physical replication, but you seem to be talking about
logical replication (evidence: Slony solves both the independent cleanup
problem and partial replication).
Both logical and physical replication have a place, and I don't believe
either is the "wrong" path. If we'd like to add logical replication, we
may be better off starting with Slony (or Londiste) and working from
there.
Yeah, that's possible. Or Mammoth Replicator.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Obviously. I presume it'll be something like "update postgresql.conf
or recovery.conf and run pg_ctl reload", but I haven't (yet, anyway)
verified the actual behavior of the patches. If the above isn't
feasible then we have a problem.
Right. That's why I asked the question. Mind you, a superuser function
on the master would be even better ...
What is your source for those numbers? They could be right, but I
simply don't know.
pg_bench tests with asynch rep and standby_delay = 0. Not rigorous, but
enough to show that there is a problem there. Doing pg_bench with a
small database
It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
"not excited" == terrified of the amount of troubleshooting involved,
and likely believing it's impossible.
A further grump about our current architecture is that it doesn't seem
at all clear how to make it work for partial replication. I have to
wonder whether we are going down the wrong path completely and need to
hit the reset button.
The way to do partial replication is Slony, Londiste, Bucardo, etc.
But neither this nor the pruning problem are
things that we can reasonably expect the sync rep patch to solve, if
we want it to get committed this release cycle.
It is not, given that I've seen several proposals for synch rep which
would make asynch rep even more complicated than it already is.
I'm not aware of any proposals on the table which would do that.
Standby registration?
Do you have some ideas on how to simplify it? How will we know
whether a particular design for sync rep does this?
That's a good point, I'll have to think about this and do a write-up.
Sure, that would be nice to have, and it's a good idea. But I don't
think that's going to be a common failure mode. What I expect to
happen is the standby to hum along with no problem for a long time and
then either kick a disk or suffer a power outage.
That might be more common, but it's not an argument against monitoring
what we *can* monitor for. More importantly, if monitoring ACK response
times -- and similar metrics -- is not available via core postgres, it
is impossible to find them out any other way. We need to give DBAs the
tools to do their jobs, even if the tools are at a very low level.
No, it isn't at all. What does your application do NOW if the master
goes down after you've sent a commit and before you get an
acknowledgment back? Does it assume that the transaction is
committed, or does it assume the transaction was aborted by a crash on
the master? Either is possible, right?
This problem certainly exists with async, it's just less likely so
people are ignoring it. With a high enough transaction rate, and a
standby in "apply" mode, it's *certain* to happen on synch rep. So we
can't ignore it as a problem anymore.
I don't have any brilliant ideas on a solution for this one.
So, your opinion is "it's out of scope to handle this issue" ?
What handling of it would you propose? Consider the case where you
I was asking a question. My original question was "do we need to handle
this?" I'm taking your viewpoint as "there's no reasonable way to
handle it, so we shouldn't." That's a fine answer. What I want is for
-hackers to make a *decision* about a very real problem, and not just
fail to discuss it.
I agree, but it's not something we can address in the first patch,
which is hard enough without adding things that make it even harder.
We need to get something simple committed first and then build on it.
The reason I posted the start of this thread is that I know that both
Fujii and Simon have thought about some of these questions, and even if
they don't have code for them, they have ideas. I want to read those
ideas explained. Further, the answers to these questions may tell the
rest of us which parts of each patch are the most valuable.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, 2010-10-11 at 16:07 -0400, Robert Haas wrote:
I was initially taken aback by the word "useless" as well. However, I
had trouble thinking of a use case that isn't better solved by sync rep
without HS, or async rep. I don't have the numbers either though, so
perhaps someone does have a use case.
The main use cases for synchronous replication seem to be (1) high
availability and (2) read scalability. That is, if you have 99%
writes and 1% reads, you can round-robin the reads and do all the
writes on the master. But I think we are quite a way from making (2)
work well enough to get excited about.
[ I assume you meant "99% reads and 1% writes" ]
Wouldn't the snapshot publication (as Josh called it) back to the master
work better for that use case?
I'm not even sure that it's the ratio that matters, but rather how
constant the writes are. 1% writes does not necessarily mean that a
random 1% of read queries fail on the standby. I don't have the numbers,
but SR + query cancel seems like the standby system would effectively be
down during write activity. I wouldn't be surprised if SR + query cancel
resulted in some frustrated users; but perhaps "useless" is too strong a
word.
It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
That doesn't seem just a matter of code, it seems like a major design
conflict.
Yes. I had the idea of trying to fix this by allowing the standby to
retain old versions of entire pages that got cleaned up on the master,
until the transactions that might want to read the old pages were
gone. But that may be prohibitively difficult, not sure.
I think you'd end up having a notion of a snapshot of block information
(like a FS with snapshots) inside of postgres.
Sounds like a lot of complexity to me, and the only benefit I see is
moving bloat from the primary to the standby. Granted, that would be
nice, but I would expect some costs aside from just the complexity.
Regards,
Jeff Davis
On Mon, Oct 11, 2010 at 5:40 PM, Josh Berkus <josh@agliodbs.com> wrote:
Obviously. I presume it'll be something like "update postgresql.conf
or recovery.conf and run pg_ctl reload", but I haven't (yet, anyway)
verified the actual behavior of the patches. If the above isn't
feasible then we have a problem.
Right. That's why I asked the question. Mind you, a superuser function
on the master would be even better ...
That's probably not going to happen until we have a way to update
postgresql.conf via SQL. Which, I maintain, as I have maintained
before, is not going to happen until we get rid of the comments,
because otherwise absolutely any implementation anyone proposes will
get criticized for failing to handle them sensibly (because it is not
possible to rewrite the file while handling the comments sensibly).
What is your source for those numbers? They could be right, but I
simply don't know.
pg_bench tests with asynch rep and standby_delay = 0. Not rigorous, but
enough to show that there is a problem there. Doing pg_bench with a
small database
Interesting.
It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
"not excited" == terrified of the amount of troubleshooting involved,
and likely believing it's impossible.
Quitter! :-)
A further grump about our current architecture is that it doesn't seem
at all clear how to make it work for partial replication. I have to
wonder whether we are going down the wrong path completely and need to
hit the reset button.
The way to do partial replication is Slony, Londiste, Bucardo, etc.
Sure. But we can't forever ignore the fact that trigger-based
replication is not as performant as log-based replication.
But neither this nor the pruning problem are
things that we can reasonably expect the sync rep patch to solve, if
we want it to get committed this release cycle.
It is not, given that I've seen several proposals for synch rep which
would make asynch rep even more complicated than it already is.
I'm not aware of any proposals on the table which would do that.
Standby registration?
No one that I know of has proposed making that mandatory for async
standbys. In fact, Heikki has said just the opposite, and I agree
with him.
Sure, that would be nice to have, and it's a good idea. But I don't
think that's going to be a common failure mode. What I expect to
happen is the standby to hum along with no problem for a long time and
then either kick a disk or suffer a power outage.
That might be more common, but it's not an argument against monitoring
what we *can* monitor for. More importantly, if monitoring ACK response
times -- and similar metrics -- is not available via core postgres, it
is impossible to find them out any other way. We need to give DBAs the
tools to do their jobs, even if the tools are at a very low level.
No argument.
No, it isn't at all. What does your application do NOW if the master
goes down after you've sent a commit and before you get an
acknowledgment back? Does it assume that the transaction is
committed, or does it assume the transaction was aborted by a crash on
the master? Either is possible, right?
This problem certainly exists with async, it's just less likely so
people are ignoring it. With a high enough transaction rate, and a
standby in "apply" mode, it's *certain* to happen on synch rep. So we
can't ignore it as a problem anymore.
It exists with no replication at all...
I don't have any brilliant ideas on a solution for this one.
Right, well, the world is fundamentally asynchronous. In practice,
it's not that hard to write application-dependent logic to handle this
when it matters. Your transaction can, e.g. start by inserting a UUID
in a table somewhere. Then if a crash occurs you probe and see if the
UUID associated with that transaction is there, or not. Or depending
on what your transaction does, there may be some more natural
identifier you can use (e.g. if you are inserting an order into an
orders table, you can look and see if the order number you thought you
created is there).
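The probe pattern Robert describes can be made concrete. A minimal sketch, assuming a client-generated UUID as the transaction tag; the table and function names are illustrative, not from any patch, and sqlite3 stands in for the remote server so the sketch stays runnable:

```python
import sqlite3
import uuid

# Sketch of the idea: tag each transaction with a client-generated UUID
# so that, when a crash eats the commit acknowledgment, the client can
# probe whether the transaction actually committed instead of guessing.
# sqlite3 is a stand-in for the real database connection.

def place_order(conn, item):
    txn_id = str(uuid.uuid4())  # generated client-side, before COMMIT
    conn.execute("INSERT INTO orders (txn_id, item) VALUES (?, ?)",
                 (txn_id, item))
    conn.commit()  # the ack for this commit is what a crash can lose
    return txn_id

def did_commit(conn, txn_id):
    # After reconnecting: is my UUID there? Committed. Absent? Aborted.
    row = conn.execute("SELECT 1 FROM orders WHERE txn_id = ?",
                       (txn_id,)).fetchone()
    return row is not None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (txn_id TEXT PRIMARY KEY, item TEXT)")
order = place_order(conn, "widget")
print(did_commit(conn, order))              # True
print(did_commit(conn, str(uuid.uuid4())))  # False
```

As Robert notes, a natural business key (an order number) can replace the synthetic UUID when the schema already provides one.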
So, your opinion is "it's out of scope to handle this issue" ?
What handling of it would you propose? Consider the case where you
I was asking a question. My original question was "do we need to handle
this?" I'm taking your viewpoint as "there's no reasonable way to
handle it, so we shouldn't." That's a fine answer. What I want is for
-hackers to make a *decision* about a very real problem, and not just
fail to discuss it.
Yes, that's my viewpoint. I think both this and the previous item are
pretty basic database theory - and you can probably read about them in
any good database theory textbook.
I agree, but it's not something we can address in the first patch,
which is hard enough without adding things that make it even harder.
We need to get something simple committed first and then build on it.
The reason I posted the start of this thread is that I know that both
Fujii and Simon have thought about some of these questions, and even if
they don't have code for them, they have ideas. I want to read those
ideas explained. Further, the answers to these questions may tell the
rest of us which parts of each patch are the most valuable.
Fair enough. I was just replying because (1) nobody else had and (2)
I thought it might help to try to separate out which of the issues you
raised are most relevant to the patches at hand.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 11, 2010 at 7:16 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Mon, 2010-10-11 at 16:07 -0400, Robert Haas wrote:
I was initially taken aback by the word "useless" as well. However, I
had trouble thinking of a use case that isn't better solved by sync rep
without HS, or async rep. I don't have the numbers either though, so
perhaps someone does have a use case.
The main use cases for synchronous replication seem to be (1) high
availability and (2) read scalability. That is, if you have 99%
writes and 1% reads, you can round-robin the reads and do all the
writes on the master. But I think we are quite a way from making (2)
work well enough to get excited about.
[ I assume you meant "99% reads and 1% writes" ]
Oops, yes.
Wouldn't the snapshot publication (as Josh called it) back to the master
work better for that use case?
Well, that would help make it more useful. Of course then bloat on
any machine will bloat the entire cluster...
I'm not even sure that it's the ratio that matters, but rather how
constant the writes are. 1% writes does not necessarily mean that a
random 1% of read queries fail on the standby. I don't have the numbers,
but SR + query cancel seems like the standby system would effectively be
down during write activity. I wouldn't be surprised if SR + query cancel
resulted in some frustrated users; but perhaps "useless" is too strong a
word.
Yeah.
It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
That doesn't seem just a matter of code, it seems like a major design
conflict.
Yes. I had the idea of trying to fix this by allowing the standby to
retain old versions of entire pages that got cleaned up on the master,
until the transactions that might want to read the old pages were
gone. But that may be prohibitively difficult, not sure.
I think you'd end up having a notion of a snapshot of block information
(like a FS with snapshots) inside of postgres.
Yep.
Sounds like a lot of complexity to me, and the only benefit I see is
moving bloat from the primary to the standby. Granted, that would be
nice, but I would expect some costs aside from just the complexity.
The standby is bloated either way, but you avoid propagating that
bloat back to the master. It's particularly pernicious if you have a
master and 17 standbys. Now any single standby with a long running
query bloats all 18 machines. Not awesome.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
That's probably not going to happen until we have a way to update
postgresql.conf via SQL. Which, I maintain, as I have maintained
before, is not going to happen until we get rid of the comments,
because otherwise absolutely any implementation anyone proposes will
get criticized for failing to handle them sensibly (because it is not
possible to rewrite the file while handling the comments sensibly).
Well, you know my viewpoint on that. I'm still bitter about the .conf
directory patch.
What is your source for those numbers? They could be right, but I
simply don't know.
pg_bench tests with asynch rep and standby_delay = 0. Not rigorous, but
enough to show that there is a problem there. Doing pg_bench with a
small database
Interesting.
Yeah, it occurs to me that we can "fix" this with cleanup_delay on the
master, but that's a much worse solution than XID publication from the
standby. It means bloat *all* the time instead of just some of the time.
I think we have Yet Another Knob here: users whose standby is
essentially idle will NOT want XID publication, and users whose standby
is for load-balancing will.
Note that XID publication needn't mean extraordinary extra bloat
provided that the queries on the standby are short-lived. Long-running
reporting queries, of course, should be run on an asynch standby.
Sure. But we can't forever ignore the fact that trigger-based
replication is not as performant as log-based replication.
Watch me. ;-)
This problem certainly exists with async, it's just less likely so
people are ignoring it. With a high enough transaction rate, and a
standby in "apply" mode, it's *certain* to happen on synch rep. So we
can't ignore it as a problem anymore.
It exists with no replication at all...
Oh, I see what you mean, now. Point taken.
Yes, that's my viewpoint. I think both this and the previous item are
pretty basic database theory - and you can probably read about them in
any good database theory textbook.
I doubt anything we're doing here is covering new ground, really.
Binary log-based replication has been around for a while in proprietary
products. We just don't necessarily want to make the *same* tradeoffs
other DBs have.
Fair enough. I was just replying because (1) nobody else had and (2)
I thought it might help to try to separate out which of the issues you
raised are most relevant to the patches at hand.
Sure. Simon is apparently on vacation right now. Not sure where Fujii is.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Oct 11, 2010 at 9:29 PM, Josh Berkus <josh@agliodbs.com> wrote:
What is your source for those numbers? They could be right, but I
simply don't know.
pg_bench tests with asynch rep and standby_delay = 0. Not rigorous, but
enough to show that there is a problem there. Doing pg_bench with a
small database
Interesting.
Yeah, it occurs to me that we can "fix" this with cleanup_delay on the
master, but that's a much worse solution than XID publication from the
standby. It means bloat *all* the time instead of just some of the time.
Yeah, that's worse, I think.
I think we have Yet Another Knob here: users whose standby is
essentially idle will NOT want XID publication, and users whose standby
is for load-balancing will.
There probably is a knob, but XID publication ought to be basically
free on an idle standby, so the real trade-off is between query
cancellation or replay delay on the standby, vs. cluster-wide bloat.
Sure. But we can't forever ignore the fact that trigger-based
replication is not as performant as log-based replication.
Watch me. ;-)
s/can't/shouldn't/ ?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 11, 2010 at 5:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
That's probably not going to happen until we have a way to update
postgresql.conf via SQL. Which, I maintain, as I have maintained
before, is not going to happen until we get rid of the comments,
because otherwise absolutely any implementation anyone proposes will
get criticized for failing to handle them sensibly (because it is not
possible to rewrite the file while handling the comments sensibly).
So we've been over this. All the pieces you need are already there:
you can handle this without any nasty comment grunging by just writing
the new setting to a postgresql.auto and including that from
postgresql.conf. Include a note in postgresql.auto warning users any
changes in this file will be thrown away when the file is rewritten.
This is the same method used in .emacs.custom or a million other
places people wanted automatically written config files.
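Concretely, the scheme Greg describes might look like the fragment below. The `postgresql.auto` name is his; the settings shown are only illustrative. `include` is an existing postgresql.conf directive, and since PostgreSQL honors the last occurrence of a setting it reads, the include goes at the end so machine-written values win:

```
# postgresql.conf -- hand-edited, comments preserved, never rewritten
shared_buffers = 512MB          # tuned by the DBA
include 'postgresql.auto'       # read last, so its settings override

# postgresql.auto -- rewritten wholesale by tools; do not hand-edit.
# Any changes in this file are thrown away when it is rewritten.
work_mem = '64MB'
```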
Also incidentally pgadmin currently *does* rewrite postgresql.conf
while keeping the comments. I think that's not such a hot idea because
it interferes with things like debian configuration file management
and so on, but it's not impossible to do. It's just that separating
automatically written files from user-editable files is a better
long-term plan.
--
greg
On Mon, Oct 11, 2010 at 9:44 PM, Greg Stark <gsstark@mit.edu> wrote:
On Mon, Oct 11, 2010 at 5:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
That's probably not going to happen until we have a way to update
postgresql.conf via SQL. Which, I maintain, as I have maintained
before, is not going to happen until we get rid of the comments,
because otherwise absolutely any implementation anyone proposes will
get criticized for failing to handle them sensibly (because it is not
possible to rewrite the file while handling the comments sensibly).
So we've been over this. All the pieces you need are already there:
you can handle this without any nasty comment grunging by just writing
the new setting to a postgresql.auto and including that from
postgresql.conf. Include a note in postgresql.auto warning users any
changes in this file will be thrown away when the file is rewritten.
This is the same method used in .emacs.custom or a million other
places people wanted automatically written config files.
It seems that someone could have the following complaint: a setting
configured in whichever file gets read first could potentially be
ignored if it's also set in whichever file gets read second. And that
might be confusing.
Still, maybe we should just ignore that problem and do it anyway.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 11, 2010 at 7:16 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Mon, 2010-10-11 at 16:07 -0400, Robert Haas wrote:
It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
That doesn't seem just a matter of code, it seems like a major design
conflict.
Yes. I had the idea of trying to fix this by allowing the standby to
retain old versions of entire pages that got cleaned up on the master,
until the transactions that might want to read the old pages were
gone. But that may be prohibitively difficult, not sure.
I think you'd end up having a notion of a snapshot of block information
(like a FS with snapshots) inside of postgres.
Sounds like a lot of complexity to me, and the only benefit I see is
moving bloat from the primary to the standby. Granted, that would be
nice, but I would expect some costs aside from just the complexity.
I had this idea when HS was being initially implemented, and I saw two
problems with it.
.) Identifying block versions in buffers. Our XID based MVCC like scheme
would work, but we already have a block header which we would want to trim.
.) Effectively every block cleanup will have to create a copy of the block
because more often than not there's at least one transaction in the system,
and when it asks, we need to hand it the old version.
Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com
singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet
Mail sent from my BlackLaptop device
On Mon, Oct 11, 2010 at 10:17 PM, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:
Yes. I had the idea of trying to fix this by allowing the standby to
retain old versions of entire pages that got cleaned up on the master,
until the transactions that might want to read the old pages were
gone. But that may be prohibitively difficult, not sure.
I think you'd end up having a notion of a snapshot of block information
(like a FS with snapshots) inside of postgres.
Sounds like a lot of complexity to me, and the only benefit I see is
moving bloat from the primary to the standby. Granted, that would be
nice, but I would expect some costs aside from just the complexity.
I had this idea when HS was being initially implemented, and I saw two
problems with it.
.) Identifying block versions in buffers. Our XID based MVCC like scheme
would work, but we already have a block header which we would want to trim.
I was thinking of stuffing the old versions of the blocks into a
relation fork... and then having some kind of mapping... thing...
.) Effectively every block cleanup will have to create a copy of the block
because more often than not there's at least one transaction in the system,
and when it asks, we need to hand it the old version.
This is probably true. And then, of course, you need the equivalent
of VACUUM to get rid of the old blocks, which kind of sucks.
Hence my earlier comments about being unsure that physical replication
is the way to go.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
That's probably not going to happen until we have a way to update
postgresql.conf via SQL. Which, I maintain, as I have maintained
before, is not going to happen until we get rid of the comments,
because otherwise absolutely any implementation anyone proposes will
get criticized for failing to handle them sensibly (because it is not
possible to rewrite the file while handling the comments sensibly).
I'm sorry but you don't need to get rid of the comments *at all* to have
the facility to update configuration from SQL.
What you need to do is make a choice: the setup is either human
friendly or SQL friendly. You will lose something that you now have in
order to be able to use SQL commands to edit the setup. Re-read. Ready now?
So, what you do is have a file per GUC, file name is the GUC name, first
line contains *ONLY* current value, the rest of the file is comments.
Note that it's still possible to hand-edit your setup. It's not even
hard to do. If you prefer a single file without comments, it's even
trivial to come up with a script that prepares the machine friendly
layout.
And it's even possible to offer DBAs a choice here: if postgresql.conf
exists, it's used in the old way and the SQL command only issues errors,
if there's a postgresql.conf.d and no postgresql.conf, the SQL commands
are working fine. Add an option for initdb to choose one layout or the
other.
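The mechanics of Dimitri's layout are simple enough to sketch. This is hypothetical (no such feature exists); it only demonstrates that with a one-file-per-GUC directory, a machine rewrite can replace the value line while leaving the comment lines untouched:

```python
import os
import tempfile

# Sketch of the proposed layout: a postgresql.conf.d directory with one
# file per GUC.  The first line of each file is *only* the current
# value; everything after it is free-form comment.

def set_guc(confdir, name, value):
    path = os.path.join(confdir, name)
    comments = ""
    if os.path.exists(path):
        with open(path) as f:
            parts = f.read().split("\n", 1)  # first line vs. the rest
            if len(parts) == 2:
                comments = parts[1]
    with open(path, "w") as f:
        f.write(value + "\n" + comments)     # replace value, keep comments

def get_guc(confdir, name):
    with open(os.path.join(confdir, name)) as f:
        return f.readline().rstrip("\n")

confdir = tempfile.mkdtemp(prefix="postgresql.conf.d.")
set_guc(confdir, "work_mem", "64MB")
with open(os.path.join(confdir, "work_mem"), "a") as f:
    f.write("# raised from 4MB for the nightly reports\n")
set_guc(confdir, "work_mem", "128MB")  # SQL-driven edit: comment survives
print(get_guc(confdir, "work_mem"))    # 128MB
```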
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 10/12/2010 05:02 PM, Dimitri Fontaine wrote:
So, what you do is have a file per GUC, file name is the GUC name, first
line contains *ONLY* current value, the rest of the file is comments.
You're joking, right?
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 10/12/2010 05:02 PM, Dimitri Fontaine wrote:
So, what you do is have a file per GUC, file name is the GUC name, first
line contains *ONLY* current value, the rest of the file is comments.
You're joking, right?
No. I just want both comments and SQL commands. If you refuse this
simple file scheme, keep your postgresql.conf and don't remote edit it.
That's my proposal, I'm happy that it comes with laughter :)
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Dimitri Fontaine wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 10/12/2010 05:02 PM, Dimitri Fontaine wrote:
So, what you do is have a file per GUC, file name is the GUC name, first
line contains *ONLY* current value, the rest of the file is comments.
You're joking, right?
No. I just want both comments and SQL commands. If you refuse this
simple file scheme, keep your postgresql.conf and don't remote edit it.
That's my proposal, I'm happy that it comes with laughter :)
Maybe I missed something important, but why is it not possible to retain the
single existing postgres.conf file format (human writable) *and* have it
machine/SQL-editable *and* maintain the comments? I should think that it would
be possible to do all of these without too much trouble. All you would need is
for the file parser to retain the comments as metadata, include them in the
relations that the SQL commands see where the latter can also edit them as data,
and then write out the updated file with comments. The fact that Postgres
already explicitly supports comment metadata in its system catalog means it must
already know something about this. If something is missing, then expand the
catalog so it represents all the details you want to preserve.
-- Darren Duncan
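Darren's comments-as-metadata idea can also be sketched. A hypothetical illustration, not Postgres's actual parser: each settings line is split into (name, value, trailing comment), an edit rewrites only the value field, and comment-only lines pass through untouched. Quoted values containing '#' would need real tokenizing, which this sketch ignores:

```python
import re

# Parse "name = value   # comment" lines, keeping the comment as
# metadata so a rewrite preserves it exactly.

SAMPLE = """\
# - Memory -
shared_buffers = 128MB   # min 128kB
work_mem = 4MB
"""

LINE = re.compile(r"^(\s*)(\w+)(\s*=\s*)([^#]*?)(\s*)(#.*)?$")

def set_value(conf_text, name, new_value):
    out = []
    for line in conf_text.splitlines():
        m = LINE.match(line)
        if m and m.group(2) == name:
            prefix, _, eq, _, gap, comment = m.groups()
            out.append(prefix + name + eq + new_value + gap + (comment or ""))
        else:
            out.append(line)  # comment-only and blank lines pass through
    return "\n".join(out) + "\n"

updated = set_value(SAMPLE, "shared_buffers", "512MB")
print(updated, end="")
```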