PostgreSQL Synchronous Replication in production

Started by Colin Salmost 13 years ago7 messagesgeneral
Jump to latest
#1Colin S
colin_sloss@hotmail.com

Hello,

I have been testing the differences between asynchronous and synchronous hot standby streaming replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding some serious problems, and wonder how other people deal with them.
Action:The Slave is unreachable (postgres is stopped, or machine is turned off)
Result: Transactions stay open, and add up until connection is allowed between Master and Slave again.
My guess:Create a script that detects when transactions are being opened, but not committed on slave. Alter postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the server back to asynchronous, more or less.
Caveats:I haven't tested this yet, and I'm not sure what would happen to the transactions that started while synchronous replication was active. Any guesses?
The whole idea of my solution was to have no single point of failure. This seems to create two exclusive points of failure, each needing a completely separate reaction. My original proposal was asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which still results in a very speedy failover. Is it just me, or is that seeming better than just synchronous replication?
Another caveat I found is that setting up slaves becomes more complicated. You have to set up the Master in asynchronous style, and then switch it to synchronous only when the timing is right. Otherwise the transactions will sit there until everything is ready.
Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any resolutions to these issues which you guys have discovered.
Regards,
Colin

#2Richard Huxton
dev@archonet.com
In reply to: Colin S (#1)
Re: PostgreSQL Synchronous Replication in production

On 06/06/13 11:20, Colin Sloss wrote:

I have been testing the differences between asynchronous and synchronous
hot standby streaming replication on PostgreSQL 9.2.4. There is some
push towards synchronous replication, but I am finding some serious
problems, and wonder how other people deal with them.

[snip]

The whole idea of my solution was to have no single point of failure.
This seems to create two exclusive points of failure, each needing a
completely separate reaction.

Synchronous replication provides a higher level of guarantee for an
individual transaction (it's safely[1]For various values of "safely" of course on at least two boxes now) at the
cost of making the system as a whole more brittle.

Your uptime as a "service" will inevitably be reduced since in the event
of problems talking to the slave the master will *have* to delay/cancel
new transactions.

I have seen people suggest some sort of mode where the server drops back
to asynch mode in the event of problems. I can't quite understand the
use-case for that though - either you want synchronous replication or
you don't. Mostly-synchronous is just asynchronous.

Here's a few questions. How you answer them will decide whether you
really want synchronous replication or not:
1. The link between servers encounters network congestion
a. The whole system should slow down.
Committed transactions should ALWAYS be on
two geographically separate machines.
b. An alert should be sent.
If it's not sorted in 5 mins we'll get someone to look at it.
2. Adding more servers[2]In the same mode - adding async slaves doesn't count to my replication should:
a. Make the system as a whole slower[3]Assuming a reasonable write load of course. Read-only databases won't care. and reduce uptime
but increase the safety of committed transactions
b. Make the system as a whole faster and increase uptime

There are cases where you want (a), but lots where you want (b) and
monitor the replication lag.

[1]: For various values of "safely" of course
[2]: In the same mode - adding async slaves doesn't count
[3]: Assuming a reasonable write load of course. Read-only databases won't care.
won't care.

--
Richard Huxton
Archonet Ltd

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Colin S
colin_sloss@hotmail.com
In reply to: Richard Huxton (#2)
Re: PostgreSQL Synchronous Replication in production

Here's a few questions. How you answer them will decide whether you
really want synchronous replication or not:
1. The link between servers encounters network congestion
a. The whole system should slow down.
Committed transactions should ALWAYS be on
two geographically separate machines.
b. An alert should be sent.
If it's not sorted in 5 mins we'll get someone to look at it.
2. Adding more servers[2] to my replication should:
a. Make the system as a whole slower[3] and reduce uptime
but increase the safety of committed transactions
b. Make the system as a whole faster and increase uptime

There are cases where you want (a), but lots where you want (b) and
monitor the replication lag.

[1] For various values of "safely" of course
[2] In the same mode - adding async slaves doesn't count
[3] Assuming a reasonable write load of course. Read-only databases
won't care.

--
Richard Huxton
Archonet Ltd

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Hello,
Thanks for your answer. I find it very interesting that you say that synchronous setups should always be in two geographically separate locations. In this case they are on the same subnet. Adding the lag of committing to two, geographically separate, databases is not feasible for this OLTP application.
I also like your point that "mostly synchronous is just asynchronous." So, responding by switching to asynchronous as a response to slow-down is asynchronous anyway.
Any other comments, or examples, of when synchronous is worth implementing would be greatly appreciated.
Regards,
Colin

#4Richard Huxton
dev@archonet.com
In reply to: Colin S (#3)
Re: PostgreSQL Synchronous Replication in production

On 06/06/13 12:48, Colin S wrote:

Thanks for your answer. I find it very interesting that you say that
synchronous setups should always be in two geographically separate
locations. In this case they are on the same subnet. Adding the lag of
committing to two, geographically separate, databases is not feasible
for this OLTP application.

Well, if they're in the same building(s) then your transactions are all
at the same risk from fire/earthquake/godzilla etc. Might/might not be
important to you.

I also like your point that "mostly synchronous is just asynchronous."
So, responding by switching to asynchronous as a response to slow-down
is asynchronous anyway.

"Mostly synchronous" is like "a bit pregnant".

Any other comments, or examples, of when synchronous is worth
implementing would be greatly appreciated.

Note that PostgreSQL's synchronous replication just guarantees that the
commit has reached the transaction log on the slave. That doesn't mean
the slave has replayed the transaction log and a query against the slave
will show the transaction's results. So - it doesn't in itself guarantee
that you can see issue read-only queries against either server
indiscriminately.

However, if you really, really need to know that a committed transaction
is on two physically separate sets of disks then synchronous is what you
want. If both sets of disks are in the same building then you might be
able to achieve the same result by other (cheaper/simpler?) means.

If you have a business e.g. selling books or train tickets or some such
then you might decide it's better to have a simpler more robust setup
from the point of view of providing continuous service to end-customers.
In the (hopefully rare) event of a crash irreparably losing some
transactions apologise to your customers and recompense them generously.

For a system handling multi-million pound inter-bank transfers you might
decide it's better to have the system not working at all rather than
have an increased risk of a lost transaction.

Of course in both cases you might well want a separate list/cache of
pending/recently-committed transactions to check against in the event of
a failure.

I believe what you should do from an engineering approach is to treat it
in a similar way to security. What do you want to protect against? Make
a list of possible failures and what they mean to the business/project
and then decide how much time/money to spend protecting against each one.

--
Richard Huxton
Archonet Ltd

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Colin S (#1)
Re: PostgreSQL Synchronous Replication in production

Colin Sloss wrote:

I have been testing the differences between asynchronous and synchronous hot standby streaming
replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding
some serious problems, and wonder how other people deal with them.

Action:
The Slave is unreachable (postgres is stopped, or machine is turned off)

Result:
Transactions stay open, and add up until connection is allowed between Master and Slave again.

My guess:
Create a script that detects when transactions are being opened, but not committed on slave. Alter
postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the
server back to asynchronous, more or less.

Caveats:
I haven't tested this yet, and I'm not sure what would happen to the transactions that started while
synchronous replication was active. Any guesses?

The whole idea of my solution was to have no single point of failure. This seems to create two
exclusive points of failure, each needing a completely separate reaction. My original proposal was
asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent
it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which
still results in a very speedy failover. Is it just me, or is that seeming better than just
synchronous replication?

Another caveat I found is that setting up slaves becomes more complicated. You have to set up the
Master in asynchronous style, and then switch it to synchronous only when the timing is right.
Otherwise the transactions will sit there until everything is ready.

Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any
resolutions to these issues which you guys have discovered.

One simple guideline first:
If you use synchronous replication, you'll have to have at least two standby
servers or the overall availability of your system will suffer.

The differences between synchronous and asynchronous replication are mostly:
a) With synchronous replication you cannot lose a committed transaction
during failover.
b) Synchronous replication will slow down your system; the higher the
network latency between the servers, the slower it will get.

You should work on getting the requirements defined:
- In the case of failover, can you afford to lose a few committed transactions?
- Are you ready to pay the price for synchronous replication
(second standby if you don't want availability to suffer, slower
database system, more complicated setup and failover procedures)?

For synchronous replication, keep the servers close together with
a strong network inbetween.
To protect against catastrophes (if that's a requirement), you should
use another asynchronous standby in a distant location.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Merlin Moncure
mmoncure@gmail.com
In reply to: Laurenz Albe (#5)
Re: PostgreSQL Synchronous Replication in production

On Fri, Jun 7, 2013 at 3:22 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

Colin Sloss wrote:

I have been testing the differences between asynchronous and synchronous hot standby streaming
replication on PostgreSQL 9.2.4. There is some push towards synchronous replication, but I am finding
some serious problems, and wonder how other people deal with them.

Action:
The Slave is unreachable (postgres is stopped, or machine is turned off)

Result:
Transactions stay open, and add up until connection is allowed between Master and Slave again.

My guess:
Create a script that detects when transactions are being opened, but not committed on slave. Alter
postgresql.conf to comment out synchronous_standby_names and pg_ctl reload. This should boot the
server back to asynchronous, more or less.

Caveats:
I haven't tested this yet, and I'm not sure what would happen to the transactions that started while
synchronous replication was active. Any guesses?

The whole idea of my solution was to have no single point of failure. This seems to create two
exclusive points of failure, each needing a completely separate reaction. My original proposal was
asynchronous replication, with xlogs being stored on shared storage, and DRBD replication to prevent
it from being a single point of failure. I have never seen it go over 15kbs behind in my setup, which
still results in a very speedy failover. Is it just me, or is that seeming better than just
synchronous replication?

Another caveat I found is that setting up slaves becomes more complicated. You have to set up the
Master in asynchronous style, and then switch it to synchronous only when the timing is right.
Otherwise the transactions will sit there until everything is ready.

Sorry for the onslaught of questions, I don't expect all of them satisfied. Please share any
resolutions to these issues which you guys have discovered.

One simple guideline first:
If you use synchronous replication, you'll have to have at least two standby
servers or the overall availability of your system will suffer.

The differences between synchronous and asynchronous replication are mostly:
a) With synchronous replication you cannot lose a committed transaction
during failover.
b) Synchronous replication will slow down your system; the higher the
network latency between the servers, the slower it will get.

You should work on getting the requirements defined:
- In the case of failover, can you afford to lose a few committed transactions?
- Are you ready to pay the price for synchronous replication
(second standby if you don't want availability to suffer, slower
database system, more complicated setup and failover procedures)?

For synchronous replication, keep the servers close together with
a strong network inbetween.
To protect against catastrophes (if that's a requirement), you should
use another asynchronous standby in a distant location.

yeah -- well put. Synchronous replication exists because in some
cases even the loss of a single transaction is unacceptable. So it's
not really fair to compare vs asynchronous which is more of a 'best
effort' system; it in no way guarantees that every transaction is
recoverable. That small fudge allows for big optimizations in terms
of process simplicity and performance.

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Jeff Janes
jeff.janes@gmail.com
In reply to: Richard Huxton (#4)
Re: PostgreSQL Synchronous Replication in production

On Thu, Jun 6, 2013 at 5:23 AM, Richard Huxton <dev@archonet.com> wrote:

If you have a business e.g. selling books or train tickets or some such
then you might decide it's better to have a simpler more robust setup from
the point of view of providing continuous service to end-customers. In the
(hopefully rare) event of a crash irreparably losing some transactions
apologise to your customers and recompense them generously.

Unfortunately you probably no longer know who to apologize to, what for, or
how much to compensate them!

I guess when you reconcile your cc credits from the cc company recorded in
their system to the sales in your recovered system, you will have evidence
of the discrepancies.

Now I can't use an ecommerce without pondering all the ways something can
go wrong, and how to minimize/address them.

Cheers,

Jeff