Synchronous Standalone Master Redoux

Started by Shaun Thomas in July 2012 · 52 messages · pgsql-hackers
#1 Shaun Thomas
sthomas@optionshouse.com

Hey everyone,

Upon doing some usability tests with PostgreSQL 9.1 recently, I ran
across this discussion:

http://archives.postgresql.org/pgsql-hackers/2011-12/msg01224.php

And after reading the entire thing, I found it odd that the overriding
pushback was because nobody could think of a use case. The argument was:
if you don't care if the slave dies, why not just use asynchronous
replication?

I'd like to introduce all of you to DRBD. DRBD is, for those who aren't
familiar, distributed (network) block-level replication. Right now, this
is what we're using, and will use in the future, to ensure a stable
synchronous PostgreSQL copy on our backup node. I was excited to read
about synchronous replication, because with it, came the possibility we
could have two readable nodes with the servers we already have. You
can't do that with DRBD; secondary nodes can't even mount the device.

So here's your use case:

1. Slave wants to be synchronous with master. Master wants replication
on at least one slave. They have this, and are happy.
2. For whatever reason, slave crashes or becomes unavailable.
3. Master notices no more slaves are available, and operates in
standalone mode, accumulating WAL files until a suitable slave appears.
4. Slave finishes rebooting/rebuilding/upgrading/whatever, and
re-subscribes to the feed.
5. Slave stays in degraded sync (asynchronous) mode until it is caught
up, and then switches to synchronous. This makes both master and slave
happy, because *intent* of synchronous replication is fulfilled.
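For concreteness, the cycle above can be modeled as a tiny state machine (state and event names here are invented for illustration; this is not anything PostgreSQL or DRBD actually defines):

```shell
# Toy model of the desired degraded/catch-up cycle. Names are made up.
next_state() {
    # $1 = current state, $2 = event
    case "$1:$2" in
        sync:standby_lost)        echo standalone ;;  # steps 2-3: run alone, keep WAL
        standalone:standby_joins) echo catchup    ;;  # step 4: slave re-subscribes
        catchup:caught_up)        echo sync       ;;  # step 5: resume sync rep
        *)                        echo "$1"       ;;  # any other event: no change
    esac
}

next_state sync standby_lost          # -> standalone
next_state standalone standby_joins   # -> catchup
next_state catchup caught_up          # -> sync
```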

PostgreSQL's implementation means the master will block until
someone/something notices and tells it to stop waiting, or the slave
comes back. For pretty much any high-availability environment, this is
not viable. Based on that alone, I can't imagine a scenario where
synchronous replication would be considered beneficial.

The current setup doubles unplanned system outage scenarios in such a
way I'd never use it in a production environment. Right now, we only
care if the master server dies. With sync rep, we'd have to watch both
servers like a hawk and be ready to tell the master to disable sync rep,
lest our 10k TPS system come to an absolute halt because the slave died.

With DRBD, when a slave node goes offline, the master operates in
standalone until the secondary re-appears, after which it
re-synchronizes missing data, and then operates in sync mode afterwards.
Just because the data is temporarily out of sync does *not* mean we want
asynchronous replication. I think you'd be hard pressed to find many
users taking advantage of DRBD's async mode. Just because data is
temporarily catching up, doesn't mean it will remain in that state.

I would *love* to have the functionality discussed in the patch. If I
can make a case for it, I might even be able to convince my company to
sponsor its addition, provided someone has time to integrate it. Right
now, we're using DRBD so we can have a very short outage window while
the offline node gets promoted, and it works, but that means a basically
idle server at all times. I'd gladly accept a 10-20% performance hit for
sync rep if it meant that other server could reliably act as a read
slave. That's currently impossible because async replication is too
slow, and sync is too fragile for reasons stated above.

Am I totally off-base, here? I was shocked when I actually read the
documentation on how sync rep worked, and saw that no servers would
function properly until at least two were online.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email

#2 Josh Berkus
josh@agliodbs.com
In reply to: Shaun Thomas (#1)
Re: Synchronous Standalone Master Redoux

Shaun,

> PostgreSQL's implementation means the master will block until
> someone/something notices and tells it to stop waiting, or the slave
> comes back. For pretty much any high-availability environment, this is
> not viable. Based on that alone, I can't imagine a scenario where
> synchronous replication would be considered beneficial.

So there's an issue with the definition of "synchronous". What
"synchronous" in "synchronous replication" means is "guarantee zero data
loss or fail the transaction". It does NOT mean "master and slave have
the same transactional data at the same time", as much as that would be
great to have.

There are, indeed, systems where you'd rather shut down the system than
accept writes which were not replicated, or we wouldn't have the
feature. That just doesn't happen to fit your needs (nor, indeed, the
needs of most people who think they want SR).

"Total-consistency" replication is what I think you want, that is, to
guarantee that at any given time a read query on the master will return
the same results as a read query on the standby. Heck, *most* people
would like to have that. You would also be advancing database science
in general if you could come up with a way to implement it.

> slave. That's currently impossible because async replication is too
> slow, and sync is too fragile for reasons stated above.

So I'm unclear on why sync rep would be faster than async rep given that
they use exactly the same mechanism. Explain?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#3 Daniel Farina
daniel@heroku.com
In reply to: Shaun Thomas (#1)
Re: Synchronous Standalone Master Redoux

On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:

> 1. Slave wants to be synchronous with master. Master wants replication on at least one slave. They have this, and are happy.
> 2. For whatever reason, slave crashes or becomes unavailable.
> 3. Master notices no more slaves are available, and operates in standalone mode, accumulating WAL files until a suitable slave appears.
> 4. Slave finishes rebooting/rebuilding/upgrading/whatever, and re-subscribes to the feed.
> 5. Slave stays in degraded sync (asynchronous) mode until it is caught up, and then switches to synchronous. This makes both master and slave happy, because *intent* of synchronous replication is fulfilled.

So if I get this straight, what you are saying is "be asynchronous
replication unless someone is around, in which case be synchronous" is
the mode you want. I think if your goal is zero-transaction loss then
you would want to rethink this, and that was the goal of SR: two
copies, no matter what, before COMMIT returns from the primary.

However, I think there is something you are stating here that has a
finer point on it: right now, there is no graceful way to attenuate
the speed of commit on a primary to ensure bounded lag of an
*asynchronous* standby. This is a pretty tricky definition: consider
if you bring a standby on-line from archive replay and it shows up in
streaming with pretty high lag, and stops all commit traffic while it
reaches the bounded window of what "acceptable" lag is. That sounds
pretty terrible, too. How does DRBD handle this? It seems like the
catchup phase might be interesting prior art.

On first inspection, the best I can come up with is something like "if
the standby is making progress but fails to make progress toward
convergence, attenuate the primary's speed of COMMIT until convergence
is projected to occur within a bounded time" or something like that.

Relatedly, this touches one of the ugliest problems I
have with continuous archiving: there is no graceful way to attenuate
the speed of operations to prevent backlog that can fill up the disk
containing pg_xlog. It also makes it very hard to very strictly bound
the amount of data that can remain outstanding and unarchived. To get
around this, I was planning on very carefully making use of the status
messages supplied that inform synchronous replication to block and
unblock operations, but perhaps a less strained interface is possible
with some kind of cooperation from Postgres.

--
fdr

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Daniel Farina (#3)
Re: Synchronous Standalone Master Redoux

> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Daniel Farina
> Sent: Tuesday, July 10, 2012 11:42 AM
>
> On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas@optionshouse.com>
> wrote:
>
>> 1. Slave wants to be synchronous with master. Master wants replication on
>> at least one slave. They have this, and are happy.
>> 2. For whatever reason, slave crashes or becomes unavailable.
>> 3. Master notices no more slaves are available, and operates in
>> standalone mode, accumulating WAL files until a suitable slave appears.
>> 4. Slave finishes rebooting/rebuilding/upgrading/whatever, and
>> re-subscribes to the feed.
>> 5. Slave stays in degraded sync (asynchronous) mode until it is caught
>> up, and then switches to synchronous. This makes both master and slave
>> happy, because *intent* of synchronous replication is fulfilled.
>
> So if I get this straight, what you are saying is "be asynchronous
> replication unless someone is around, in which case be synchronous" is
> the mode you want. I think if your goal is zero-transaction loss then
> you would want to rethink this, and that was the goal of SR: two
> copies, no matter what, before COMMIT returns from the primary.

For such cases, could an option be provided so that the user can change
the mode to async?

#5 Magnus Hagander
magnus@hagander.net
In reply to: Shaun Thomas (#1)
Re: Synchronous Standalone Master Redoux

On Tue, Jul 10, 2012 at 8:42 AM, Amit Kapila <amit.kapila@huawei.com> wrote:

> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Daniel Farina
> Sent: Tuesday, July 10, 2012 11:42 AM
>
>> On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas@optionshouse.com>
>> wrote:
>>
>>> 1. Slave wants to be synchronous with master. Master wants replication on
>>> at least one slave. They have this, and are happy.
>>> 2. For whatever reason, slave crashes or becomes unavailable.
>>> 3. Master notices no more slaves are available, and operates in
>>> standalone mode, accumulating WAL files until a suitable slave appears.
>>> 4. Slave finishes rebooting/rebuilding/upgrading/whatever, and
>>> re-subscribes to the feed.
>>> 5. Slave stays in degraded sync (asynchronous) mode until it is caught
>>> up, and then switches to synchronous. This makes both master and slave
>>> happy, because *intent* of synchronous replication is fulfilled.
>>
>> So if I get this straight, what you are saying is "be asynchronous
>> replication unless someone is around, in which case be synchronous" is
>> the mode you want. I think if your goal is zero-transaction loss then
>> you would want to rethink this, and that was the goal of SR: two
>> copies, no matter what, before COMMIT returns from the primary.
>
> For such cases, could an option be provided so that the user can change
> the mode to async?

You can already change synchronous_standby_names, and do so without a
restart. That will change between sync and async just fine on a live
system. And you can control that from some external monitor to define
your own rules for exactly when it should drop to async mode.
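To make that concrete, here is a minimal sketch of such an external switch. The conf path and standby name are assumptions, and since `ALTER SYSTEM` did not exist in the 9.1/9.2 era, the file is edited directly and the server reloaded:

```shell
# Drop to async on a live master: clear synchronous_standby_names and
# reload -- no restart required. Path and standby name are assumptions.
PGCONF=/var/lib/pgsql/data/postgresql.conf
sed -i "s/^#\{0,1\}synchronous_standby_names.*/synchronous_standby_names = ''/" "$PGCONF"
psql -U postgres -Atc "SELECT pg_reload_conf();"

# Later, once the standby has caught up, restore sync mode the same way:
sed -i "s/^synchronous_standby_names.*/synchronous_standby_names = 'standby1'/" "$PGCONF"
psql -U postgres -Atc "SELECT pg_reload_conf();"
```

The point is only that the switch is a reload, not a restart; when to flip it is left to whatever monitor you trust.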

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#6 Shaun Thomas
sthomas@optionshouse.com
In reply to: Daniel Farina (#3)
Re: Synchronous Standalone Master Redoux

On 07/10/2012 01:11 AM, Daniel Farina wrote:

> So if I get this straight, what you are saying is "be asynchronous
> replication unless someone is around, in which case be synchronous"
> is the mode you want.

Er, no. I think I see where you might have gotten that, but no.

> This is a pretty tricky definition: consider if you bring a standby
> on-line from archive replay and it shows up in streaming with pretty
> high lag, and stops all commit traffic while it reaches the bounded
> window of what "acceptable" lag is. That sounds pretty terrible, too.
> How does DRBD handle this? It seems like the catchup phase might be
> interesting prior art.

Well, DRBD actually has a very definitive sync mode, and no
"attenuation" is involved at all. Here's what a fully working cluster
looks like, according to /proc/drbd:

cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate

Here's what happens when I disconnect the secondary:

cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown

So there are a few things here:

1. Primary is waiting for the secondary to reconnect.
2. It knows its own data is still up to date.
3. It's waiting to assess the secondary when it re-appears.
4. It's still capable of writing to the device.

This is more akin to degraded RAID-1. Writes are synchronous as long as
two devices exist, but if one vanishes, you can still use the disk at
your own risk. Checking the status of DRBD will show this readily. I
also want to point out it is *fully* synchronous when both nodes are
available. I.e., you can't even call a filesystem sync without the sync
succeeding on both nodes.

When you re-connect a secondary device, it catches up as fast as
possible by replaying waiting transactions, and then re-attaching to the
cluster. Until it's fully caught-up, it doesn't exist. DRBD acknowledges
the secondary is there and attempting to catch up, but does not leave
"degraded" mode until the secondary reaches "UpToDate" status.
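For illustration only, a monitor watching for that "degraded" window could parse the `cs:` field out of /proc/drbd; the sample lines below are the status lines shown earlier in this message:

```shell
# Extract the connection-state (cs:) field from a /proc/drbd status line.
drbd_cstate() {
    printf '%s\n' "$1" | grep -o 'cs:[A-Za-z]*' | cut -d: -f2
}

# On a live node you would read the real file:
#   drbd_cstate "$(grep 'cs:' /proc/drbd)"
drbd_cstate "cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate"
drbd_cstate "cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown"
```

Anything other than `Connected` (e.g. `WFConnection`, `SyncTarget`) signals the degraded/catch-up phases described above.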

This is a much more graceful failure scenario than is currently possible
with PostgreSQL. With DRBD, you'd still need a tool to notice the master
node is in an invalid state and perform a failover, but the secondary
going belly-up will not suddenly halt the master.

But I'm not even hoping for *that* level of functionality. I just want
to be able to tell PostgreSQL to notice when the secondary becomes
unavailable *on its own*, and then perform in "degraded non-sync mode"
because it's much faster than any monitor I can possibly attach to
perform the same function. I plan on using DRBD until either PG can do
that, or a better alternative presents itself.

Async is simply too slow for our OLTP system except for the disaster
recovery node, which isn't expected to carry on within seconds of the
primary's failure. I briefly considered sync mode when it appeared as a
feature, but I see it's still too early in its development cycle,
because there are no degraded operation modes. That's fine, I'm willing
to wait.

I just don't understand the push-back, I guess. RAID-1 is the poster
child for synchronous writes for fault tolerance. It will whine
constantly to anyone who will listen when operating only on one device,
but at least it still works. I'm pretty sure nobody would use RAID-1 if
its failure mode was: block writes until someone installs a replacement
disk.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


#7 Aidan Van Dyk
aidan@highrise.ca
In reply to: Shaun Thomas (#6)
Re: Synchronous Standalone Master Redoux

On Tue, Jul 10, 2012 at 9:28 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:

> Async is simply too slow for our OLTP system except for the disaster
> recovery node, which isn't expected to carry on within seconds of the
> primary's failure. I briefly considered sync mode when it appeared as a
> feature, but I see it's still too early in its development cycle, because
> there are no degraded operation modes. That's fine, I'm willing to wait.

But this is where some of us are confused by what you're asking for.
Async is actually *FASTER* than sync; it has less overhead.
Synchronous replication is basically async replication with extra
overhead: an artificial delay on the master before the commit
*RETURNS* to the client. The data is still committed and viewable to
new queries on the master, and on the slave, at the same rate as with
async replication. Only the commit status returned to the client
is delayed.

So the "async is too slow" is what we don't understand.

> I just don't understand the push-back, I guess. RAID-1 is the poster child
> for synchronous writes for fault tolerance. It will whine constantly to
> anyone who will listen when operating only on one device, but at least it
> still works. I'm pretty sure nobody would use RAID-1 if its failure mode
> was: block writes until someone installs a replacement disk.

I think most of us in the "synchronous replication must be synchronous
replication" camp are there because the guarantees of a simple RAID-1
just aren't good enough for us ;-)

a.

--
Aidan Van Dyk
aidan@highrise.ca
http://www.highrise.ca/
"Create like a god, command like a king, work like a slave."

#8 Shaun Thomas
sthomas@optionshouse.com
In reply to: Josh Berkus (#2)
Re: Synchronous Standalone Master Redoux

On 07/09/2012 05:15 PM, Josh Berkus wrote:

> "Total-consistency" replication is what I think you want, that is, to
> guarantee that at any given time a read query on the master will return
> the same results as a read query on the standby. Heck, *most* people
> would like to have that. You would also be advancing database science
> in general if you could come up with a way to implement it.

Doesn't having consistent transactional state across the systems imply that?

> So I'm unclear on why sync rep would be faster than async rep given
> that they use exactly the same mechanism. Explain?

Too many mental gymnastics. I get that async is "faster" than sync, but
the inconsistent transactional state makes it *look* slower. If a
customer makes an order, but just happens to check that order state on
the secondary before it can catch up, that's a net loss. Like I said,
that's fine for our DR system, or a reporting mirror, or any one of
several use-case scenarios, but it's not good enough for a failover when
better alternatives exist. In this case, better alternatives are
anything that can guarantee transaction durability: DRBD / PG sync.

PG sync mode does what I want in that regard, it just has no graceful
failure state without relatively invasive intervention. Theoretically we
could write a Pacemaker agent, or some other simple harness, that just
monitors both servers and performs an LSB HUP after modifying the
primary node to disable synchronous_standby_names if the secondary dies,
or promotes the secondary if the primary dies. But after being spoiled
by DRBD knowing the instant the secondary disconnects, but still being
available until the secondary is restored, we can't justifiably switch
to something that will have the primary hang for ten seconds between
monitor checks and service reloads.
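The harness being described boils down to a small decision function plus a config reload. A sketch, under stated assumptions ('standby1' is an invented standby name, and the wiring to postgresql.conf is only commented):

```shell
# Decide what synchronous_standby_names should be, given the number of
# connected standbys, e.g. from:
#   psql -Atc "SELECT count(*) FROM pg_stat_replication"
want_sync_names() {
    if [ "$1" -gt 0 ]; then
        echo "standby1"   # standby attached: require sync rep (assumed name)
    else
        echo ""           # degraded: stop waiting, run standalone
    fi
}

# Wiring sketch (not run here): write the value into postgresql.conf,
# then: psql -Atc "SELECT pg_reload_conf();"
want_sync_names 0   # empty -> degrade to async
want_sync_names 1   # standby1 -> sync rep required again
```

The ten-second window Shaun objects to is exactly the polling interval of such a loop, which is why doing this inside the server would be faster.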

I'm just saying I considered it briefly during testing the last few
days, but there's no way I can make a business case for it. PG sync rep
is a great step forward, but it's not for us. Yet.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Shaun Thomas (#8)
Re: Synchronous Standalone Master Redoux

On 10.07.2012 17:31, Shaun Thomas wrote:

> On 07/09/2012 05:15 PM, Josh Berkus wrote:
>
>> So I'm unclear on why sync rep would be faster than async rep given
>> that they use exactly the same mechanism. Explain?
>
> Too many mental gymnastics. I get that async is "faster" than sync, but
> the inconsistent transactional state makes it *look* slower. If a
> customer makes an order, but just happens to check that order state on
> the secondary before it can catch up, that's a net loss. Like I said,
> that's fine for our DR system, or a reporting mirror, or any one of
> several use-case scenarios, but it's not good enough for a failover when
> better alternatives exist. In this case, better alternatives are
> anything that can guarantee transaction durability: DRBD / PG sync.
>
> PG sync mode does what I want in that regard, it just has no graceful
> failure state without relatively invasive intervention.

You are mistaken. PostgreSQL's synchronous replication does not
guarantee that the transaction is immediately replayed in the standby.
It only guarantees that it's been sync'd to disk in the standby, but if
there are open snapshots or the system is simply busy, it might take
minutes or more until the effects of that transaction become visible.

I agree that such a mode would be highly useful, where a transaction is
not acknowledged to the client as committed until it's been replicated
*and* replayed in the standby. And in that mode, a timeout after which
the master just goes ahead without the standby would be useful. You
could then configure your middleware and/or standby to not use the
standby server for queries after that timeout.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10 Shaun Thomas
sthomas@optionshouse.com
In reply to: Heikki Linnakangas (#9)
Re: Synchronous Standalone Master Redoux

On 07/10/2012 09:40 AM, Heikki Linnakangas wrote:

> You are mistaken. It only guarantees that it's been sync'd to disk in
> the standby, but if there are open snapshots or the system is simply
> busy, it might take minutes or more until the effects of that
> transaction become visible.

Well, crap. It's subtle distinctions like this I wish I'd noticed
before. Doesn't really affect our plans, it just makes sync rep even
less viable for our use case. Thanks for the correction! :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


#11 Daniel Farina
daniel@heroku.com
In reply to: Shaun Thomas (#6)
Re: Synchronous Standalone Master Redoux

On Tue, Jul 10, 2012 at 6:28 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:

> On 07/10/2012 01:11 AM, Daniel Farina wrote:
>
>> So if I get this straight, what you are saying is "be asynchronous
>> replication unless someone is around, in which case be synchronous"
>> is the mode you want.
>
> Er, no. I think I see where you might have gotten that, but no.

From your other communications, this sounds like exactly what you
want, because RAID-1 is rather like this: on writes, a degraded RAID-1
need not wait on its (non-existent) mirror, and can be faster, but
once it has caught up it is not allowed to leave synchronization,
which is slower than writing to one disk alone, since it takes the
maximum of the time taken to write to two disks. While in the
degraded state there is effectively only one copy of the data, and
while a mirror rebuild is occurring the replication is effectively
asynchronous to bring it up to date.

--
fdr

#12 Josh Berkus
josh@agliodbs.com
In reply to: Shaun Thomas (#8)
Re: Synchronous Standalone Master Redoux

Shaun,

> Too many mental gymnastics. I get that async is "faster" than sync, but
> the inconsistent transactional state makes it *look* slower. If a
> customer makes an order, but just happens to check that order state on
> the secondary before it can catch up, that's a net loss. Like I said,
> that's fine for our DR system, or a reporting mirror, or any one of
> several use-case scenarios, but it's not good enough for a failover when
> better alternatives exist. In this case, better alternatives are
> anything that can guarantee transaction durability: DRBD / PG sync.

Per your exchange with Heikki, that's not actually how SyncRep works in
9.1. So it's not giving you what you want anyway.

This is why we felt that the "sync rep if you can" mode was useless and
didn't accept it into 9.1. The *only* difference between sync rep and
async rep is whether or not the master waits for ack that the standby
has written to log.

I think one of the new modes in 9.2 forces synch-to-DB before ack. No?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#13 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Shaun Thomas (#6)
Re: Synchronous Standalone Master Redoux

Shaun Thomas <sthomas@optionshouse.com> writes:

> When you re-connect a secondary device, it catches up as fast as possible by
> replaying waiting transactions, and then re-attaching to the cluster. Until
> it's fully caught-up, it doesn't exist. DRBD acknowledges the secondary is
> there and attempting to catch up, but does not leave "degraded" mode until
> the secondary reaches "UpToDate" status.

That's exactly what happens with PostgreSQL when using asynchronous
replication and archiving. When joining the cluster, the standby will
feed from the archives until there's nothing recent enough left over
there, and only at that point will it contact the master.

For a real graceful setup you need both archiving and replication.

Then, synchronous replication means that no transaction can make it to
the master alone. The use case is not being allowed to tell the client
it's ok when you're at risk of losing the transaction by crashing the
master when it's the only one knowing about it.

What you explain you want reads to me as "Async replication + Archiving".

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#14 Daniel Farina
daniel@heroku.com
In reply to: Dimitri Fontaine (#13)
Re: Synchronous Standalone Master Redoux

On Tue, Jul 10, 2012 at 2:42 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

> What you explain you want reads to me as "Async replication + Archiving".

Notable caveat: one can't very easily measure or bound the amount of
transaction loss in any graceful way as-is. We only have "unlimited
lag" and "2-safe or bust".

Presumably the DRBD setup run by the original poster can do this:

* run without a partner in a degraded mode (to use common RAID terminology)

* asynchronous rebuild and catch-up of a new remote RAID partner

* switch to synchronous RAID-1, which attenuates the source of block
device changes to get 2-safe reliability (i.e. blocking on
confirmations from two block devices)

However, the tricky part is DRBD's heuristic for deciding when, while
suffering degraded but non-zero performance of the network or block
device, it will drop attempts to replicate to its partner. Postgres's
interpretation is "halt, because 2-safe is currently impossible." DRBD's
seems to be "continue" (but hopefully record a statistic, because who
knows how often you are actually 2-safe, then).

For example, what if DRBD can only complete one page per second for
some reason? Does it simply have the primary wait at this glacial
pace, or drop synchronous replication and go degraded? Or does it do
something more clever than just a timeout?

These may seem like theoretical concerns, but 'slow, but non-zero'
progress has been an actual thorn in my side many times.

Regardless of what DRBD does, I think the problem with the async/sync
duality as-is is there is no nice way to manage exposure to
transaction loss under various situations and requirements. I'm not
really sure what a solution might look like; I was going to do
something grotesque and conjure carefully orchestrated standby status
packets to accomplish this.

--
fdr

#15 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Daniel Farina (#14)
Re: Synchronous Standalone Master Redoux

Daniel Farina <daniel@heroku.com> writes:

> Notable caveat: one can't very easily measure or bound the amount of
> transaction loss in any graceful way as-is. We only have "unlimited
> lag" and "2-safe or bust".

¡per-transaction!

You can change your mind mid-transaction and ask for 2-safe or bust.
That's the detail we've not been talking about in this thread and makes
the whole solution practical in real life, at least for me.
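That per-transaction knob is the synchronous_commit setting, which can be flipped inside a transaction with SET LOCAL. A sketch (the table names are invented for illustration):

```shell
psql <<'SQL'
BEGIN;
-- This commit will not wait for the sync standby (local flush only).
SET LOCAL synchronous_commit = local;
INSERT INTO audit_log VALUES (now(), 'low-value event');
COMMIT;

BEGIN;
-- The default: 2-safe or bust for this transaction.
SET LOCAL synchronous_commit = on;
INSERT INTO payments VALUES (42, 100.00);
COMMIT;
SQL
```

So high-value transactions can insist on the standby while bulk or low-value work opts out, on the same master.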

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#16 Shaun Thomas
sthomas@optionshouse.com
In reply to: Daniel Farina (#14)
Re: Synchronous Standalone Master Redoux

On 07/10/2012 06:02 PM, Daniel Farina wrote:

> For example, what if DRBD can only complete one page per second for
> some reason? Does it simply have the primary wait at this glacial
> pace, or drop synchronous replication and go degraded? Or does it do
> something more clever than just a timeout?

That's a good question, and way beyond what I know about the internals.
:) In practice though, there are configurable thresholds, and if
exceeded, it will invalidate the secondary. When using Pacemaker, we've
actually had instances where the 10G link we had between the servers
died, so each node thought the other was down. That led to the
secondary node self-promoting and trying to steal the VIP from the
primary. Throw in a gratuitous arp, and you get a huge mess.

That led to what DRBD calls split-brain, because both nodes were
running and writing to the block device. Thankfully, you can actually
tell one node to discard its changes and re-subscribe. Doing that will
replay the transactions from the "good" node on the "bad" one. And even
then, it's a good idea to run an online verify to do a block-by-block
checksum and correct any differences.

Of course, all of that's only possible because it's a block-level
replication. I can't even imagine PG doing anything like that. It would
have to know the last good transaction from the primary and do an
implied PIT recovery to reach that state, then re-attach for sync commits.

> Regardless of what DRBD does, I think the problem with the
> async/sync duality as-is is there is no nice way to manage exposure
> to transaction loss under various situations and requirements.

Which would be handy. With synchronous commits, it's given that the
protocol is bi-directional. Then again, PG can detect when clients
disconnect the instant they do so, and having such an event implicitly
disable synchronous_standby_names until reconnect would be an easy fix.
The database already keeps transaction logs, so replaying would still
happen on re-attach. It could easily throw a warning for every
sync-required commit so long as it's in "degraded" mode. Those alone are
very small changes that don't really harm the intent of sync commit.

That's basically what a RAID-1 does, and people have been fine with that
for decades.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


#17 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Shaun Thomas (#16)
Re: Synchronous Standalone Master Redoux

Shaun Thomas <sthomas@optionshouse.com> writes:

> Regardless of what DRBD does, I think the problem with the
> async/sync duality as-is is there is no nice way to manage exposure
> to transaction loss under various situations and requirements.

Yeah.

> Which would be handy. With synchronous commits, it's given that the protocol
> is bi-directional. Then again, PG can detect when clients disconnect the
> instant they do so, and having such an event implicitly disable

It's not always possible, given how TCP works, if I understand correctly.

> synchronous_standby_names until reconnect would be an easy fix. The database
> already keeps transaction logs, so replaying would still happen on
> re-attach. It could easily throw a warning for every sync-required commit so
> long as it's in "degraded" mode. Those alone are very small changes that
> don't really harm the intent of sync commit.

We already have that, with the archives. The missing piece is how to
apply that to Synchronous Replication…

> That's basically what a RAID-1 does, and people have been fine with that for
> decades.

… and we want to cover *data* availability (durability), not just
service availability.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#18Josh Berkus
josh@agliodbs.com
In reply to: Shaun Thomas (#16)
Re: Synchronous Standalone Master Redoux

On 7/11/12 6:41 AM, Shaun Thomas wrote:

> Which would be handy. With synchronous commits, it's given that the
> protocol is bi-directional. Then again, PG can detect when clients
> disconnect the instant they do so, and having such an event implicitly
> disable synchronous_standby_names until reconnect would be an easy fix.
> The database already keeps transaction logs, so replaying would still
> happen on re-attach. It could easily throw a warning for every
> sync-required commit so long as it's in "degraded" mode. Those alone are
> very small changes that don't really harm the intent of sync commit.

So your suggestion is to have an "allow degraded" switch: if the sync
standby doesn't respond within a certain threshold, the master switches
to async, with a warning for each transaction that asks for sync?
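As an illustration only, such a switch might look something like this in
postgresql.conf; the `synchronous_allow_degraded` and
`synchronous_degraded_timeout` names are hypothetical and do not exist
in any released PostgreSQL:

```
# postgresql.conf -- HYPOTHETICAL sketch of the proposed "allow degraded" switch
synchronous_standby_names = 'standby1'
synchronous_allow_degraded = on      # hypothetical: permit fallback to async
synchronous_degraded_timeout = 30s   # hypothetical: how long to wait for an
                                     # ack before falling back
# While degraded, each sync-required commit would log a WARNING.
```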

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#12)
Re: Synchronous Standalone Master Redoux

On Tue, Jul 10, 2012 at 12:57 PM, Josh Berkus <josh@agliodbs.com> wrote:

> Per your exchange with Heikki, that's not actually how SyncRep works in
> 9.1. So it's not giving you what you want anyway.
>
> This is why we felt that the "sync rep if you can" mode was useless and
> didn't accept it into 9.1. The *only* difference between sync rep and
> async rep is whether or not the master waits for ack that the standby
> has written to log.
>
> I think one of the new modes in 9.2 forces synch-to-DB before ack. No?

No. Such a mode has been discussed and draft patches have been
circulated, but nothing's been committed. The new mode in 9.2 is less
synchronous than the previous mode (wait for remote write rather than
remote fsync), not more.

Now, if we DID have such a mode, then many people would likely attempt
to use synchronous replication in that mode as a way of ensuring that
read queries can't see stale data, rather than as a method of
providing increased durability. And in that case it sure seems like
it would be useful to wait only if the standby is connected. In fact,
you'd almost certainly want to have multiple standbys running
synchronously, and have the ability to wait for only those connected
at the moment. You might also want to have a way for standbys that
lose their connection to the master to refuse to take any new
snapshots until the slave is reconnected and has caught up. Then you
could guarantee that any query run on the slave will see all the
commits that are visible on the master (and possibly more, since
commits become visible on the slave first), which would be useful for
many applications.
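For reference, the distinction Robert draws maps onto the
`synchronous_commit` setting on the master roughly like this (a sketch;
check the documentation for your exact version):

```
# postgresql.conf on the master
synchronous_standby_names = 'standby1'
synchronous_commit = on             # 9.1 behavior: wait for the standby's fsync
#synchronous_commit = remote_write  # new in 9.2: wait only for the standby's
                                    # write, not its fsync (less synchronous)
# No released mode waits for the standby to *apply* the WAL before ack.
```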

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Shaun Thomas (#16)
Re: Synchronous Standalone Master Redoux

Greetings,

On Wed, Jul 11, 2012 at 9:11 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:

> On 07/10/2012 06:02 PM, Daniel Farina wrote:
>
>> For example, what if DRBD can only complete one page per second for
>> some reason? Does it simply have the primary wait at this glacial
>> pace, or drop synchronous replication and go degraded? Or does it do
>> something more clever than just a timeout?
>
> That's a good question, and way beyond what I know about the internals. :)
> In practice though, there are configurable thresholds, and if exceeded, it
> will invalidate the secondary. When using Pacemaker, we've actually had
> instances where the 10G link we had between the servers died, so each node
> thought the other was down. That led to the secondary node self-promoting
> and trying to steal the VIP from the primary. Throw in a gratuitous ARP, and
> you get a huge mess.

That's why Pacemaker *recommends* STONITH (Shoot The Other Node In The
Head). Whenever the standby decides to promote itself, it would just
kill the former master (just in case)... the STONITH mechanism has to
use an independent connection. Additionally, a redundant link between
cluster nodes is a must.

> That led to what DRBD calls split-brain, because both nodes were running
> and writing to the block device. Thankfully, you can actually tell one node
> to discard its changes and re-subscribe. Doing that will replay the
> transactions from the "good" node on the "bad" one. And even then, it's a
> good idea to run an online verify to do a block-by-block checksum and
> correct any differences.
>
> Of course, all of that's only possible because it's block-level
> replication. I can't even imagine PG doing anything like that. It would have
> to know the last good transaction from the primary and do an implied PIT
> recovery to reach that state, then re-attach for sync commits.
>
>> Regardless of what DRBD does, I think the problem with the
>> async/sync duality as-is is there is no nice way to manage exposure
>> to transaction loss under various situations and requirements.
>
> Which would be handy. With synchronous commits, it's given that the protocol
> is bi-directional. Then again, PG can detect when clients disconnect the
> instant they do so, and having such an event implicitly disable
> synchronous_standby_names until reconnect would be an easy fix. The database
> already keeps transaction logs, so replaying would still happen on
> re-attach. It could easily throw a warning for every sync-required commit so
> long as it's in "degraded" mode. Those alone are very small changes that
> don't really harm the intent of sync commit.
>
> That's basically what a RAID-1 does, and people have been fine with that for
> decades.

I can't believe how many times I have seen this topic arise on the
mailing list... I was about to start a thread like this myself!
(Thanks, Shaun!)

I don't really get what people want out of synchronous streaming
replication. DRBD (which keeps being used as the comparison) in
protocol C is synchronous: it won't confirm a write unless it has been
written to disk on both nodes. PostgreSQL (8.4, 9.0, 9.1, ...) works
just fine on top of it, except that you don't get a standby you can
connect to. You also have to set up a dedicated volume for the DRBD
block device, configure DRBD, put the filesystem on top of DRBD, and
then handle the DRBD promotion, the partition mount (with possible FS
error handling), and finally start PostgreSQL once the FS is correctly
mounted.
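For comparison, a minimal DRBD resource using protocol C looks roughly
like this (hostnames, devices, and addresses are placeholders):

```
# /etc/drbd.d/r0.res -- minimal sketch
resource r0 {
  protocol C;              # synchronous: ack only after both nodes hit disk
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```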

With synchronous streaming replication you get about the same
guarantee: the standby has the changes written to disk before the
master confirms the commit. I don't really care whether the standby has
already applied the changes to its database (although that would
certainly be nice). The point is: the data is on the standby, and if
the master were to crash and I were to promote the standby, it would
have the same committed data the master had before it crashed.
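The streaming equivalent of that whole DRBD stack is just a couple of
settings (9.1/9.2 syntax; the host and standby names are placeholders):

```
# Master, postgresql.conf:
synchronous_standby_names = 'standby1'

# Standby, recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=master port=5432 application_name=standby1'
```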

So, why are we HA people bothering you DB people so much? To simplify
things: it is far simpler to set up synchronous streaming replication
than to set up DRBD plus the Pacemaker rules that promote DRBD, mount
the FS, and then start PostgreSQL.

Also, there is a great perk to synchronous replication with Hot
Standby: you get a read-only standby that can be used for some things
(even though it doesn't always have exactly the same data as the
master).

I mean, a lot of people here have a really valid point: 2-safe
reliability is great, but how good is it if, when you lose it, the
WHOLE system just freezes? RAID-1 gives you 2-safe reliability, but
nobody would use it if the machine froze when you lost one disk. The
same goes for DRBD: it offers 2-safe reliability too (at the block
level), but it doesn't freeze when the secondary goes away!

Now, I see some people arguing that synchronous replication is not an
HA feature (those who say that SR doesn't fit the HA environment). To
those people: why, then, is synchronous streaming replication
documented under the High Availability chapter of the PostgreSQL
manual?

I really feel bad that people are so opposed to fixing this. Having the
master notice that the standby is no longer there and fall back to
"standalone" mode seems to bother them so much that they won't even
allow *an option* for it. We are not asking you to change the default
behavior, just to add an option that makes the master gracefully
continue operation and issue warnings. After all, if you lose a disk in
a RAID array, you get some indication of the failure so you can fix it
ASAP: you know you are at risk until you do, but you can continue to
function. Name a single RAID controller that will shut down your server
on a single disk failure. I haven't seen any card that does that:
nobody would buy it.

Adding more on a related issue: what's up with the fact that the
standby doesn't respect wal_keep_segments? This forces some people to
copy the WAL files *twice*: once through streaming replication, and
again to a WAL archive. If the master dies and you have more than one
standby (say, one synchronous and two asynchronous), you can actually
point the async ones at the sync one once you promote it (as long as
you trick the sync one into *not* switching the timeline, by moving
recovery.conf out of the way and restarting instead of doing a "normal"
promotion). But without the WAL archive, if one of the standbys is too
far behind, it won't be able to catch up.
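The "copy the WAL twice" setup Jose describes is the usual workaround
(the archive path and segment count are placeholders):

```
# postgresql.conf on the master
wal_keep_segments = 128                    # retained for streaming standbys
archive_mode = on                          # second copy, for standbys that
archive_command = 'cp %p /mnt/archive/%f'  # fall too far behind streaming
```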

Please, stop arguing about all of this. I don't think adding an option
will hurt anybody (especially since the work was already done by
someone). We are not asking to change how things work; we just want an
option to decide whether the master freezes on standby disconnection or
continues automatically. Is that asking so much?

Sincerely,

Ildefonso

#21Josh Berkus
josh@agliodbs.com
In reply to: Jose Ildefonso Camargo Tolosa (#20)
#22Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Josh Berkus (#21)
#23Daniel Farina
daniel@heroku.com
In reply to: Dimitri Fontaine (#15)
#24Daniel Farina
daniel@heroku.com
In reply to: Shaun Thomas (#16)
#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Jose Ildefonso Camargo Tolosa (#20)
#26Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Jose Ildefonso Camargo Tolosa (#22)
#27Shaun Thomas
sthomas@optionshouse.com
In reply to: Daniel Farina (#24)
#28Aidan Van Dyk
aidan@highrise.ca
In reply to: Shaun Thomas (#27)
#29Bruce Momjian
bruce@momjian.us
In reply to: Amit Kapila (#25)
#30Bruce Momjian
bruce@momjian.us
In reply to: Shaun Thomas (#27)
#31Shaun Thomas
sthomas@optionshouse.com
In reply to: Bruce Momjian (#30)
#32Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Dimitri Fontaine (#26)
#33Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Aidan Van Dyk (#28)
#34Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Bruce Momjian (#29)
#35Aidan Van Dyk
aidan@highrise.ca
In reply to: Jose Ildefonso Camargo Tolosa (#32)
#36Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Aidan Van Dyk (#35)
#37Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Shaun Thomas (#31)
#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Jose Ildefonso Camargo Tolosa (#33)
#39Hampus Wessman
hampus@hampuswessman.se
In reply to: Shaun Thomas (#31)
#40Bruce Momjian
bruce@momjian.us
In reply to: Hampus Wessman (#39)
#41Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Shaun Thomas (#1)
#42Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Hampus Wessman (#39)
#43Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Bruce Momjian (#40)
#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Jose Ildefonso Camargo Tolosa (#43)
#45Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Amit Kapila (#44)
#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Jose Ildefonso Camargo Tolosa (#45)
#47Jose Ildefonso Camargo Tolosa
ildefonso.camargo@gmail.com
In reply to: Amit Kapila (#46)
#48Josh Berkus
josh@agliodbs.com
In reply to: Jose Ildefonso Camargo Tolosa (#47)
#49Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#48)
#50Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#49)
#51Daniel Farina
daniel@heroku.com
In reply to: Heikki Linnakangas (#50)
#52Bruce Momjian
bruce@momjian.us
In reply to: Jose Ildefonso Camargo Tolosa (#43)