Core team statement on replication in PostgreSQL
The Postgres core team met at PGCon to discuss a few issues, the largest
of which is the need for simple, built-in replication for PostgreSQL.
Historically the project policy has been to avoid putting replication
into core PostgreSQL, so as to leave room for development of competing
solutions, recognizing that there is no "one size fits all" replication
solution. However, it is becoming clear that this policy is hindering
acceptance of PostgreSQL to too great an extent, compared to the benefit
it offers to the add-on replication projects. Users who might consider
PostgreSQL are choosing other database systems because our existing
replication options are too complex to install and use for simple cases.
In practice, simple asynchronous single-master-multiple-slave
replication covers a respectable fraction of use cases, so we have
concluded that we should allow such a feature to be included in the core
project. We emphasize that this is not meant to prevent continued
development of add-on replication projects that cover more complex use
cases.
We believe that the most appropriate base technology for this is
probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
We hope that such a feature can be completed for 8.4. Ideally this
would be coupled with the ability to execute read-only queries on the
slave servers, but we see technical difficulties that might prevent that
from being completed before 8.5 or even further out. (The big problem
is that long-running slave-side queries might still need tuples that are
vacuumable on the master, and so replication of vacuuming actions would
cause the slave's queries to deliver wrong answers.)
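The vacuum conflict described above can be sketched as a toy MVCC model. This is purely illustrative: the class and variable names are invented for the example and are not PostgreSQL internals.

```python
class Tuple:
    def __init__(self, value, xmin, xmax=None):
        self.value = value
        self.xmin = xmin    # xid that created this version
        self.xmax = xmax    # xid that deleted/updated it (None = still live)

def visible(tup, snapshot_xid):
    """A version is visible if created before the snapshot and not yet
    deleted as of the snapshot."""
    return tup.xmin <= snapshot_xid and (tup.xmax is None or tup.xmax > snapshot_xid)

# Slave heap: one row, updated by xid 20, so two versions exist.
heap = [Tuple("old", xmin=10, xmax=20), Tuple("new", xmin=20)]

# A long-running slave query took its snapshot at xid 15: it must see "old".
snap = 15
assert [t.value for t in heap if visible(t, snap)] == ["old"]

# On the master no transaction older than xid 30 is running, so the "old"
# version is dead there and vacuum removes it; the removal is replayed here.
master_oldest_xid = 30
heap = [t for t in heap if t.xmax is None or t.xmax >= master_oldest_xid]

# The slave query now gets a wrong answer: the row it should see is gone.
assert [t.value for t in heap if visible(t, snap)] == []
```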
Again, this will not replace Slony, pgPool, Continuent, Londiste, or
other systems for many users, as it will not be highly scalable nor
support long-distance replication nor replicating less than an entire
installation. But it is time to include a simple, reliable basic
replication feature in the core system.
regards, tom lane
On 5/29/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The Postgres core team met at PGCon to discuss a few issues, the largest
of which is the need for simple, built-in replication for PostgreSQL.
[...]
We believe that the most appropriate base technology for this is
probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
We hope that such a feature can be completed for 8.4.
+1
Although I would put it more briefly - we do need a solution for
lossless failover servers, and such a solution needs to live in the core backend.
Ideally this
would be coupled with the ability to execute read-only queries on the
slave servers, but we see technical difficulties that might prevent that
from being completed before 8.5 or even further out. (The big problem
is that long-running slave-side queries might still need tuples that are
vacuumable on the master, and so replication of vacuuming actions would
cause the slave's queries to deliver wrong answers.)
Well, both Slony-I and the upcoming Skytools 3 have the same problem when
cleaning events, and have it solved simply by slaves reporting back their
lowest position on the event stream. I cannot see why it cannot be applied
in this case too. So each slave just needs to report its own longest-running
open tx as "open" to the master. Yes, it bloats the master, but there is
no way around it.
The only problem could be the plan to vacuum tuples updated in between
long-running txs and the regular ones, but such behaviour can just be turned off.
We could also have an option of an "inaccessible slave", for those who
fear bloat on the master.
--
marko
On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:
The Postgres core team met at PGCon to discuss a few issues, the
largest of which is the need for simple, built-in replication for
PostgreSQL. [...]
Ideally this would be coupled with the ability to execute read-only
queries on the slave servers, but we see technical difficulties that
might prevent that from being completed before 8.5 or even further
out. (The big problem is that long-running slave-side queries might
still need tuples that are vacuumable on the master, and so
replication of vacuuming actions would cause the slave's queries to
deliver wrong answers.)
This part is a deal-killer. It's a giant up-hill slog to sell warm
standby to those in charge of making resources available because the
warm standby machine consumes SA time, bandwidth, power, rack space,
etc., but provides no tangible benefit, and this feature would have
exactly the same problem.
IMHO, without the ability to do read-only queries on slaves, it's not
worth doing this feature at all.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 5/29/08, David Fetter <david@fetter.org> wrote:
On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:
Ideally this would be coupled with the ability to execute read-only
queries on the slave servers, but we see technical difficulties that
might prevent that from being completed before 8.5 or even further
out. (The big problem is that long-running slave-side queries might
still need tuples that are vacuumable on the master, and so
replication of vacuuming actions would cause the slave's queries to
deliver wrong answers.)

This part is a deal-killer. It's a giant up-hill slog to sell warm
standby to those in charge of making resources available because the
warm standby machine consumes SA time, bandwidth, power, rack space,
etc., but provides no tangible benefit, and this feature would have
exactly the same problem.

IMHO, without the ability to do read-only queries on slaves, it's not
worth doing this feature at all.
I would not be so harsh - I'd like to have the lossless standby even
without read-only slaves.
But Tom's mail gave me the impression that core wants to wait until we
get a "perfect" read-only slave implementation, so we would wait with it
until 8.6, which does not seem sensible. If we can do a slightly
inefficient (but simple) implementation right now, I see no reason to
reject it; we can always improve it later. Especially as it can be
switchable. And we could also have a transaction_timeout parameter on
slaves so the hit on the master is limited.
--
marko
On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote:
On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:
This part is a deal-killer. It's a giant up-hill slog to sell warm
standby to those in charge of making resources available because the
warm standby machine consumes SA time, bandwidth, power, rack space,
etc., but provides no tangible benefit, and this feature would have
exactly the same problem.

IMHO, without the ability to do read-only queries on slaves, it's not
worth doing this feature at all.
The only question I have is... what does this give us that PITR doesn't
give us?
Sincerely,
Joshua D. Drake
On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
The only question I have is... what does this give us that PITR doesn't
give us?
I think the idea is that WAL records would be shipped (possibly via
socket) and applied as they're generated, rather than on a
file-by-file basis. At least that's what "real-time" implies to me...
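The distinction drawn here can be shown in miniature: file-based shipping only transfers a WAL segment once it is complete (real WAL segments are 16 MB files), while streaming pushes each record as it is generated. The classes below are a toy model, not the actual protocol.

```python
SEGMENT_SIZE = 4  # records per "segment" in this toy model

class FileShipper:
    """Slave sees records only when a whole segment is complete."""
    def __init__(self):
        self.pending, self.shipped = [], []
    def write(self, rec):
        self.pending.append(rec)
        if len(self.pending) == SEGMENT_SIZE:   # segment full: ship it
            self.shipped.extend(self.pending)
            self.pending = []

class StreamShipper:
    """Slave sees every record as soon as it is generated."""
    def __init__(self):
        self.shipped = []
    def write(self, rec):
        self.shipped.append(rec)

f, s = FileShipper(), StreamShipper()
for rec in ["r1", "r2", "r3"]:
    f.write(rec)
    s.write(rec)

# Three records in: the streaming slave already has them all, while the
# file-based slave has nothing -- that gap is what you lose on failover.
assert s.shipped == ["r1", "r2", "r3"]
assert f.shipped == []
```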
-Doug
Marko,
But Tom's mail gave me impression core wants to wait until we get "perfect"
read-only slave implementation so we wait with it until 8.6, which does
not seem sensible. If we can do slightly inefficient (but simple)
implementation right now, I see no reason to reject it, we can always
improve it later.
That's incorrect. We're looking for a workable solution. If we could
get one for 8.4, that would be brilliant but we think it's going to be
harder than that.
Publishing the XIDs back to the master is one possibility. We also
looked at using "spillover segments" for vacuumed rows, but that seemed
even less viable.
I'm also thinking, for *async replication*, that we could simply halt
replication on the slave whenever a transaction passes minxid on the
master. However, the main focus will be on synchronous hot standby.
--Josh
On Thu, May 29, 2008 at 08:46:22AM -0700, Joshua D. Drake wrote:
On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote:
This part is a deal-killer. It's a giant up-hill slog to sell
warm standby to those in charge of making resources available
because the warm standby machine consumes SA time, bandwidth,
power, rack space, etc., but provides no tangible benefit, and
this feature would have exactly the same problem.

IMHO, without the ability to do read-only queries on slaves, it's
not worth doing this feature at all.

The only question I have is... what does this give us that PITR
doesn't give us?
It looks like a wrapper for PITR to me, so the gain would be ease of
use.
Cheers,
David.
On Thursday 29 May 2008 09:54:03 am Marko Kreen wrote:
On 5/29/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The Postgres core team met at PGCon to discuss a few issues, the largest
of which is the need for simple, built-in replication for PostgreSQL.
[...]
We believe that the most appropriate base technology for this is
probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
We hope that such a feature can be completed for 8.4.

+1
Although I would put it more briefly - we do need a solution for
lossless failover servers, and such a solution needs to live in the core backend.
+1 for lossless failover (ie, synchronous)
Josh Berkus wrote:
Marko,
But Tom's mail gave me impression core wants to wait until we get "perfect"
read-only slave implementation so we wait with it until 8.6, which does
not seem sensible. If we can do slightly inefficient (but simple)
implementation
right now, I see no reason to reject it, we can always improve it later.

That's incorrect. We're looking for a workable solution. If we could
get one for 8.4, that would be brilliant but we think it's going to be
harder than that.

Publishing the XIDs back to the master is one possibility. We also
looked at using "spillover segments" for vacuumed rows, but that seemed
even less viable.

I'm also thinking, for *async replication*, that we could simply halt
replication on the slave whenever a transaction passes minxid on the
master. However, the main focus will be on synchronous hot standby.
Another idea I discussed with Tom is having the slave _delay_ applying
WAL files until all slave snapshots are ready.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote:
On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
The only question I have is... what does this give us that PITR doesn't
give us?

I think the idea is that WAL records would be shipped (possibly via
socket) and applied as they're generated, rather than on a
file-by-file basis. At least that's what "real-time" implies to me...
Yes, we're talking real-time streaming (synchronous) log shipping.
--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
Tom Lane wrote:
In practice, simple asynchronous single-master-multiple-slave
replication covers a respectable fraction of use cases, so we have
concluded that we should allow such a feature to be included in the core
project. We emphasize that this is not meant to prevent continued
development of add-on replication projects that cover more complex use
cases.
IMHO, this will help PostgreSQL adoption, mindshare and even boost interest in
development for the other replication use cases.
We believe that the most appropriate base technology for this is
probably real-time WAL log shipping, as was demoed by NTT OSS at PGCon.
The slides are up at http://www.pgcon.org/2008/schedule/events/76.en.html
From what I gather from those slides it seems to me that the NTT solution is
synchronous not asynchronous. In my opinion it's even better, but I do
understand that others might prefer asynchronous. I'm going to speculate, but I
would think it should be possible (without a substantial rewrite) to support
both modes (or even some intermediate modes, like DRBD on Linux).
We hope that such a feature can be completed for 8.4. Ideally this
would be coupled with the ability to execute read-only queries on the
slave servers, but we see technical difficulties that might prevent that
from being completed before 8.5 or even further out. (The big problem
is that long-running slave-side queries might still need tuples that are
vacuumable on the master, and so replication of vacuuming actions would
cause the slave's queries to deliver wrong answers.)
From the 8.4dev documentation, another problem for read-only slaves would be:
« Operations on hash indexes are not presently WAL-logged, so replay will not
update these indexes. The recommended workaround is to manually REINDEX each
such index after completing a recovery operation. ».
Sincerely,
--
Mathias Brossard
* Josh Berkus <josh@agliodbs.com> [080529 11:52]:
Marko,
But Tom's mail gave me impression core wants to wait until we get "perfect"
read-only slave implementation so we wait with it until 8.6, which does
not seem sensible. If we can do slightly inefficient (but simple)
implementation
right now, I see no reason to reject it, we can always improve it later.

That's incorrect. We're looking for a workable solution. If we could
get one for 8.4, that would be brilliant but we think it's going to be
harder than that.

Publishing the XIDs back to the master is one possibility. We also
looked at using "spillover segments" for vacuumed rows, but that seemed
even less viable.

I'm also thinking, for *async replication*, that we could simply halt
replication on the slave whenever a transaction passes minxid on the
master. However, the main focus will be on synchronous hot standby.
Or, instead of statement timeout killing statements on the RO slave,
simply kill any "old" transactions on the RO slave. "Old" in the sense
that the master's xmin has passed it. And it's just an exercise in
controlling the age of xmin on the master, which could even be done
user-side.
Doesn't fit all, but no one size does... It would work for where you're
hammering your slaves with a diverse set of high-velocity short queries
that you're trying to avoid on the master...
An option to "pause replay (making it async)" or "abort transactions
(for sync)" might make it possible to easily run an async slave for slow
reporting queries, and a sync slave for short queries.
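The two conflict policies proposed here, pause replay versus abort old transactions, could be sketched side by side. All names below are hypothetical; this is a sketch of the idea, not an implementation.

```python
def handle_conflict(policy, replay_queue, running_queries, record_xmin):
    """Decide what to do when a replayed vacuum record (with horizon
    record_xmin) would remove tuples that old slave queries still need."""
    conflicting = [q for q in running_queries if q["snapshot_xid"] < record_xmin]
    if not conflicting:
        return "applied"
    if policy == "pause":            # reporting slave: let queries finish first
        replay_queue.append(record_xmin)
        return "paused"
    if policy == "cancel":           # HA slave: stay current, cancel old queries
        for q in conflicting:
            running_queries.remove(q)
        return "applied"
    raise ValueError(policy)

queries = [{"name": "6h report", "snapshot_xid": 40}]

# Async reporting slave: replay waits while the long query runs.
assert handle_conflict("pause", [], list(queries), 50) == "paused"

# Sync HA slave: the record is applied and the old query is killed.
live = list(queries)
assert handle_conflict("cancel", [], live, 50) == "applied"
assert live == []
```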
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
Joshua D. Drake wrote:
On Thu, 2008-05-29 at 08:21 -0700, David Fetter wrote:
On Thu, May 29, 2008 at 10:12:55AM -0400, Tom Lane wrote:
This part is a deal-killer. It's a giant up-hill slog to sell warm
standby to those in charge of making resources available because the
warm standby machine consumes SA time, bandwidth, power, rack space,
etc., but provides no tangible benefit, and this feature would have
exactly the same problem.

IMHO, without the ability to do read-only queries on slaves, it's not
worth doing this feature at all.

The only question I have is... what does this give us that PITR doesn't
give us?
Since people seem to be unclear on what we're proposing:
8.4 Synchronous Warm Standby: makes PostgreSQL more suitable for HA
systems by eliminating failover data loss and cutting failover time.
8.5 (probably) Synchronous & Asynchronous Hot Standby: adds read-only
queries on slaves to the above.
Again, if we can implement queries on slaves for 8.4, we're all for it.
However, after conversations in Core and with Simon we all think it's
going to be too big a task to complete in 4-5 months. We *don't* want
to end up delaying 8.4 for 5 months because we're debugging hot standby.
--Josh
David Fetter wrote:
This part is a deal-killer. It's a giant up-hill slog to sell warm
standby to those in charge of making resources available because the
warm standby machine consumes SA time, bandwidth, power, rack space,
etc., but provides no tangible benefit, and this feature would have
exactly the same problem.

IMHO, without the ability to do read-only queries on slaves, it's not
worth doing this feature at all.
I don't think I agree with this. There are a large number of situations
where it's positive expectancy to do precisely this- it's not unlike
buying a $1 lottery ticket with a 1 chance in 100 of winning $1000- the
vast majority of the time (99 times out of 100), you're going to lose
$1. But when you win, you win big, and make up for all the small losses
you incurred getting there and then some. Failover machines are like
that- most of the time they're negative value, as you said- taking up SA
time, bandwidth, power, rack space, money, etc. But every once in a
(great) while, they save you. If the cost of having the database down
for hours or days (as you madly try to next-day replacement hardware)
isn't that great, then no, this isn't worthwhile- but in cases where
the database being down chalks up the lost money quickly, this is easy
to cost-justify.
Being able to do read-only queries makes this feature more valuable in
more situations, but I disagree that it's a deal-breaker.
Brian
On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote:
Josh Berkus wrote:
Publishing the XIDs back to the master is one possibility. We
also looked at using "spillover segments" for vacuumed rows, but
that seemed even less viable.

I'm also thinking, for *async replication*, that we could simply
halt replication on the slave whenever a transaction passes minxid
on the master. However, the main focus will be on synchronous
hot standby.

Another idea I discussed with Tom is having the slave _delay_
applying WAL files until all slave snapshots are ready.
Either one of these would be great, but something that involves
machines that stay useless most of the time is just not going to work.
Cheers,
David.
David Fetter wrote:
On Thu, May 29, 2008 at 11:58:31AM -0400, Bruce Momjian wrote:
Josh Berkus wrote:
Publishing the XIDs back to the master is one possibility. We
also looked at using "spillover segments" for vacuumed rows, but
that seemed even less viable.

I'm also thinking, for *async replication*, that we could simply
halt replication on the slave whenever a transaction passes minxid
on the master. However, the main focus will be on synchronous
hot standby.

Another idea I discussed with Tom is having the slave _delay_
applying WAL files until all slave snapshots are ready.

Either one of these would be great, but something that involves
machines that stay useless most of the time is just not going to work.
Right, the ultimate target is to have the slave be read-only, but we
need to get the streaming of WAL logs done first.
* Dave Page <dpage@pgadmin.org> [080529 12:03]:
On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote:
I think the idea is that WAL records would be shipped (possibly via
socket) and applied as they're generated, rather than on a
file-by-file basis. At least that's what "real-time" implies to me...

Yes, we're talking real-time streaming (synchronous) log shipping.
But synchronous streaming doesn't mean the WAL has to be *applied* on
the slave yet. Just that it has to be "safely" on the slave (i.e. on
disk, not just in kernel buffers).

The whole single-threaded WAL replay problem is going to rear its ugly
head here too, and mean that a slave *won't* be able to keep up with a
busy master if it's actually trying to apply all the changes in
real-time. Well, actually, if it's synchronous, it will keep up, but it
just means that your master's IO capability is now limited to the
speed of the slave's single-threaded WAL application.
a.
Bruce,
Another idea I discussed with Tom is having the slave _delay_ applying
WAL files until all slave snapshots are ready.
Well, again, that only works for async mode. I personally think that's
the correct solution for async. But for synch mode, I think we need to
push the xids back to the master; generally if a user is running in
synch mode they're concerned about failover time and zero data loss, so
holding back the WAL files doesn't make sense.
Also, if you did delay applying WAL files on an async slave, you'd reach
a point (perhaps after a 6-hour query) where it'd actually be cheaper to
rebuild the slave than to apply the pent-up WAL files.
--Josh Berkus
Dave Page wrote:
On Thu, May 29, 2008 at 4:48 PM, Douglas McNaught <doug@mcnaught.org> wrote:
On Thu, May 29, 2008 at 11:46 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
The only question I have is... what does this give us that PITR doesn't
give us?

I think the idea is that WAL records would be shipped (possibly via
socket) and applied as they're generated, rather than on a
file-by-file basis. At least that's what "real-time" implies to me...

Yes, we're talking real-time streaming (synchronous) log shipping.
That's not what Tom's email said, AIUI. "Synchronous" replication surely
means that the master and slave always have the same set of transactions
applied. Streaming <> synchronous. But streaming log shipping will allow
us to get closer to synchronicity in some situations, i.e. the
window for missing transactions will be much smaller.
Some of us were discussing this late on Friday night after PGcon. ISTM
that we can have either 1) fairly hot failover slaves that are
guaranteed to be almost up to date, or 2) slaves that can support
read-only transactions but might get somewhat out of date if they run
long transactions. The big problem is in having slaves which are both
highly up to date and support arbitrary read-only transactions. Maybe in
the first instance, at least, we need to make slaves choose which role
they will play.
cheers
andrew