Switching timeline over streaming replication
I've been working on the often-requested feature to handle timeline
changes over streaming replication. At the moment, if you kill the
master and promote a standby server, and you have another standby server
that you'd like to keep following the new master server, you need a WAL
archive in addition to streaming replication to make it cross the
timeline change. Streaming replication will just error out. Having a WAL
archive is usually a good idea in complex replication scenarios anyway,
but it would be good to not require it.
Attached is a WIP patch for that. It needs cleanup, but works.
Protocol changes
----------------
When we invented the COPY-both mode, we left out any description of how
to get out of that mode, simply stating that both ends "may then send
CopyData messages until the connection is terminated". The patch makes
it possible to revert back to regular processing, by sending a CopyDone
message, like in normal Copy-in or Copy-out mode. Either end can take
the initiative and send CopyDone, and after doing that may not send any
more CopyData messages. When both ends have sent a CopyDone message, and
received a CopyDone message from the other end, the connection is out of
Copy-mode, and the server finishes the command with a CommandComplete
message.
Another way to think of it is that when the server sends a CopyDone
message, the connection switches from Copy-both to Copy-in mode. And if
the client sends a CopyDone message first, the connection goes from
Copy-both to Copy-out mode, until the server ends the streaming from its
end.
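The mode transitions described above can be sketched as a small state machine. This is only an illustration of the rules in the text, not code from the patch; the enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical names; the states mirror the modes described in the text. */
typedef enum
{
    MODE_COPY_BOTH,     /* neither end has sent CopyDone yet */
    MODE_COPY_IN,       /* server sent CopyDone; only the client still streams */
    MODE_COPY_OUT,      /* client sent CopyDone; only the server still streams */
    MODE_IDLE           /* both sent CopyDone; server sends CommandComplete */
} CopyMode;

/* New mode after the server sends CopyDone. */
CopyMode
server_sends_copy_done(CopyMode mode)
{
    switch (mode)
    {
        case MODE_COPY_BOTH:
            return MODE_COPY_IN;
        case MODE_COPY_OUT:
            return MODE_IDLE;
        default:
            return mode;    /* would be a protocol violation; left unchanged here */
    }
}

/* New mode after the client sends CopyDone. */
CopyMode
client_sends_copy_done(CopyMode mode)
{
    switch (mode)
    {
        case MODE_COPY_BOTH:
            return MODE_COPY_OUT;
        case MODE_COPY_IN:
            return MODE_IDLE;
        default:
            return mode;    /* would be a protocol violation; left unchanged here */
    }
}

/* An end that has already sent CopyDone may not send further CopyData. */
bool
server_may_send_copy_data(CopyMode mode)
{
    return mode == MODE_COPY_BOTH || mode == MODE_COPY_OUT;
}

bool
client_may_send_copy_data(CopyMode mode)
{
    return mode == MODE_COPY_BOTH || mode == MODE_COPY_IN;
}
```

Whichever end sends CopyDone first, the connection reaches MODE_IDLE only after the other end has sent its own CopyDone as well.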
New replication command: TIMELINE_HISTORY
-----------------------------------------
To switch recovery target timeline, a standby needs the timeline history
file (e.g. 00000002.history) of the new timeline. The patch adds a new
command to the set of commands accepted by walsender, to transmit a
given timeline history file from master to slave.
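As a small illustration of the file naming only (not of the new command's syntax, which is not shown here), a history file name is the timeline ID as eight uppercase hex digits followed by ".history". The helper names below are hypothetical; the second one is the kind of check a scan of pg_xlog for history files would need.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t TimeLineID;

/* Build the history file name for a timeline, e.g. 2 -> "00000002.history". */
void
timeline_history_file_name(char *buf, size_t buflen, TimeLineID tli)
{
    snprintf(buf, buflen, "%08X.history", (unsigned int) tli);
}

/* Check whether a file name looks like a timeline history file. */
bool
is_timeline_history_file_name(const char *fname)
{
    return strlen(fname) == 8 + strlen(".history") &&
           strspn(fname, "0123456789ABCDEF") == 8 &&
           strcmp(fname + 8, ".history") == 0;
}
```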
Walsender changes to stream a particular timeline
-------------------------------------------------
The walsender now keeps track of exactly which timeline it is currently
streaming; it's not necessarily the latest one anymore. The
START_REPLICATION command is extended with a TIMELINE option that the
client can use to request streaming from a particular timeline. If the
client asks for a timeline that's not the current, but is part of the
history of the server, the walsender knows to read from the correct WAL
file that contains that. Also, the walsender knows where the server's
history branched off from that timeline, and will only stream WAL up to
that point. When that point is reached, it ends the streaming (with a
CopyDone message), and prepares to accept a new replication command.
Typically, the walreceiver will then ask to start streaming from the
next timeline.
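The decision the walsender makes for a requested timeline could look roughly like the sketch below. It is not taken from the patch: the struct, the enum and the function name are hypothetical, and the history is assumed to be available as an array of (timeline, switchpoint) pairs.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* Hypothetical: one ancestor timeline and the point where history left it. */
typedef struct
{
    TimeLineID  tli;
    XLogRecPtr  switchpoint;
} HistoryEntry;

typedef enum
{
    STREAM_UNBOUNDED,           /* current timeline: stream with no end point */
    STREAM_UNTIL_SWITCHPOINT,   /* ancestor: stream up to *stop_at, then CopyDone */
    STREAM_REFUSED              /* timeline is not part of this server's history */
} StreamPlan;

/* Decide how to serve a request to stream from timeline "requested". */
StreamPlan
plan_streaming(TimeLineID requested, TimeLineID current,
               const HistoryEntry *history, size_t nentries,
               XLogRecPtr *stop_at)
{
    if (requested == current)
        return STREAM_UNBOUNDED;

    for (size_t i = 0; i < nentries; i++)
    {
        if (history[i].tli == requested)
        {
            *stop_at = history[i].switchpoint;
            return STREAM_UNTIL_SWITCHPOINT;
        }
    }
    return STREAM_REFUSED;
}
```

After a STREAM_UNTIL_SWITCHPOINT request has been served, the client would normally issue a new request for the next timeline, starting at the switchpoint.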
Walreceiver changes
-------------------
Previously, when the timeline reported by the server didn't match the
current timeline in the standby, walreceiver simply errored out. Now, it
requests any missing timeline history files using the new
TIMELINE_HISTORY command, and then tries to start replication from the
standby's current timeline, even if that's older than the master's.
When the end of the old timeline is reached, walreceiver sets a state
variable in shared memory to indicate that, pings the startup
process, and waits for the startup process for new orders. The startup
process can set receiveStart and timeline in shared memory and ping the
walreceiver again, to get the walreceiver to restart streaming from the
new starting point [1]. Before the startup process does that, it will
scan pg_xlog for new timeline history files if
recovery_target_timeline='latest'. It will find any new history files
the walreceiver stored there, and switch over to the latest timeline
just as it does with a WAL archive.
Some parts of this patch are just refactoring that probably make sense
regardless of the new functionality. For example, I split off the
timeline history file related functions to a new file, timeline.c.
That's not very much code, but it's fairly isolated, and xlog.c is
massive, so I feel that anything that we can move off from xlog.c is a
good thing. I also moved off the two functions RestoreArchivedFile() and
ExecuteRecoveryCommand(), to a separate file. Those are also not much
code, but are fairly isolated. If no-one objects to those changes, and
the general direction this work is going in, I'm going to split off those
refactorings to separate patches and commit them separately.
I also made the timeline history file a bit more detailed: instead of
recording just the WAL segment where the timeline was changed, it now
records the exact XLogRecPtr. That was required for the walsender to
know the switchpoint, without having to parse the XLOG records (it reads
and parses the history file, instead).
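To illustrate what reading an exact switchpoint from a history file involves, here is a minimal sketch. The line layout is an assumption, not something stated above: a timeline ID, then the switchpoint written as two hex halves separated by a slash, then a free-form reason, all tab-separated. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/*
 * Parse one history line of the ASSUMED form "<tli>\t<hi>/<lo>\t<reason>".
 * Returns false for lines that do not match, such as comment lines.
 */
bool
parse_history_line(const char *line, TimeLineID *tli, XLogRecPtr *switchpoint)
{
    unsigned int parsed_tli;
    unsigned int hi;
    unsigned int lo;

    if (sscanf(line, "%u\t%X/%X", &parsed_tli, &hi, &lo) != 3)
        return false;

    *tli = parsed_tli;
    *switchpoint = ((XLogRecPtr) hi << 32) | lo;
    return true;
}
```

With the exact pointer in hand, the sender can stop at the switchpoint without decoding any WAL records.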
[1]: Initially, I tried to do this by simply letting walreceiver die and
have the startup process launch a new walreceiver process that would
reconnect, but it turned out to be hard to rapidly disconnect and
connect, because the postmaster, which forks the walreceiver process,
does not always have the same idea of whether the walreceiver is active
as the startup process does. It would eventually be ok, thanks to
timeouts, but would require polling. But not having to disconnect seems
nicer, anyway.
- Heikki
Attachments:
streaming-tli-switch-1.patch.gz (application/x-gzip)
On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
I've been working on the often-requested feature to handle timeline
changes over streaming replication. At the moment, if you kill the
master and promote a standby server, and you have another standby
server that you'd like to keep following the new master server, you
need a WAL archive in addition to streaming replication to make it
cross the timeline change. Streaming replication will just error out.
Having a WAL archive is usually a good idea in complex replication
scenarios anyway, but it would be good to not require it.
Confirm my understanding of this feature:
This feature is for case when standby-1 who is going to be promoted to
master has archive mode 'on'.
As in that case only its timeline will change.
If above is right, then there can be other similar scenario's where it can
be used:
Scenario-1 (1 Master, 1 Stand-by)
1. Master (archive_mode=on) goes down.
2. Master again comes up
3. Stand-by tries to follow it
Now in above scenario also due to timeline mismatch it gives error, but your
patch should fix it.
Some parts of this patch are just refactoring that probably make sense
regardless of the new functionality. For example, I split off the
timeline history file related functions to a new file, timeline.c.
That's not very much code, but it's fairly isolated, and xlog.c is
massive, so I feel that anything that we can move off from xlog.c is a
good thing. I also moved off the two functions RestoreArchivedFile()
and ExecuteRecoveryCommand(), to a separate file. Those are also not
much code, but are fairly isolated. If no-one objects to those changes,
and the general direction this work is going to, I'm going split off
those refactorings to separate patches and commit them separately.
I also made the timeline history file a bit more detailed: instead of
recording just the WAL segment where the timeline was changed, it now
records the exact XLogRecPtr. That was required for the walsender to
know the switchpoint, without having to parse the XLOG records (it
reads and parses the history file, instead)
IMO separating timeline history file related functions to a new file is
good.
However I am not sure about splitting for RestoreArchivedFile() and
ExecuteRecoveryCommand() into separate file.
How about splitting for all Archive related functions:
static void XLogArchiveNotify(const char *xlog);
static void XLogArchiveNotifySeg(XLogSegNo segno);
static bool XLogArchiveCheckDone(const char *xlog);
static bool XLogArchiveIsBusy(const char *xlog);
static void XLogArchiveCleanup(const char *xlog);
..
..
In any case, it will be better if you can split it into multiple patches:
1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.
It can make my testing and review for new feature patch little easier.
With Regards,
Amit Kapila.
On Monday, September 24, 2012 9:08 PM md@rpzdesign.com wrote:
What a disaster waiting to happen. Maybe the only replication should be
master-master replication
so there is no need to sequence timelines or anything, all servers are
ready masters, no backups or failovers.
If you really do not want a master serving, then it should only be
handled in the routing
of traffic to that server and not the replication logic itself. The
only thing that ever came about
from failovers was the failure to turn over. The above is opinion
only.
This feature is for users who want to use master-standby configurations.
What do you mean by :
"then it should only be handled in the routing of traffic to that server
and not the replication logic itself."
Do you have any idea other than the proposed implementation, or do you see
any problem in the currently proposed solution?
On 24.09.2012 16:33, Amit Kapila wrote:
On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
I've been working on the often-requested feature to handle timeline
changes over streaming replication. At the moment, if you kill the
master and promote a standby server, and you have another standby
server that you'd like to keep following the new master server, you
need a WAL archive in addition to streaming replication to make it
cross the timeline change. Streaming replication will just error out.
Having a WAL archive is usually a good idea in complex replication
scenarios anyway, but it would be good to not require it.
Confirm my understanding of this feature:
This feature is for case when standby-1 who is going to be promoted to
master has archive mode 'on'.
No. This is for the case where there is no WAL archive.
archive_mode='off' on all servers.
Or to be precise, you can also have a WAL archive, but this patch
doesn't affect that in any way. This is strictly about streaming
replication.
As in that case only its timeline will change.
The timeline changes whenever you promote a standby. It's not related to
whether you have a WAL archive or not.
If above is right, then there can be other similar scenario's where it can
be used:
Scenario-1 (1 Master, 1 Stand-by)
1. Master (archive_mode=on) goes down.
2. Master again comes up
3. Stand-by tries to follow it
Now in above scenario also due to timeline mismatch it gives error, but your
patch should fix it.
If the master simply crashes or is shut down, and then restarted, the
timeline doesn't change. The standby will reconnect / poll the archive,
and sync up just fine, even without this patch.
However I am not sure about splitting for RestoreArchivedFile() and
ExecuteRecoveryCommand() into separate file.
How about splitting for all Archive related functions:
static void XLogArchiveNotify(const char *xlog);
static void XLogArchiveNotifySeg(XLogSegNo segno);
static bool XLogArchiveCheckDone(const char *xlog);
static bool XLogArchiveIsBusy(const char *xlog);
static void XLogArchiveCleanup(const char *xlog);
Hmm, sounds reasonable.
In any case, it will be better if you can split it into multiple patches:
1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.
It can make my testing and review for new feature patch little easier.
Yep, I'll go ahead and split the patch. Thanks!
- Heikki
On Tuesday, September 25, 2012 12:39 PM Heikki Linnakangas wrote:
On 24.09.2012 16:33, Amit Kapila wrote:
On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
I've been working on the often-requested feature to handle timeline
changes over streaming replication. At the moment, if you kill the
master and promote a standby server, and you have another standby
server that you'd like to keep following the new master server, you
need a WAL archive in addition to streaming replication to make it
cross the timeline change. Streaming replication will just error out.
Having a WAL archive is usually a good idea in complex replication
scenarios anyway, but it would be good to not require it.
Confirm my understanding of this feature:
This feature is for case when standby-1 who is going to be promoted to
master has archive mode 'on'.
No. This is for the case where there is no WAL archive.
archive_mode='off' on all servers.
Or to be precise, you can also have a WAL archive, but this patch
doesn't affect that in any way. This is strictly about streaming
replication.
As in that case only its timeline will change.
The timeline changes whenever you promote a standby. It's not related to
whether you have a WAL archive or not.
Yes that is correct. I thought timeline change happens only when somebody
does PITR.
Can you please tell me why we change timeline after promotion, because the
original Timeline concept was for PITR and I am not able to trace from code
the reason why on promotion it is required?
If above is right, then there can be other similar scenario's where it can
be used:
Scenario-1 (1 Master, 1 Stand-by)
1. Master (archive_mode=on) goes down.
2. Master again comes up
3. Stand-by tries to follow it
Now in above scenario also due to timeline mismatch it gives error, but your
patch should fix it.
If the master simply crashes or is shut down, and then restarted, the
timeline doesn't change. The standby will reconnect / poll the archive,
and sync up just fine, even without this patch.
How about when Master does PITR when it comes again?
With Regards,
Amit Kapila.
On 25.09.2012 14:10, Amit Kapila wrote:
On Tuesday, September 25, 2012 12:39 PM Heikki Linnakangas wrote:
On 24.09.2012 16:33, Amit Kapila wrote:
On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
I've been working on the often-requested feature to handle timeline
changes over streaming replication. At the moment, if you kill the
master and promote a standby server, and you have another standby
server that you'd like to keep following the new master server, you
need a WAL archive in addition to streaming replication to make it
cross the timeline change. Streaming replication will just error out.
Having a WAL archive is usually a good idea in complex replication
scenarios anyway, but it would be good to not require it.
Confirm my understanding of this feature:
This feature is for case when standby-1 who is going to be promoted to
master has archive mode 'on'.
No. This is for the case where there is no WAL archive.
archive_mode='off' on all servers.
Or to be precise, you can also have a WAL archive, but this patch
doesn't affect that in any way. This is strictly about streaming
replication.
As in that case only its timeline will change.
The timeline changes whenever you promote a standby. It's not related to
whether you have a WAL archive or not.
Yes that is correct. I thought timeline change happens only when somebody
does PITR.
Can you please tell me why we change timeline after promotion, because the
original Timeline concept was for PITR and I am not able to trace from code
the reason why on promotion it is required?
Bumping the timeline helps to avoid confusion if, for example, the
master crashes, and the standby isn't fully in sync with it. In that
situation, there are some WAL records in the master that are not in the
standby, so promoting the standby is effectively the same as doing PITR.
If you promote the standby, and later try to turn the old master into a
standby server that connects to the new master, things will go wrong.
Assigning the new master a new timeline ID helps the system and the
administrator to notice that.
It's not bulletproof; for example, you can easily avoid the timeline
change if you just remove recovery.conf and restart the server, but the
timelines help to manage such situations.
If above is right, then there can be other similar scenario's where it can
be used:
Scenario-1 (1 Master, 1 Stand-by)
1. Master (archive_mode=on) goes down.
2. Master again comes up
3. Stand-by tries to follow it
Now in above scenario also due to timeline mismatch it gives error, but your
patch should fix it.
If the master simply crashes or is shut down, and then restarted, the
timeline doesn't change. The standby will reconnect / poll the archive,
and sync up just fine, even without this patch.
How about when Master does PITR when it comes again?
Then the timeline will be bumped and this patch will be helpful.
Assuming the standby is behind the point in time that the master was
recovered to, it will be able to follow the master to the new timeline.
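The condition in the previous paragraph can be written down as a one-line check. This is a sketch with hypothetical names, comparing the standby's WAL position with the point where the new timeline branched off.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * A standby can cross to the new timeline only if it has not gone past
 * the point where that timeline branched off. If it is already beyond
 * the switchpoint, it holds WAL that the new timeline does not contain,
 * so the two histories have diverged.
 */
bool
standby_can_follow(XLogRecPtr standby_position, XLogRecPtr switchpoint)
{
    return standby_position <= switchpoint;
}
```

A standby sitting exactly at the switchpoint can still follow, because everything it has is shared by both timelines.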
- Heikki
On 25.09.2012 10:08, Heikki Linnakangas wrote:
On 24.09.2012 16:33, Amit Kapila wrote:
In any case, it will be better if you can split it into multiple patches:
1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.
It can make my testing and review for new feature patch little easier.
Yep, I'll go ahead and split the patch. Thanks!
Ok, here you go. xlog-c-split-1.patch contains the refactoring of
existing code, with no user-visible changes.
streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and
contains the new functionality.
- Heikki
On Tuesday, September 25, 2012 6:29 PM Heikki Linnakangas wrote:
On 25.09.2012 10:08, Heikki Linnakangas wrote:
On 24.09.2012 16:33, Amit Kapila wrote:
In any case, it will be better if you can split it into multiple patches:
1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.
It can make my testing and review for new feature patch little easier.
Yep, I'll go ahead and split the patch. Thanks!
Ok, here you go. xlog-c-split-1.patch contains the refactoring of
existing code, with no user-visible changes.
streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and
contains the new functionality.
Thanks, this will make my review easier than before.
With Regards,
Amit Kapila.
Amit:
At some point, every master - slave replicator gets to the point where
they need
to start thinking about master-master replication.
Instead of getting stuck in the weeds to finally realize that
master-master is the ONLY way
to go, many developers do not start out planning for master - master,
but they should, out of habit.
You can save yourself a lot of grief just by starting with master-master
architecture.
But you don't have to USE it, you can just not send WRITE traffic to the
servers that you do
not want to WRITE to, but all of them should be WRITE servers. That way,
the only timeline
you ever need is your decision to send WRITE traffic request to them,
but there is nothing
that prevents you from running MASTER - MASTER all the time and skip the
whole slave thing
entirely.
At this point, I think synchronous replication is only for immediate
local replication needs
and async for all the master - master stuff.
cheers,
marco
On Tue, Sep 25, 2012 at 11:01 AM, md@rpzdesign.com <md@rpzdesign.com> wrote:
Amit:
At some point, every master - slave replicator gets to the point where they
need
to start thinking about master-master replication.
Even in a master-master system, the ability to cleanly swap leaders
managing a member of the master-master cluster is very useful. This
patch can make writing HA software for Postgres a lot less ridiculous.
Instead of getting stuck in the weeds to finally realize that master-master
is the ONLY way
to go, many developers do not start out planning for master - master, but
they should, out of habit.
You can save yourself a lot of grief just by starting with master-master
architecture.
I've seen more projects get stuck spinning their wheels on the one
Master-Master system to rule them all than succeed and move on. It
doesn't help that master-master does not have a single definition, and
different properties are possible with different logical models, too,
so that pervades its way up to the language layer.
As-is, managing single-master HA Postgres is a huge pain without this
patch. If there is work to be done on master-master, the logical
replication and event trigger work are probably more relevant, and I
know the authors of those projects are keen to make it more feasible
to experiment.
--
fdr
On 09/25/12 11:01 AM, md@rpzdesign.com wrote:
At some point, every master - slave replicator gets to the point where
they need
to start thinking about master-master replication.
master-master and transactional integrity are mutually exclusive, except
perhaps in special cases like Oracle RAC, where the masters share a
coherent cache and implement global locks.
--
john r pierce N 37, W 122
santa cruz ca mid-left coast
John:
Who has the money for oracle RAC or funding arrogant bastard Oracle CEO
Ellison to purchase another island?
Postgres needs CHEAP, easy to setup, self healing,
master-master-master-master and it needs it yesterday.
I was able to patch the 9.2.0 code base in 1 day and change my entire
architecture strategy for replication
into self healing async master-master-master and the tiniest bit of
sharding code imaginable
That is why I suggest something to replace OIDs with ROIDs for
replication ID. (CREATE TABLE with ROIDS)
I implement ROIDs as a uniform design pattern for the table structures.
Synchronous replication maybe between 2 local machines if absolutely no
local
hardware failure is acceptable, but cheap, scalable synchronous,
TRANSACTIONAL, master-master-master-master is a real tough slog.
I could implement global locks in the external replication layer if I
choose, but there are much easier ways in routing
requests thru the load balancer and request sharding than trying to
manage global locks across the WAN.
Good luck with your HA patch for Postgres.
Thanks for all of the responses!
You guys are 15 times more active than the MySQL developer group, likely
because
they do not have a single db engine that meets all the requirements like PG.
marco
I was able to patch the 9.2.0 code base in 1 day and change my entire
architecture strategy for replication
into self healing async master-master-master and the tiniest bit of
sharding code imaginable
Sounds cool. Do you have a fork available on Github? I'll try it out.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Yes that is correct. I thought timeline change happens only when somebody
does PITR.
Can you please tell me why we change timeline after promotion, because the
original Timeline concept was for PITR and I am not able to trace from code
the reason why on promotion it is required?
The idea behind the timeline switch is to prevent a server from
subscribing to a master which is actually behind it. For example,
consider this sequence:
1. M1->async->S1
2. M1 is at xid 2001 and fails.
3. S1 did not receive transaction 2001 and is at xid 2000
4. S1 is promoted.
5. S1 processed a new, different transaction 2001
6. M1 is repaired and brought back up
7. M1 is subscribed to S1
8. M1 is now corrupt.
That's why we need the timeline switch.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh:
The good part is you are the first person to ask for a copy
and I will send you the hook code that I have and you can be a good sport
and put it on GitHub, that is great, you can give us both credit for a
joint effort, I do the code,
you put it on GitHub.
The not so good part is that the community has a bunch of other trigger work
and other stuff going on, so there was not much interest in non-WAL
replication hook code.
I do not have time to debate implementation nor wait for release of 9.3
with my needs not met, so I will just keep patching the hook code into
whatever release
code base comes along.
The bad news is that I have not implemented the logic of the external
replication daemon.
The other good and bad news is that you are free to receive the messages
from the hook code
thru the unix socket and implement replication any way you want and the
bad news is that you are free
to IMPLEMENT replication any way you want.
I am going to implement master-master-master-master SELF HEALING
replication, but that is just my preference.
Should take about a week to get it operational and another week to see
how it works in my geographically dispersed
servers in the cloud.
Send me a note if it is ok to send you a zip file with the source code
files that I touched in the 9.2 code base so you
can shove it up on GitHub.
Cheers,
marco
On Thursday, September 27, 2012 6:30 AM Josh Berkus wrote:
Yes that is correct. I thought timeline change happens only when
somebody
does PITR.
Can you please tell me why we change timeline after promotion, because the
original
Timeline concept was for PITR and I am not able to trace from code the
reason
why on promotion it is required?
The idea behind the timeline switch is to prevent a server from
subscribing to a master which is actually behind it. For example,
consider this sequence:
1. M1->async->S1
2. M1 is at xid 2001 and fails.
3. S1 did not receive transaction 2001 and is at xid 2000
4. S1 is promoted.
5. S1 processed a new, different transaction 2001
6. M1 is repaired and brought back up
7. M1 is subscribed to S1
8. M1 is now corrupt.
That's why we need the timeline switch.
Thanks.
I understood this point, but currently the documentation of Timelines (Section 24.3.5) does not describe this use case.
With Regards,
Amit Kapila.
On 09/26/2012 01:02 AM, md@rpzdesign.com wrote:
John:
Who has the money for oracle RAC or funding arrogant bastard Oracle
CEO Ellison to purchase another island?
Postgres needs CHEAP, easy to setup, self healing,
master-master-master-master and it needs it yesterday.
I was able to patch the 9.2.0 code base in 1 day and change my entire
architecture strategy for replication
into self healing async master-master-master and the tiniest bit of
sharding code imaginable
Tell us about the compromises you had to make.
It is an established fact that you can either have it replicate fast and
loose or slow and correct.
In the fast and loose case you have to be ready to do a lot of
mopping-up in case of conflicts.
That is why I suggest something to replace OIDs with ROIDs for
replication ID. (CREATE TABLE with ROIDS)
I implement ROIDs as a uniform design pattern for the table structures.
Synchronous replication maybe between 2 local machines if absolutely
no local
hardware failure is acceptable, but cheap, scaleable synchronous,
Scaleable / synchronous is probably doable, if we are ready to take the
initial performance hit of lock propagation.
TRANSACTIONAL, master-master-master-master is a real tough slog.
I could implement global locks in the external replication layer if I
choose, but there are much easier ways in routing
requests thru the load balancer and request sharding than trying to
manage global locks across the WAN.
Good luck with your HA patch for Postgres.
Thanks for all of the responses!
You guys are 15 times more active than the MySQL developer group,
likely because
they do not have a single db engine that meets all the requirements
like PG.
marco
On 9/25/2012 5:10 PM, John R Pierce wrote:
On 09/25/12 11:01 AM, md@rpzdesign.com wrote:
At some point, every master - slave replicator gets to the point
where they need
to start thinking about master-master replication.
master-master and transactional integrity are mutually exclusive,
except perhaps in special cases like Oracle RAC, where the masters
share a coherent cache and implement global locks.
On 9/26/12 6:17 PM, md@rpzdesign.com wrote:
Josh:
The good part is you are the first person to ask for a copy
and I will send you the hook code that I have and you can be a good sport
and put it on GitHub, that is great, you can give us both credit for a
joint effort, I do the code,
you put it GitHub.
Well, I think it just makes sense for you to put it up somewhere public
so that folks can review it; if not Github, then somewhere else. If
it's useful and well-written, folks will be interested.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 27-09-2012 01:30, Amit Kapila wrote:
I understood this point, but currently in documentation of Timelines, this usecase is not documented (Section 24.3.5).
Timeline documentation was written during PITR implementation. There wasn't SR
yet. AFAICS it doesn't cite SR but is sufficiently generic (it uses the term
'WAL records' to explain the feature). Feel free to reword those paragraphs
to mention SR.
--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On Tuesday, September 25, 2012 6:29 PM Heikki Linnakangas wrote:
On 25.09.2012 10:08, Heikki Linnakangas wrote:
On 24.09.2012 16:33, Amit Kapila wrote:
In any case, it will be better if you can split it into multiple
patches:
1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.
Ok, here you go. xlog-c-split-1.patch contains the refactoring of existing
code, with no user-visible changes.
streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and
contains the new functionality.
Please find the initial review of the patch below. More review is still pending,
but I thought I should post what has been done so far.
Basic stuff:
----------------------
- Patch applies OK
- Compiles cleanly with no warnings
- Regression tests pass.
- Documentation changes are mostly fine.
- Basic replication tests work.
Testing
---------
Start primary server
Start standby server
Start cascade standby server
Stop the primary server
Promote the standby server with ./pg_ctl -D data_repl promote
In postgresql.conf file
archive_mode = off
The following logs are observed in the cascade standby server.
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: record with zero length at 0/17E3888
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: fetching timeline history file for timeline 2 from primary server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
In postgresql.conf file
archive_mode = on
The following logs are observed in the cascade standby server.
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/000000010000000000000002: No such file or directory
LOG: record with zero length at 0/20144B8
sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/000000010000000000000002: No such file or directory
LOG: re-handshaking at position 0/2000000 on tli 1
LOG: fetching timeline history file for timeline 2 from primary server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/000000010000000000000002: No such file or directory
sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/000000010000000000000002: No such file or directory
LOG: re-handshaking at position 0/2000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
Verified that files are present in respective directories.
Code Review
----------------
1. In function readTimeLineHistory(),
two mechanisms are used to fetch the timeline from the history file:
+ sscanf(fline, "%u\t%X/%X", &tli, &switchpoint_hi, &switchpoint_lo);
+
+ /* expect a numeric timeline ID as first field of line */
+ tli = (TimeLineID) strtoul(ptr, &endptr, 0);
If we use the new mechanism, it will not be able to detect errors the way
the current code does.
2. In function readTimeLineHistory(),
+ fd = AllocateFile(path, "r");
+ if (fd == NULL)
+ {
+     if (errno != ENOENT)
+         ereport(FATAL,
+                 (errcode_for_file_access(),
+                  errmsg("could not open file \"%s\": %m", path)));
+     /* Not there, so assume no parents */
+     return list_make1_int((int) targetTLI);
+ }
This path still uses return list_make1_int((int) targetTLI);
3. Function timelineOfPointInHistory() should return the timeline of the recptr
passed to it.
a. Is it okay to decide which timeline an xlog record pointer belongs to
based on the pointer alone, given that different
timelines can contain the same xlog record pointer?
b. From the logic, it seems that it will return the timeline previous to the
timeline of the recptr passed.
For example, if timeline 3's switchpoint is equal to the recptr passed,
then it will return timeline 2.
4. In function writeTimeLineHistory(), the variable endTLI is never used.
5. The header comment of function writeTimeLineHistory() could explain the
XLogRecPtr switchpoint parameter.
6. @@ -6869,11 +5947,35 @@ StartupXLOG(void)
 */
 if (InArchiveRecovery)
 {
+    char reason[200];
+
+    /*
+     * Write comment to history file to explain why and where timeline
+     * changed. Comment varies according to the recovery target used.
+     */
+    if (recoveryTarget == RECOVERY_TARGET_XID)
+        snprintf(reason, sizeof(reason),
+                 "%s transaction %u",
+                 recoveryStopAfter ? "after" : "before",
+                 recoveryStopXid);
The comment above this line says the entry explains why and where the timeline
changed. However, the reason field only specifies the "where" part.
7. + * Returns the redo pointer of the "previous" checkpoint.
+GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli)
+{
+    if (InRedo)
+    {
+        LWLockAcquire(ControlFileLock, LW_SHARED);
+        *oldrecptr = ControlFile->checkPointCopy.redo;
+        *oldtli = ControlFile->checkPointCopy.ThisTimeLineID;
+        LWLockRelease(ControlFileLock);
+    }
a. In this function, is it required to take ControlFileLock? Earlier
there was no lock protecting this read
when it was called from RestoreArchivedFile, and I think at this point no
one else can modify these values.
However, for code consistency it makes sense to read the
control file values under a read lock wherever they are read.
b. As per your comment it should return the "previous" checkpoint,
however the function returns the values of the
latest checkpoint.
8. In function writeTimeLineHistoryFile(), wouldn't it be better to write
directly rather than doing pg_fsync() later,
as it is just a one-time write?
9. +XLogRecPtr
+timeLineSwitchPoint(XLogRecPtr startpoint, TimeLineID tli)
..
..
+ * starting point. This is because the client can legimately
The spelling of "legitimately" needs to be corrected.
10. +XLogRecPtr
+timeLineSwitchPoint(XLogRecPtr startpoint, TimeLineID tli)
..
..
+    if (tli < ThisTimeLineID)
+    {
+        if (!nexttle)
+            elog(ERROR, "could not find history entry for child of timeline %u", tli); /* shouldn't happen */
+    }
I don't understand the purpose of the above check. I think this situation
can occur
when this function gets called from StartReplication, because the tli
sent by a standby to the new
master will always be less than ThisTimeLineID, and it can be first in the list.
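To illustrate the concern in item 1 of the code review, here is a minimal standalone sketch of the two parsing styles. It is not the patch's code: parse_tli_strtoul() and parse_history_line() are hypothetical helpers, and it assumes the sscanf() call is the new mechanism. Both styles can detect a malformed line, but only if the caller checks endptr for strtoul() or the returned field count for sscanf().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* strtoul() style: endptr shows whether any digits were consumed at all. */
bool
parse_tli_strtoul(const char *line, uint32_t *tli)
{
    char *endptr;
    unsigned long val;

    errno = 0;
    val = strtoul(line, &endptr, 0);
    if (endptr == line || errno == ERANGE || val > UINT32_MAX)
        return false;           /* no number, or out of range */
    *tli = (uint32_t) val;
    return true;
}

/* sscanf() style: the return value is the number of fields converted. */
bool
parse_history_line(const char *line, uint32_t *tli,
                   uint32_t *hi, uint32_t *lo)
{
    unsigned int t, h, l;

    if (sscanf(line, "%u\t%X/%X", &t, &h, &l) != 3)
        return false;           /* fewer than three fields matched */
    *tli = t;
    *hi = h;
    *lo = l;
    return true;
}
```

If the sscanf() return value is ignored, a line such as "garbage" leaves tli untouched and the error goes unnoticed, which appears to be the situation the review comment describes.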
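The boundary question in item 3b can be made concrete with a small sketch. The TimelineStart struct and tli_of_point() below are hypothetical and only model the lookup; they are not the patch's timelineOfPointInHistory(). The flag selects whether a pointer exactly equal to a switchpoint is attributed to the new (child) timeline or to the old (parent) one. Regarding 3a, note that the answer is always relative to one particular history: the same pointer value can exist on a different timeline that is not part of this history.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: each timeline with the position where it started. */
typedef struct
{
    uint32_t tli;
    uint64_t begin;     /* switchpoint at which this timeline branched off */
} TimelineStart;

/*
 * Return the timeline that ptr belongs to within one history.
 * history[] is ordered oldest first and must contain at least one entry.
 */
uint32_t
tli_of_point(const TimelineStart *history, size_t n, uint64_t ptr,
             bool switchpoint_belongs_to_child)
{
    /* Walk from the newest timeline back towards the oldest. */
    for (size_t i = n; i-- > 1;)
    {
        bool reached = switchpoint_belongs_to_child
            ? history[i].begin <= ptr
            : history[i].begin < ptr;

        if (reached)
            return history[i].tli;
    }
    return history[0].tli;      /* before every switchpoint */
}
```

For a history { {1, 0}, {2, 0x1000}, {3, 0x2000} } and ptr = 0x2000, the first convention yields timeline 3 and the second yields timeline 2, which is the behaviour described in 3b. Which convention is right depends on whether the switchpoint is meant to be the last position of the old timeline or the first position of the new one.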
Documentation
---------------
1. In explanation of TIMELINE_HISTORY:
Filename of the timeline history file. This is always of the form
[insert correct example here].
Please give an example.
2. In the protocol.sgml change, I feel it would be better to explain when the
CopyDone message will be initiated.
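For documentation item 1, the file name form can be shown with a short sketch. The proposal mail gives 00000002.history as the history file of timeline 2, i.e. the timeline ID as eight uppercase hexadecimal digits followed by ".history". The helper name timeline_history_filename() is hypothetical and only illustrates that form.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format the history file name for a timeline:
 * eight uppercase hex digits followed by ".history".
 * Returns the value of snprintf(), so truncation can be detected.
 */
int
timeline_history_filename(char *buf, size_t buflen, uint32_t tli)
{
    return snprintf(buf, buflen, "%08X.history", (unsigned int) tli);
}
```

Calling it with tli = 2 produces 00000002.history, and tli = 26 produces 0000001A.history.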
With Regards,
Amit Kapila.