Synchronous Log Shipping Replication

Started by Fujii Masao over 17 years ago, 106 messages, hackers
#1Fujii Masao
masao.fujii@gmail.com

Hi,

At PGCon 2008, I proposed synchronous log shipping replication.
Sorry for the late posting, but I'd like to start the discussion
about its implementation now.
http://www.pgcon.org/2008/schedule/track/Horizontal%20Scaling/76.en.html

First of all, I'm not planning to put the prototype which I demoed
in PGCon into core directly.

- Portability issues (using message queue, multi-threaded ...)
- Have too much dependency on Heartbeat

Still, since the prototype is a useful reference implementation,
I plan to open it ASAP. But, I'm sorry, it will still take a month
to open it.

Pavan re-designed the sync replication based on the prototype
and I posted that design doc on wiki. Please check it if you
are interested in it.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

This design is too huge. In order to enhance the extensibility
of postgres, I'd like to divide the sync replication into
minimum hooks and some plugins and to develop it, respectively.
Plugins for the sync replication plan to be available at the
time of 8.4 release.

In my design, WAL sending is achieved as follows by WALSender.
WALSender is a new process which I introduce.

1) On COMMIT, backend requests WALSender to send WAL.
2) WALSender reads WAL from walbuffers and sends it to slave.
3) WALSender waits for the response from slave and replies
to backend.
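
To make the three steps above concrete, here is a rough sketch in Python.
It is an illustration only, not the prototype's code: a thread stands in
for the WALSender process, a bytearray for walbuffers, and the Standby
class for the slave. All names in it (Standby, WalSender, commit) are
hypothetical.

```python
import queue
import threading


class Standby:
    """Hypothetical stand-in for the slave: keeps what it receives."""

    def __init__(self):
        self.received = bytearray()

    def receive(self, data):
        self.received.extend(data)
        return len(self.received)  # position acknowledged back to the master


class WalSender(threading.Thread):
    """A thread standing in for the WALSender process."""

    def __init__(self, wal_buffers, standby):
        super().__init__(daemon=True)
        self.wal_buffers = wal_buffers
        self.standby = standby
        self.requests = queue.Queue()
        self.sent_upto = 0

    def run(self):
        while True:
            request = self.requests.get()
            if request is None:  # shutdown marker
                return
            upto, done = request
            if upto > self.sent_upto:
                # Step 2: read WAL from the buffers and send it to the slave.
                chunk = bytes(self.wal_buffers[self.sent_upto:upto])
                # Step 3 (first half): the return value models the response.
                self.sent_upto = self.standby.receive(chunk)
            # Step 3 (second half): reply to the waiting backend.
            done.set()


def commit(sender, wal_buffers, record):
    """What a backend does on COMMIT in this sketch."""
    wal_buffers.extend(record)
    upto = len(wal_buffers)
    done = threading.Event()
    sender.requests.put((upto, done))  # Step 1: request WAL sending
    done.wait()                        # sleep until the WALSender replies
    return upto


if __name__ == "__main__":
    standby = Standby()
    wal_buffers = bytearray()
    sender = WalSender(wal_buffers, standby)
    sender.start()
    commit(sender, wal_buffers, b"commit-1;")
    commit(sender, wal_buffers, b"commit-2;")
    print(bytes(standby.received))
    sender.requests.put(None)
    sender.join()
```

The backend blocks in commit() until the sender has handed the WAL to the
slave, which is the synchronous part of the design.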

I propose two hooks for WAL sending.

WAL-writing hook
----------------
This hook is for backend to communicate with WALSender.
WAL-writing hook intercepts write system call in XLogWrite.
That is, backend requests WAL sending whenever write is called.

WAL-writing hook is available also for other uses e.g.
Software RAID (writes WAL into two files for durability).
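
As an illustration of how such a hook might look, here is a small sketch
in Python rather than C. The names Hooks, xlog_write and make_mirror_hook
are made up for this example and are not actual PostgreSQL symbols. The
write path calls the hook, if one is installed, and then does its normal
local write; the mirror plugin corresponds to the software RAID case.

```python
import io


class Hooks:
    """Hypothetical hook registry; None means no plugin is installed."""
    wal_write = None


def xlog_write(wal_file, data):
    """Stand-in for the write call in XLogWrite."""
    if Hooks.wal_write is not None:
        Hooks.wal_write(data)      # the plugin sees every WAL write
    return wal_file.write(data)    # the normal local write still happens


def make_mirror_hook(mirror_file):
    """Plugin for the software RAID case: copy each write to a second file."""
    def hook(data):
        mirror_file.write(data)
    return hook


if __name__ == "__main__":
    primary, mirror = io.BytesIO(), io.BytesIO()
    Hooks.wal_write = make_mirror_hook(mirror)
    xlog_write(primary, b"wal-record")
    print(primary.getvalue() == mirror.getvalue())
    Hooks.wal_write = None
```

A WAL-sending plugin would install a different function in the same slot,
one that hands the data to WALSender instead of a second file.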

Hook for WALSender
------------------
This hook is for introducing WALSender. There are the following
three ideas of how to introduce WALSender. A required hook
differs by which idea is adopted.

a) Use WALWriter as WALSender

This idea needs WALWriter hook which intercepts WALWriter
literally. WALWriter stops the local WAL write and focuses on
WAL sending. This idea is very simple, but I don't think of
the use of WALWriter hook other than WAL sending.

b) Use new background process as WALSender

This idea needs background-process hook which enables users
to define new background processes. I think the design of this
hook resembles that of rmgr hook proposed by Simon. I define
the table like RmgrTable. It's for registering some functions
(e.g. main function and exit...) for operating a background
process. Postmaster calls the function from the table suitably,
and manages a start and end of background process. ISTM that
there are many uses in this hook, e.g. performance monitoring
process like statspack.
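
A rough sketch of the table idea in b), in Python for brevity. This is
only an assumption of how the registration could look: BgProcEntry,
register_bgproc and Postmaster are hypothetical names, and threads are
used here in place of forked processes.

```python
import threading
from dataclasses import dataclass
from typing import Callable


@dataclass
class BgProcEntry:
    """One row of the hypothetical background-process table."""
    name: str
    main: Callable[[threading.Event], None]  # runs until the stop event is set
    on_exit: Callable[[], None]              # called after main has returned


BG_PROC_TABLE = []


def register_bgproc(entry, table=BG_PROC_TABLE):
    table.append(entry)


class Postmaster:
    """Starts every registered entry and stops them all on shutdown."""

    def __init__(self, table):
        self.table = table
        self.stop = threading.Event()
        self.workers = []

    def start_all(self):
        for entry in self.table:
            worker = threading.Thread(
                target=entry.main, args=(self.stop,), name=entry.name)
            worker.start()
            self.workers.append((entry, worker))

    def shutdown(self):
        self.stop.set()
        for entry, worker in self.workers:
            worker.join()
            entry.on_exit()


if __name__ == "__main__":
    def walsender_main(stop):
        print("walsender running")
        stop.wait()

    register_bgproc(BgProcEntry("walsender", walsender_main,
                                lambda: print("walsender exited")))
    postmaster = Postmaster(BG_PROC_TABLE)
    postmaster.start_all()
    postmaster.shutdown()
```

A monitoring process like the statspack example would be one more entry
in the same table.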

c) Use one backend as WALSender

In this idea, slave calls the user-defined function which
takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()".
Compared with other ideas, it's easy to implement WALSender
because postmaster handles the establishment and authentication
of connection. But, this SQL causes a long transaction which
prevents vacuum. So, this idea needs idle-state hook which
executes plugin before transaction starts. I don't think of
the use of this hook other than WAL sending either.

Which idea should we adopt?

Comments welcome.

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#2Markus Wanner
markus@bluegap.ch
In reply to: Fujii Masao (#1)
Re: Synchronous Log Shipping Replication

Hi,

Fujii Masao wrote:

Pavan re-designed the sync replication based on the prototype
and I posted that design doc on wiki. Please check it if you
are interested in it.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

I've read that wiki page and allow myself to comment from a Postgres-R
developer's perspective ;-)

R1: "without ... any negative performance overhead"? For fully
synchronous replication, that's clearly not possible. I guess that
applies only for async WAL shipping.

NR3: who is supposed to do failure detection and manage automatic
failover? How does integration with such an additional tool work?

I got distracted by the SBY and ACT abbreviations. Why abbreviate
standby or active at all? It's not like we don't already have enough
three letter acronyms, but those stand for rather more complex terms
than single words.

Standby Bootstrap: "stopping the archiving at the ACT" doesn't prevent
overriding WAL files in pg_xlog. It just stops archiving a WAL file
before it gets overridden - which clearly doesn't solve the problem here.

How is communication done? "Serialization of WAL shipping" had better
not mean serialization on the network, i.e. the WAL Sender Process
should be able to await acknowledgment of multiple WAL packets in
parallel, otherwise the interconnect latency might turn into a
bottleneck. What happens if the link between
the active and standby goes down? Or if it's temporarily unavailable for
some time?

The IPC mechanism reminds me a lot of what I did for Postgres-R, which
also has a central "replication manager" process, which receives
changesets from multiple backends. I've implemented an internal
messaging mechanism based on shared memory and signals, using only
Postgres methods. It allows arbitrary processes to send messages to each
other by process id.

Moving the WAL Sender and WAL Receiver processes under the control of
the postmaster certainly sounds like a good thing. After all, those are
fiddling with Postgres internals.

This design is too huge. In order to enhance the extensibility
of postgres, I'd like to divide the sync replication into
minimum hooks and some plugins and to develop it, respectively.
Plugins for the sync replication plan to be available at the
time of 8.4 release.

Hooks again? I bet you all know by now, that my excitement for hooks has
always been pretty narrow. ;-)

In my design, WAL sending is achieved as follows by WALSender.
WALSender is a new process which I introduce.

1) On COMMIT, backend requests WALSender to send WAL.
2) WALSender reads WAL from walbuffers and sends it to slave.
3) WALSender waits for the response from slave and replies
to backend.

I propose two hooks for WAL sending.

WAL-writing hook
----------------
This hook is for backend to communicate with WALSender.
WAL-writing hook intercepts write system call in XLogWrite.
That is, backend requests WAL sending whenever write is called.

WAL-writing hook is available also for other uses e.g.
Software RAID (writes WAL into two files for durability).

Hook for WALSender
------------------
This hook is for introducing WALSender. There are the following
three ideas of how to introduce WALSender. A required hook
differs by which idea is adopted.

a) Use WALWriter as WALSender

This idea needs WALWriter hook which intercepts WALWriter
literally. WALWriter stops the local WAL write and focuses on
WAL sending. This idea is very simple, but I don't think of
the use of WALWriter hook other than WAL sending.

b) Use new background process as WALSender

This idea needs background-process hook which enables users
to define new background processes. I think the design of this
hook resembles that of rmgr hook proposed by Simon. I define
the table like RmgrTable. It's for registering some functions
(e.g. main function and exit...) for operating a background
process. Postmaster calls the function from the table suitably,
and manages a start and end of background process. ISTM that
there are many uses in this hook, e.g. performance monitoring
process like statspack.

c) Use one backend as WALSender

In this idea, slave calls the user-defined function which
takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()".
Compared with other ideas, it's easy to implement WALSender
because postmaster handles the establishment and authentication
of connection. But, this SQL causes a long transaction which
prevents vacuum. So, this idea needs idle-state hook which
executes plugin before transaction starts. I don't think of
the use of this hook other than WAL sending either.

The above cited wiki page sounds like you've already decided for b).

I'm unclear on what you want hooks for. If additional processes get
integrated into Postgres, those certainly need to get integrated very
much like we integrated other auxiliary processes. I wouldn't call that
'hooking', but YMMV.

Regards

Markus Wanner

#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#1)
Re: Synchronous Log Shipping Replication

On Fri, 2008-09-05 at 23:21 +0900, Fujii Masao wrote:

Pavan re-designed the sync replication based on the prototype
and I posted that design doc on wiki. Please check it if you
are interested in it.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

It's good to see the detailed design, many thanks.

I will begin looking at technical details next week.

This design is too huge. In order to enhance the extensibility
of postgres, I'd like to divide the sync replication into
minimum hooks and some plugins and to develop it, respectively.
Plugins for the sync replication plan to be available at the
time of 8.4 release.

What is Core's commentary on this plan?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#4Bruce Momjian
bruce@momjian.us
In reply to: Markus Wanner (#2)
Re: Synchronous Log Shipping Replication

Markus Wanner wrote:

Hook for WALSender
------------------
This hook is for introducing WALSender. There are the following
three ideas of how to introduce WALSender. A required hook
differs by which idea is adopted.

a) Use WALWriter as WALSender

This idea needs WALWriter hook which intercepts WALWriter
literally. WALWriter stops the local WAL write and focuses on
WAL sending. This idea is very simple, but I don't think of
the use of WALWriter hook other than WAL sending.

The problem with this approach is that you are not sending WAL to the
disk _while_ you are, in parallel, sending WAL to the slave; I think
this is useful for performance reasons in synchronous replication.

b) Use new background process as WALSender

This idea needs background-process hook which enables users
to define new background processes. I think the design of this
hook resembles that of rmgr hook proposed by Simon. I define
the table like RmgrTable. It's for registering some functions
(e.g. main function and exit...) for operating a background
process. Postmaster calls the function from the table suitably,
and manages a start and end of background process. ISTM that
there are many uses in this hook, e.g. performance monitoring
process like statspack.

I think starting/stopping a process for each WAL send is too much
overhead.

c) Use one backend as WALSender

In this idea, slave calls the user-defined function which
takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()".
Compared with other ideas, it's easy to implement WALSender
because postmaster handles the establishment and authentication
of connection. But, this SQL causes a long transaction which
prevents vacuum. So, this idea needs idle-state hook which
executes plugin before transaction starts. I don't think of
the use of this hook other than WAL sending either.

The above cited wiki page sounds like you've already decided for b).

I assumed that there would be a background process like bgwriter that
would be notified during a commit and send the appropriate WAL files to
the slave.

I'm unclear on what you want hooks for. If additional processes get
integrated into Postgres, those certainly need to get integrated very
much like we integrated other auxiliary processes. I wouldn't call that
'hooking', but YMMV.

Yea, I am unclear how this is going to work using simple hooks.

It sounds like Fujii-san is basically saying they can only get the hooks
done for 8.4, not the actual solution. But, as I said above, I am
unclear how a hook solution would even work long-term; I am afraid it
would be thrown away once an integrated solution was developed.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#4)
Re: Synchronous Log Shipping Replication

On Sat, 2008-09-06 at 22:09 -0400, Bruce Momjian wrote:

Markus Wanner wrote:

Hook for WALSender
------------------
This hook is for introducing WALSender. There are the following
three ideas of how to introduce WALSender. A required hook
differs by which idea is adopted.

a) Use WALWriter as WALSender

This idea needs WALWriter hook which intercepts WALWriter
literally. WALWriter stops the local WAL write and focuses on
WAL sending. This idea is very simple, but I don't think of
the use of WALWriter hook other than WAL sending.

The problem with this approach is that you are not sending WAL to the
disk _while_ you are, in parallel, sending WAL to the slave; I think
this is useful for performance reasons in synchronous replication.

Agreed

b) Use new background process as WALSender

This idea needs background-process hook which enables users
to define new background processes. I think the design of this
hook resembles that of rmgr hook proposed by Simon. I define
the table like RmgrTable. It's for registering some functions
(e.g. main function and exit...) for operating a background
process. Postmaster calls the function from the table suitably,
and manages a start and end of background process. ISTM that
there are many uses in this hook, e.g. performance monitoring
process like statspack.

I think starting/stopping a process for each WAL send is too much
overhead.

I would agree with that, but I don't think that was being suggested,
was it? See later.

c) Use one backend as WALSender

In this idea, slave calls the user-defined function which
takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()".
Compared with other ideas, it's easy to implement WALSender
because postmaster handles the establishment and authentication
of connection. But, this SQL causes a long transaction which
prevents vacuum. So, this idea needs idle-state hook which
executes plugin before transaction starts. I don't think of
the use of this hook other than WAL sending either.

The above cited wiki page sounds like you've already decided for b).

I assumed that there would be a background process like bgwriter that
would be notified during a commit and send the appropriate WAL files to
the slave.

ISTM that this last paragraph is actually what was meant by option b).

I think it would work the other way around though, the WALSender would
send continuously and backends may choose to wait for it to reach a
certain LSN, or not. WALWriter really should work this way too.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#1)
Re: Synchronous Log Shipping Replication

On Fri, 2008-09-05 at 23:21 +0900, Fujii Masao wrote:

b) Use new background process as WALSender

This idea needs background-process hook which enables users
to define new background processes. I think the design of this
hook resembles that of rmgr hook proposed by Simon. I define
the table like RmgrTable. It's for registering some functions
(e.g. main function and exit...) for operating a background
process. Postmaster calls the function from the table suitably,
and manages a start and end of background process. ISTM that
there are many uses in this hook, e.g. performance monitoring
process like statspack.

Sorry, but the comparison with the rmgr hook is mistaken. The rmgr hook
exists only within the Startup process and I go to some lengths to
ensure it is never called in normal backends. So it has got absolutely
nothing to do with generating WAL messages (existing/new/modified) or
sending them since it doesn't even exist during normal processing.

The intention of the rmgr hook is to allow WAL messages to be
manipulated in new ways in recovery mode. It isn't a sufficient change
to implement replication, and the functionality is orthogonal to
streaming WAL replication.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#4)
Re: Synchronous Log Shipping Replication

On Sat, 2008-09-06 at 22:09 -0400, Bruce Momjian wrote:

I'm unclear on what you want hooks for. If additional processes get
integrated into Postgres, those certainly need to get integrated very
much like we integrated other auxiliary processes. I wouldn't call that
'hooking', but YMMV.

Yea, I am unclear how this is going to work using simple hooks.

It sounds like Fujii-san is basically saying they can only get the hooks
done for 8.4, not the actual solution. But, as I said above, I am
unclear how a hook solution would even work long-term; I am afraid it
would be thrown away once an integrated solution was developed.

It will be interesting to have various hooks in streaming WAL code to
implement various additional features for enterprise integration.

But that doesn't mean I support hooks in every/all places.

For me, the proposed hook amounts to "we've only got time to implement
2/3 of the required features, so we'd like to circumvent the release
cycle by putting in a hook and providing the code later". For me, hooks
are for adding additional features, not for making up for the lack of
completed code. It's kinda hard to say "we now have WAL streaming"
without the streaming bit. We need either a fully working WAL streaming
feature, or we wait until next release.

We probably need to ask if there is anybody willing to complete the
middle part of this feature so we can get it into 8.4. It would be
sensible to share the code we have now, so we can see what remains to be
implemented. I just committed to delivering Hot Standby for 8.4, so I
can't now get involved to deliver this code.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#8ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Bruce Momjian (#4)
Re: Synchronous Log Shipping Replication

Bruce Momjian <bruce@momjian.us> wrote:

b) Use new background process as WALSender

This idea needs background-process hook which enables users
to define new background processes

I think starting/stopping a process for each WAL send is too much
overhead.

Yes, of course slow. But I guess it is the only way to share one socket
in all backends. Postgres is not a multi-threaded architecture,
so each backend should use dedicated connections to send WAL buffers.
300 backends require 300 connections for each slave... it's not good at all.

It sounds like Fujii-san is basically saying they can only get the hooks
done for 8.4, not the actual solution.

No! He has an actual solution in his prototype ;-)
It is very similar to b) and the overhead was not so bad.
It's not so clean to be a part of postgres, though.

Is there any better idea to share one socket connection between
backends (and bgwriter)? The connections could be established after
fork() from postmaster, and number of them could be two or more.
This is one of the most complicated parts of synchronous log shipping.
Switching-processes approach like b) is just one idea for it.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#9Markus Wanner
markus@bluegap.ch
In reply to: ITAGAKI Takahiro (#8)
Re: Synchronous Log Shipping Replication

Hi,

ITAGAKI Takahiro wrote:

Is there any better idea to share one socket connection between
backends (and bgwriter)? The connections could be established after
fork() from postmaster, and number of them could be two or more.
This is one of the most complicated parts of synchronous log shipping.
Switching-processes approach like b) is just one idea for it.

I fear I'm repeating myself, but I've had the same problem for
Postgres-R and solved it with an internal message passing infrastructure
which I've simply called imessages. It requires only standard Postgres
shared memory, signals and locking and should thus be pretty portable.

In simple benchmarks, it's not quite as efficient as unix pipes, but
doesn't require as many file descriptors, is independent of the
parent-child relations of processes, maintains message borders and it is
more portable (I hope). It could certainly be improved WRT efficiency
and could theoretically even beat Unix pipes, because it involves less
copying of data and less syscalls.
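
To give a rough idea of the shape of such a mechanism, here is a small
sketch in Python. It is not the Postgres-R code: IMessageBus and its
methods are hypothetical names, a dict of queues stands in for shared
memory, a threading.Lock for Postgres locking, and a threading.Event for
the signal that wakes the recipient. It assumes each recipient id has a
single reader.

```python
import threading
from collections import defaultdict, deque


class IMessageBus:
    """Hypothetical sketch: messages addressed by recipient id."""

    def __init__(self):
        self._lock = threading.Lock()                 # stands in for locking
        self._queues = defaultdict(deque)             # stands in for shmem
        self._wakeups = defaultdict(threading.Event)  # stands in for signals

    def send(self, sender_id, recipient_id, payload):
        with self._lock:
            # Each message is stored whole, so message borders are kept.
            self._queues[recipient_id].append((sender_id, payload))
            self._wakeups[recipient_id].set()         # "signal" the recipient

    def receive(self, recipient_id, timeout=None):
        with self._lock:
            wakeup = self._wakeups[recipient_id]
        if not wakeup.wait(timeout):
            return None                               # nothing arrived in time
        with self._lock:
            pending = self._queues[recipient_id]
            if not pending:
                return None
            message = pending.popleft()
            if not pending:
                wakeup.clear()                        # queue drained
            return message


if __name__ == "__main__":
    bus = IMessageBus()
    bus.send(101, 200, b"changeset-1")
    bus.send(102, 200, b"changeset-2")
    print(bus.receive(200, timeout=1.0))
    print(bus.receive(200, timeout=1.0))
```

Any process that knows the recipient's id can send to it, regardless of
which process forked which.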

It has not been reviewed nor commented much. I'd still appreciate that.

Regards

Markus Wanner

#10ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Markus Wanner (#9)
Re: Synchronous Log Shipping Replication

Markus Wanner <markus@bluegap.ch> wrote:

ITAGAKI Takahiro wrote:

Is there any better idea to share one socket connection between
backends (and bgwriter)?

I fear I'm repeating myself, but I've had the same problem for
Postgres-R and solved it with an internal message passing infrastructure
which I've simply called imessages. It requires only standard Postgres
shared memory, signals and locking and should thus be pretty portable.

Imessages serves as a useful reference, but it is only one of the detailed
parts of the issue. I can break down the issue into three parts:

1. Is process-switching approach the best way to share one socket?
Both Postgres-R and the log-shipping prototype use the approach now.
Can I think there is no objection here?

2. If 1 is reasonable, how should we add a new WAL sender process?
Just add a new process using a core-patch?
Merge into WAL writer?
Consider framework to add any of user-defined auxiliary process?

3. If 1 is reasonable, what should we use for the process-switching
primitive?
Postgres-R uses signals and locking and the log-shipping prototype
uses multi-threads and POSIX message queues now.

Signals and locking is possible choice for 3, but I want to use better
approach if any. Faster is always better.

I guess we could invent a new semaphore-like primitive at the same layer
as LWLocks using spinlock and PGPROC directly...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#11Markus Wanner
markus@bluegap.ch
In reply to: ITAGAKI Takahiro (#10)
Re: Synchronous Log Shipping Replication

Hi,

ITAGAKI Takahiro wrote:

1. Is process-switching approach the best way to share one socket?
Both Postgres-R and the log-shipping prototype use the approach now.
Can I think there is no objection here?

I don't see any appealing alternative. The postmaster certainly
shouldn't need to worry with any such socket for replication. Threading
falls pretty flat for Postgres. So the socket must be held by one of the
child processes of the Postmaster.

2. If 1 is reasonable, how should we add a new WAL sender process?
Just add a new process using a core-patch?

Seems feasible to me, yes.

Merge into WAL writer?

Uh.. that would mean you'd lose parallelism between WAL writing to disk
and WAL shipping via network. That does not sound appealing to me.

Consider framework to add any of user-defined auxiliary process?

What for? What do you miss in the existing framework?

3. If 1 is reasonable, what should we use for the process-switching
primitive?
Postgres-R uses signals and locking and the log-shipping prototype
uses multi-threads and POSIX message queues now.

AFAIK message queues are problematic WRT portability. At least Postgres
doesn't currently use them and introducing dependencies on those might
lead to problems, but I'm not sure. Others certainly know more about
issues involved.

A multi-threaded approach is certainly out of bounds, at least within
the Postgres core code.

Signals and locking is possible choice for 3, but I want to use better
approach if any. Faster is always better.

I think the approach can reach better throughput than POSIX message
queues or unix pipes, because of the mentioned savings in copying around
between system and application memory. However, that hasn't been proved,
yet.

I guess we could invent a new semaphore-like primitive at the same layer
as LWLocks using spinlock and PGPROC directly...

Sure, but in what way would that differ from what I do with imessages?

Regards

Markus Wanner

#12Fujii Masao
masao.fujii@gmail.com
In reply to: Markus Wanner (#11)
Re: Synchronous Log Shipping Replication

On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote:

Merge into WAL writer?

Uh.. that would mean you'd lose parallelism between WAL writing to disk and
WAL shipping via network. That does not sound appealing to me.

That depends on the order of WAL writing and WAL shipping.
How about the following order?

1. A backend writes WAL to disk.
2. The backend wakes up WAL sender process and sleeps.
3. WAL sender process does WAL shipping and wakes up the backend.
4. The backend issues sync command.

I guess we could invent a new semaphore-like primitive at the same layer
as LWLocks using spinlock and PGPROC directly...

Sure, but in what way would that differ from what I do with imessages?

Performance ;)

The timing of the process's receiving a signal is dependent on the scheduler
of kernel. The scheduler does not always handle a signal immediately.

Regards

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#13Markus Wanner
markus@bluegap.ch
In reply to: Fujii Masao (#12)
Re: Synchronous Log Shipping Replication

Hi,

Fujii Masao wrote:

1. A backend writes WAL to disk.
2. The backend wakes up WAL sender process and sleeps.
3. WAL sender process does WAL shipping and wakes up the backend.
4. The backend issues sync command.

Right, that would work. But still, the WAL writer process would block
during writing WAL blocks.

Are there compelling reasons for using the existing WAL writer process,
as opposed to introducing a new process?

The timing of the process's receiving a signal is dependent on the scheduler
of kernel.

Sure, so are pipes or shmem queues.

The scheduler does not always handle a signal immediately.

What exactly are you proposing to use instead of signals? Semaphores are
pretty inconvenient when trying to wake up arbitrary processes or in
conjunction with listening on sockets via select(), for example.

See src/backend/replication/manager.c from Postgres-R for a working
implementation of such a process using select() and signaling.
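
For readers without the Postgres-R tree at hand, here is a minimal sketch
of the general pattern of one process waiting on the network and on local
wakeups at the same time. It is not the manager.c code, and it swaps in a
socket pair for the signal (a self-pipe style wakeup) so that the wakeup
shows up in select() like any other readable descriptor. The function
name wait_for_event is hypothetical.

```python
import select
import socket


def wait_for_event(standby_sock, wakeup_sock, timeout=1.0):
    """Block until the standby or a local backend has something for us."""
    readable, _, _ = select.select([standby_sock, wakeup_sock], [], [], timeout)
    events = []
    if wakeup_sock in readable:
        wakeup_sock.recv(64)  # drain the wakeup byte(s)
        events.append("backend wakeup")
    if standby_sock in readable:
        events.append(("standby", standby_sock.recv(4096)))
    return events


if __name__ == "__main__":
    # Socket pairs stand in for the standby link and the local wakeup channel.
    net_local, net_remote = socket.socketpair()
    wake_read, wake_write = socket.socketpair()

    wake_write.send(b"x")          # a backend asks for WAL to be sent
    print(wait_for_event(net_local, wake_read))

    net_remote.send(b"ack 42")     # the standby acknowledges
    print(wait_for_event(net_local, wake_read))

    for sock in (net_local, net_remote, wake_read, wake_write):
        sock.close()
```

A semaphore could not be added to the select() set this way, which is the
inconvenience mentioned above.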

Regards

Markus Wanner

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: ITAGAKI Takahiro (#8)
Re: Synchronous Log Shipping Replication

On Mon, 2008-09-08 at 19:19 +0900, ITAGAKI Takahiro wrote:

Bruce Momjian <bruce@momjian.us> wrote:

b) Use new background process as WALSender

This idea needs background-process hook which enables users
to define new background processes

I think starting/stopping a process for each WAL send is too much
overhead.

Yes, of course slow. But I guess it is the only way to share one socket
in all backends. Postgres is not a multi-threaded architecture,
so each backend should use dedicated connections to send WAL buffers.
300 backends require 300 connections for each slave... it's not good at all.

So... don't have individual backends do the sending. Have them wait
while somebody else does it for them.

It sounds like Fujii-san is basically saying they can only get the hooks
done for 8.4, not the actual solution.

No! He has an actual solution in his prototype ;-)

The usual thing if you have a WIP patch you're not sure of is to post
the patch for feedback.

If you guys aren't going to post any code to the project then I'm not
clear why it's being discussed here. Is this a community project or a
private project?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#15Bruce Momjian
bruce@momjian.us
In reply to: Fujii Masao (#12)
Re: Synchronous Log Shipping Replication

Fujii Masao wrote:

On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote:

Merge into WAL writer?

Uh.. that would mean you'd lose parallelism between WAL writing to disk and
WAL shipping via network. That does not sound appealing to me.

That depends on the order of WAL writing and WAL shipping.
How about the following order?

1. A backend writes WAL to disk.
2. The backend wakes up WAL sender process and sleeps.
3. WAL sender process does WAL shipping and wakes up the backend.
4. The backend issues sync command.

I am confused why this is considered so complicated. Having individual
backends doing the wal transfer to the slave is never going to work
well.

I figured we would have a single WAL streamer that continues advancing
forward in the WAL file, streaming to the standby. Backends would
update a shared memory variable specifying how far they want the wal
streamer to advance and send a signal to the wal streamer if necessary.
Backends would monitor another shared memory variable that specifies how
far the wal streamer has advanced.
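
A minimal sketch of that scheme, in Python for brevity. It is only an
illustration under assumptions: threads stand in for processes, a
threading.Condition stands in for the shared memory plus signal, WAL
positions are plain byte offsets, and the names WalStreamState,
wal_streamer and backend_commit are hypothetical.

```python
import threading


class WalStreamState:
    """The two shared variables, guarded by one condition variable."""

    def __init__(self):
        self.cond = threading.Condition()
        self.requested_lsn = 0   # how far backends want the streamer to go
        self.streamed_lsn = 0    # how far the streamer has advanced
        self.shutdown = False


def wal_streamer(state, wal, send):
    """Single streamer: keeps advancing through the WAL, sending as it goes."""
    while True:
        with state.cond:
            state.cond.wait_for(
                lambda: state.shutdown
                or state.requested_lsn > state.streamed_lsn)
            if state.requested_lsn <= state.streamed_lsn:
                return  # only reachable on shutdown with nothing pending
            start, end = state.streamed_lsn, state.requested_lsn
        send(bytes(wal[start:end]))  # network I/O happens outside the lock
        with state.cond:
            state.streamed_lsn = end
            state.cond.notify_all()  # wake backends watching streamed_lsn


def backend_commit(state, wal, record):
    """Backend: append WAL, raise the request, wait for the streamer."""
    with state.cond:
        wal.extend(record)
        my_lsn = len(wal)
        state.requested_lsn = max(state.requested_lsn, my_lsn)
        state.cond.notify_all()      # the "signal" to the streamer
        state.cond.wait_for(lambda: state.streamed_lsn >= my_lsn)
    return my_lsn


if __name__ == "__main__":
    state, wal, sent = WalStreamState(), bytearray(), []
    streamer = threading.Thread(target=wal_streamer,
                                args=(state, wal, sent.append))
    streamer.start()
    backends = [threading.Thread(target=backend_commit,
                                 args=(state, wal, b"commit;"))
                for _ in range(3)]
    for b in backends:
        b.start()
    for b in backends:
        b.join()
    with state.cond:
        state.shutdown = True
        state.cond.notify_all()
    streamer.join()
    print(b"".join(sent) == bytes(wal))
```

Because the streamer sends everything up to the highest requested
position, several backends' commits can go out in one send.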

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#16Markus Wanner
markus@bluegap.ch
In reply to: Bruce Momjian (#15)
Re: Synchronous Log Shipping Replication

Hi,

Bruce Momjian wrote:

Backends would
update a shared memory variable specifying how far they want the wal
streamer to advance and send a signal to the wal streamer if necessary.
Backends would monitor another shared memory variable that specifies how
far the wal streamer has advanced.

That sounds like WAL needs to be written to disk, before it can be sent
to the standby. Except maybe with some sort of mmap'ing the WAL.

Regards

Markus Wanner

#17Bruce Momjian
bruce@momjian.us
In reply to: Markus Wanner (#16)
Re: Synchronous Log Shipping Replication

Markus Wanner wrote:

Hi,

Bruce Momjian wrote:

Backends would
update a shared memory variable specifying how far they want the wal
streamer to advance and send a signal to the wal streamer if necessary.
Backends would monitor another shared memory variable that specifies how
far the wal streamer has advanced.

That sounds like WAL needs to be written to disk, before it can be sent
to the standby. Except maybe with some sort of mmap'ing the WAL.

Well, WAL is either on disk or in the wal_buffers in shared memory ---
in either case, a WAL streamer can get to it.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#18Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#1)
Re: Synchronous Log Shipping Replication

Hi,

I looked at the comments on the synchronous replication and understood
that the consensus of the community was that the sync replication should be
added using not hooks and plug-ins but core-patches. If my understanding
is right, I will change my development plan so that the sync replication
may be put into core.

But, I don't think every feature should be put into core. Of course, the
high-availability features (like clustering, automatic failover, ...etc) are
outside of postgres. The user who wants a whole HA solution using the sync
replication must integrate postgres and clustering software like heartbeat.

WAL sending should be put into core. But, I'd like to separate WAL
receiving from core and provide it as a new contrib tool. Because,
there are some users who use the sync replication as only WAL
streaming. They don't want to start postgres on the slave. Of course,
the slave can replay WAL by using pg_standby and WAL receiver tool
which I'd like to provide as a new contrib tool. I think the patch against
recovery code is not necessary.

I have arranged the development items below:

1) Patch around XLogWrite.
It enables a backend to wake up the WAL sender process at the
timing of COMMIT.

2) Patch for the communication between a backend and WAL
sender process.
There were some discussions about this topic. Now, I decided to
adopt imessages proposed by Markus.

3) Patch introducing a new background process, which I've called
WALSender. It takes charge of sending WAL to the slave.

For now, I assume that WALSender also listens for connections from
the slave, i.e. only one sender process manages multiple slaves.
The relation between WALSender and backend is 1:1. So,
the communication mechanism between them can be simple.
As another idea, I could introduce a new listener process and fork a new
WALSender for every slave. Which architecture is better? Or
should postmaster also listen for connections from the slave?

4) A new contrib tool, which I've called WALReceiver. It takes charge
of receiving WAL from the master and writing it to disk on the slave.

I will submit these patches and the tool by the Nov Commit Fest at the latest.

Any comments welcome!

best regards

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#19ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Fujii Masao (#18)
Re: Synchronous Log Shipping Replication

"Fujii Masao" <masao.fujii@gmail.com> wrote:

3) Patch introducing a new background process, which I've called
WALSender. It takes charge of sending WAL to the slave.

For now, I assume that WALSender also listens for connections from
the slave, i.e. only one sender process manages multiple slaves.

The relation between WALSender and backend is 1:1. So,
the communication mechanism between them can be simple.

I assume he means that only one backend communicates with the WAL sender
at a time. The communication is done while WALWriteLock is held,
so other backends wait for the communicating backend on WALWriteLock.
The WAL sender only needs to send one signal each time it sends WAL
buffers to the slave.

We could split the LWLock into WALWriterLock and WALSenderLock,
but the essential point is the same.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#15)
Re: Synchronous Log Shipping Replication

On Mon, 2008-09-08 at 17:40 -0400, Bruce Momjian wrote:

Fujii Masao wrote:

On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote:

Merge into WAL writer?

Uh.. that would mean you'd lose parallelism between WAL writing to disk and
WAL shipping via network. That does not sound appealing to me.

That depends on the order of WAL writing and WAL shipping.
How about the following order?

1. A backend writes WAL to disk.
2. The backend wakes up WAL sender process and sleeps.
3. WAL sender process does WAL shipping and wakes up the backend.
4. The backend issues sync command.

I am confused why this is considered so complicated. Having individual
backends doing the wal transfer to the slave is never going to work
well.

Agreed.

I figured we would have a single WAL streamer that continues advancing
forward in the WAL file, streaming to the standby. Backends would
update a shared memory variable specifying how far they want the wal
streamer to advance and send a signal to the wal streamer if necessary.
Backends would monitor another shared memory variable that specifies how
far the wal streamer has advanced.

Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for
the send operation. The Write and Send operations can then continue
independently of one another. XLogInsert() cannot advance to a new page
while we are waiting to send or write. Notice that the Send process
might be the bottleneck - that is the price of synchronous replication.
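
To make the request/result idea concrete, here is a minimal sketch in C. It is not PostgreSQL code: the type and function names (`Lsn`, `WalProgressPair`, `request_advance`, `complete_pending`) are invented for illustration, an LSN is flattened to a single 64-bit integer, and the locking that a real shared-memory structure would need is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical flat LSN, not PostgreSQL's XLogRecPtr */

/* One request/result pair per operation (Write or Send). */
typedef struct {
    Lsn request;        /* how far backends want the operation to reach */
    Lsn result;         /* how far the operation has actually reached */
} WalProgressPair;

/* Both pairs live side by side so Write and Send can advance independently. */
typedef struct {
    WalProgressPair write;
    WalProgressPair send;
} WalSharedState;

/*
 * Called by a backend: raise the request pointer (it never moves backwards)
 * and report whether the dedicated process still has work to do, i.e.
 * whether it should be signalled.
 */
static bool request_advance(WalProgressPair *p, Lsn upto)
{
    if (upto > p->request)
        p->request = upto;
    return p->result < p->request;
}

/*
 * Called by the dedicated process after it has written or sent everything
 * up to the current request. Returns how much progress was made.
 */
static Lsn complete_pending(WalProgressPair *p)
{
    Lsn advanced = p->request - p->result;
    p->result = p->request;
    return advanced;
}
```

In this sketch a backend would call `request_advance()` on `write`, on `send`, or on both, and then watch the matching `result` field. Completing the Write pair leaves the Send pair untouched, which is the independence described above.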

Backends then wait
* not at all for asynch commit
* just for Write for local synch commit
* for both Write and Send for remote synch commit
(various additional options for what happens to confirm Send)
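
The three wait rules can be written as a single predicate. The following is only a sketch under assumptions: `CommitMode`, `WalProgress` and `commit_may_return` are hypothetical names, and an LSN is again treated as a plain 64-bit integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical flat LSN */

typedef enum {
    COMMIT_ASYNC,        /* do not wait at all */
    COMMIT_LOCAL_SYNC,   /* wait for the local Write only */
    COMMIT_REMOTE_SYNC   /* wait for both Write and Send */
} CommitMode;

typedef struct {
    Lsn write_result;    /* how far WAL has been written locally */
    Lsn send_result;     /* how far WAL has been sent to the standby */
} WalProgress;

/* True when a backend committing at commit_lsn no longer has to wait. */
static bool commit_may_return(CommitMode mode, Lsn commit_lsn,
                              const WalProgress *p)
{
    switch (mode) {
    case COMMIT_ASYNC:
        return true;
    case COMMIT_LOCAL_SYNC:
        return p->write_result >= commit_lsn;
    case COMMIT_REMOTE_SYNC:
        return p->write_result >= commit_lsn &&
               p->send_result >= commit_lsn;
    }
    return false;        /* unreachable for valid modes */
}
```

What "sent" means for `send_result` (handed to the network, received by the standby, or flushed there) is exactly the set of additional options mentioned above, and the sketch does not choose among them.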

So normal backends neither write nor send. We have two dedicated
processes, one for write, one for send. We need to put an extra test
into WALWriter loop so that it will continue immediately (with no wait)
if there is an outstanding request for synchronous operation.

This gives us the Group Commit feature also, even if we are not using
replication. So we can drop the commit_delay stuff.

XLogBackgroundFlush() processes data a page at a time if it can. That may
not be the correct batch size for XLogBackgroundSend(), so we may need a
tunable for the MTU. Under heavy load we need the Write and Send to act
in a way to maximise throughput rather than minimise response time, as
we do now.

If wal_buffers overflows, we continue to hold WALInsertLock while we
wait for WALWriter and WALSender to complete.

We should increase default wal_buffers to 64.

After (or during) XLogInsert backends will sleep in a proc queue,
similar to LWLocks and protected by a spinlock. When preparing to
write/send the WAL process should read the proc at the *tail* of the
queue to see what the next LogwrtRqst should be. Then it performs its
action and wakes procs up starting with the head of the queue. We would
add LSN into PGPROC, so WAL processes can check whether the backend
should be woken. The LSN field can be accessed without spinlocks since
it is only ever set by the backend itself and only read while a backend
is sleeping. So we access spinlock, find tail, drop spinlock then read
LSN of the backend that (was) the tail.
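
The queue discipline can be sketched as follows. This is a single-threaded illustration, not PostgreSQL code: `Waiter` stands in for the LSN field that would be added to PGPROC, the spinlock is omitted, and the sketch assumes that backends join the queue in increasing LSN order, so the tail always holds the largest LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical flat LSN */

/* Stand-in for the per-backend state that would live in PGPROC. */
typedef struct Waiter {
    Lsn            wait_lsn;    /* set only by the backend itself */
    bool           sleeping;
    struct Waiter *next;
} Waiter;

typedef struct {
    Waiter *head;               /* oldest waiter, woken first */
    Waiter *tail;               /* newest waiter, defines the next request */
} WaitQueue;

/* A backend joins the queue after inserting its WAL record. */
static void queue_append(WaitQueue *q, Waiter *w)
{
    w->sleeping = true;
    w->next = NULL;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

/* The WAL process reads the tail to learn how far it should write or send. */
static Lsn next_request(const WaitQueue *q)
{
    return q->tail ? q->tail->wait_lsn : 0;
}

/* After the action completes up to 'done', wake waiters from the head. */
static int wake_up_to(WaitQueue *q, Lsn done)
{
    int woken = 0;
    while (q->head && q->head->wait_lsn <= done) {
        Waiter *w = q->head;
        q->head = w->next;
        if (q->head == NULL)
            q->tail = NULL;
        w->next = NULL;
        w->sleeping = false;
        woken++;
    }
    return woken;
}
```

A WAL process would call `next_request()`, perform its write or send up to that LSN, and then call `wake_up_to()` with the LSN it actually reached. Backends whose LSN lies beyond that point stay queued for the next round.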

Another thought occurs that we might measure the time a Send takes and
specify a limit on how long we are prepared to wait for confirmation.
Limit=0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit.
This would give better user behaviour across a highly variable network
connection.
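
One possible reading of the limit, as a sketch only: the names (`SendWait`, `send_wait_state`) and the millisecond unit are assumptions, and what the backend does after giving up is left open, just as it is above.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    SEND_WAIT_CONTINUE,     /* keep waiting for the standby */
    SEND_WAIT_CONFIRMED,    /* standby confirmed within the limit */
    SEND_WAIT_GAVE_UP       /* limit reached, proceed without confirmation */
} SendWait;

/*
 * Decide what a backend waiting on Send should do.
 * limit_ms == 0 means asynchronous: never wait for confirmation.
 * limit_ms  > 0 means synchronous up to the limit.
 */
static SendWait send_wait_state(unsigned limit_ms, unsigned elapsed_ms,
                                bool confirmed)
{
    if (confirmed)
        return SEND_WAIT_CONFIRMED;
    if (elapsed_ms >= limit_ms)   /* always true when limit_ms == 0 */
        return SEND_WAIT_GAVE_UP;
    return SEND_WAIT_CONTINUE;
}
```

With `limit_ms` set to 0 an unconfirmed Send never blocks the backend, which matches the asynchronous case. Any positive limit bounds how long a slow or flaky link can delay a commit.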

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#20)
#22Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#21)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#22)
#24ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Heikki Linnakangas (#23)
#25Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#20)
#26Markus Wanner
markus@bluegap.ch
In reply to: ITAGAKI Takahiro (#24)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#25)
#28Markus Wanner
markus@bluegap.ch
In reply to: Fujii Masao (#25)
#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#25)
#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#23)
#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#25)
#32Simon Riggs
simon@2ndQuadrant.com
In reply to: Markus Wanner (#28)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#25)
#34Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#34)
#36Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Heikki Linnakangas (#23)
#37Markus Wanner
markus@bluegap.ch
In reply to: Dimitri Fontaine (#36)
#38Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Markus Wanner (#37)
#39Simon Riggs
simon@2ndQuadrant.com
In reply to: Dimitri Fontaine (#38)
#40Markus Wanner
markus@bluegap.ch
In reply to: ITAGAKI Takahiro (#24)
#41Markus Wanner
markus@bluegap.ch
In reply to: Dimitri Fontaine (#38)
#42Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Simon Riggs (#39)
#43Simon Riggs
simon@2ndQuadrant.com
In reply to: Dimitri Fontaine (#42)
#44Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#35)
#45Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#44)
#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#45)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#46)
#48Zeugswetter Andreas ADI SD
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Heikki Linnakangas (#46)
#49Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#27)
#50Fujii Masao
masao.fujii@gmail.com
In reply to: Markus Wanner (#28)
#51Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#46)
#52Hannu Krosing
hannu@tm.ee
In reply to: Fujii Masao (#51)
#53Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Hannu Krosing (#52)
#54Simon Riggs
simon@2ndQuadrant.com
In reply to: Hannu Krosing (#52)
#55Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#49)
#56Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#53)
#57Csaba Nagy
nagy@ecircle-ag.com
In reply to: Zeugswetter Andreas ADI SD (#48)
#58Markus Wanner
markus@bluegap.ch
In reply to: Simon Riggs (#54)
#59Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#55)
#60Hannu Krosing
hannu@tm.ee
In reply to: Simon Riggs (#54)
#61Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#59)
#62Hannu Krosing
hannu@tm.ee
In reply to: Markus Wanner (#58)
#63Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#59)
#64Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#63)
#65Simon Riggs
simon@2ndQuadrant.com
In reply to: Markus Wanner (#58)
#66Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#54)
#67Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#66)
#68Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Heikki Linnakangas (#64)
#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#66)
#70Markus Wanner
markus@bluegap.ch
In reply to: Simon Riggs (#65)
#71Simon Riggs
simon@2ndQuadrant.com
In reply to: Dimitri Fontaine (#68)
#72Aidan Van Dyk
aidan@highrise.ca
In reply to: Simon Riggs (#71)
#73Simon Riggs
simon@2ndQuadrant.com
In reply to: Aidan Van Dyk (#72)
#74Fujii Masao
masao.fujii@gmail.com
In reply to: Markus Wanner (#40)
#75Markus Wanner
markus@bluegap.ch
In reply to: Fujii Masao (#74)
#76Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#66)
#77Fujii Masao
masao.fujii@gmail.com
In reply to: Markus Wanner (#75)
#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#77)
#79Markus Wanner
markus@bluegap.ch
In reply to: Fujii Masao (#77)
#80Markus Wanner
markus@bluegap.ch
In reply to: Tom Lane (#78)
#81Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#76)
#82Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#81)
#83Tom Lane
tgl@sss.pgh.pa.us
In reply to: Markus Wanner (#80)
#84Markus Wanner
markus@bluegap.ch
In reply to: Tom Lane (#83)
#85Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#83)
#86Markus Wanner
markus@bluegap.ch
In reply to: Bruce Momjian (#85)
#87Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#82)
#88Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#81)
#89Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#87)
#90Hannu Krosing
hannu@tm.ee
In reply to: Heikki Linnakangas (#89)
#91Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#89)
#92Csaba Nagy
nagy@ecircle-ag.com
In reply to: Hannu Krosing (#90)
#93Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#91)
#94Andrew Dunstan
andrew@dunslane.net
In reply to: Heikki Linnakangas (#93)
#95Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Csaba Nagy (#92)
#96Markus Wanner
markus@bluegap.ch
In reply to: Andrew Dunstan (#94)
#97Simon Riggs
simon@2ndQuadrant.com
In reply to: Csaba Nagy (#92)
#98Hannu Krosing
hannu@tm.ee
In reply to: Simon Riggs (#97)
#99Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Hannu Krosing (#98)
#100Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#87)
#101Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#82)
#102Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#91)
#103Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#102)
#104Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#101)
#105Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#88)
#106Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#20)