Synchronous replication - patch status inquiry

Started by fazool mein, over 15 years ago, 58 messages

fazoolmein@gmail.com

over 15 years ago

Hello everyone,

I'm interested in benchmarking synchronous replication, to see how
performance degrades compared to asynchronous streaming replication.

I browsed through the archive of emails, but things still seem unclear. Do
we have a final agreed upon patch that I can use? Any links for that?

Thanks.

OS = Linux Suse, sles 11, 64-bit
Postgres version = 9.0 beta-4

Bruce Momjian

bruce@momjian.us

over 15 years ago

In reply to: fazool mein (#1)

Re: Synchronous replication - patch status inquiry

fazool mein wrote:

Hello everyone,

I'm interested in benchmarking synchronous replication, to see how
performance degrades compared to asynchronous streaming replication.

I browsed through the archive of emails, but things still seem unclear. Do
we have a final agreed upon patch that I can use? Any links for that?

No.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

David Fetter

david@fetter.org

over 15 years ago

In reply to: Bruce Momjian (#2)

Re: Synchronous replication - patch status inquiry

On Tue, Aug 31, 2010 at 05:44:15PM -0400, Bruce Momjian wrote:

fazool mein wrote:

Hello everyone,

I'm interested in benchmarking synchronous replication, to see how
performance degrades compared to asynchronous streaming replication.

I browsed through the archive of emails, but things still seem unclear. Do
we have a final agreed upon patch that I can use? Any links for that?

No.

That was a mite brusque and not super informative.

There are patches, and the latest from Fujii Masao is probably worth
looking at :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: David Fetter (#3)

Re: Synchronous replication - patch status inquiry

On Tue, Aug 31, 2010 at 6:24 PM, David Fetter <david@fetter.org> wrote:

On Tue, Aug 31, 2010 at 05:44:15PM -0400, Bruce Momjian wrote:

fazool mein wrote:

Hello everyone,

I'm interested in benchmarking synchronous replication, to see how
performance degrades compared to asynchronous streaming replication.

I browsed through the archive of emails, but things still seem unclear. Do
we have a final agreed upon patch that I can use? Any links for that?

No.

That was a mite brusque and not super informative.

There are patches, and the latest from Fujii Masao is probably worth
looking at :)

I am pretty sure, however, that the performance will be terrible at
this point. Heikki is working on fixing that, but it ain't done yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

David Fetter

david@fetter.org

over 15 years ago

In reply to: Robert Haas (#4)

Re: Synchronous replication - patch status inquiry

On Tue, Aug 31, 2010 at 08:34:31PM -0400, Robert Haas wrote:

On Tue, Aug 31, 2010 at 6:24 PM, David Fetter <david@fetter.org> wrote:

On Tue, Aug 31, 2010 at 05:44:15PM -0400, Bruce Momjian wrote:

fazool mein wrote:

Hello everyone,

I'm interested in benchmarking synchronous replication, to see
how performance degrades compared to asynchronous streaming
replication.

I browsed through the archive of emails, but things still seem
unclear. Do we have a final agreed upon patch that I can use?
Any links for that?

No.

That was a mite brusque and not super informative.

There are patches, and the latest from Fujii Masao is probably
worth looking at :)

I am pretty sure, however, that the performance will be terrible at
this point. Heikki is working on fixing that, but it ain't done
yet.

Is this something for an eDB feature, or for community PostgreSQL,
or...?

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: David Fetter (#6)

Re: Synchronous replication - patch status inquiry

On Tue, Aug 31, 2010 at 8:45 PM, David Fetter <david@fetter.org> wrote:

I am pretty sure, however, that the performance will be terrible at
this point. Heikki is working on fixing that, but it ain't done
yet.

Is this something for an eDB feature, or for community PostgreSQL,
or...?

It's an EDB feature in the sense that Heikki is developing it as part
of his employment with EDB, but it will be committed to community
PostgreSQL. See the thread on interruptible sleeps. The problem
right now is that there are some polling loops that act to throttle
the maximum rate at which a node doing sync rep can make forward
progress, independent of the capabilities of the hardware. Those need
to be replaced with a system that doesn't inject unnecessary delays
into the process, which is what Heikki is working on.
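To make the delay point concrete, here is a back-of-the-envelope sketch. It is not taken from any patch: the function names are made up, and the 100 ms interval used in the example is an arbitrary assumption, not a value PostgreSQL is known to use. It only shows how a fixed polling interval adds latency that has nothing to do with hardware speed.

```python
import math


def noticed_at(event_ms: float, poll_interval_ms: float) -> float:
    """Time at which a loop that wakes every poll_interval_ms first sees the event."""
    return math.ceil(event_ms / poll_interval_ms) * poll_interval_ms


def added_delay(event_ms: float, poll_interval_ms: float) -> float:
    """Extra latency caused purely by sleeping until the next poll."""
    return noticed_at(event_ms, poll_interval_ms) - event_ms


# Hypothetical example: WAL arrives 130 ms in, the loop polls every 100 ms.
print(added_delay(130, 100))
```

An interruptible sleep would wake the waiter when the event happens, so the added delay would shrink towards zero instead of depending on the polling interval.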

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Robert Haas (#4)

Re: Synchronous replication - patch status inquiry

On Wed, Sep 1, 2010 at 9:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

There are patches, and the latest from Fujii Masao is probably worth
looking at :)

I am pretty sure, however, that the performance will be terrible at
this point. Heikki is working on fixing that, but it ain't done yet.

Yep. The latest WIP code is available in my git repository, but it's
not worth benchmarking yet. I'll need to merge Heikki's effort and
the synchronous replication patch.

git://git.postgresql.org/git/users/fujii/postgres.git
branch: synchrep

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

fazool mein

fazoolmein@gmail.com

over 15 years ago

In reply to: Fujii Masao (#8)

Re: Synchronous replication - patch status inquiry

Thanks!

I'll wait for the merging then; there is no point in benchmarking otherwise.

Regards

On Tue, Aug 31, 2010 at 6:06 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Sep 1, 2010 at 9:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

There are patches, and the latest from Fujii Masao is probably worth
looking at :)

I am pretty sure, however, that the performance will be terrible at
this point. Heikki is working on fixing that, but it ain't done yet.

Yep. The latest WIP code is available in my git repository, but it's
not worth benchmarking yet. I'll need to merge Heikki's effort and
the synchronous replication patch.

git://git.postgresql.org/git/users/fujii/postgres.git
branch: synchrep

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#10

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Robert Haas (#7)

Re: Synchronous replication - patch status inquiry

On 01/09/10 04:02, Robert Haas wrote:

See the thread on interruptible sleeps. The problem
right now is that there are some polling loops that act to throttle
the maximum rate at which a node doing sync rep can make forward
progress, independent of the capabilities of the hardware.

To be precise, the polling doesn't affect the "bandwidth" the
replication can handle, but it introduces a delay wh

Those need
to be replaced with a system that doesn't inject unnecessary delays
into the process, which is what Heikki is working on.

Right.

Once we're done with that, all the big questions are still left. How to
configure it? What does synchronous replication mean, when is a
transaction acknowledged as committed? What to do if a standby server
dies and never acknowledges a commit? All these issues have been
discussed, but there is no consensus yet.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Heikki Linnakangas (#10)

Re: Synchronous replication - patch status inquiry

On Wed, Sep 1, 2010 at 2:33 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Once we're done with that, all the big questions are still left.

Yeah, let's discuss those topics :)

How to configure it?

Before discussing that, we should determine whether registering
standbys in the master is really required. It affects configuration a lot.
Heikki thinks that it's required, but I'm still unclear about why and
how.

Why do standbys need to be registered in the master? What information
should be registered?

What does synchronous replication mean, when is a transaction
acknowledged as committed?

I proposed four synchronization levels:

1. async
   doesn't make transaction commit wait for replication, i.e.,
   asynchronous replication. This mode is already supported in
   9.0.

2. recv
   makes transaction commit wait until the standby has received WAL
   records.

3. fsync
   makes transaction commit wait until the standby has received and
   flushed WAL records to disk.

4. replay
   makes transaction commit wait until the standby has replayed WAL
   records after receiving and flushing them to disk.

OTOH, Simon proposed the quorum commit feature. I think that both
are required for our various use cases. Thoughts?

What to do if a standby server dies and never
acknowledges a commit?

The master's reaction to that situation should be configurable. So
I'd propose a new configuration parameter specifying the reaction.
Valid values are:

- standalone
  When the master has waited for the ACK much longer than the timeout
  (or detected the failure of the standby), it closes the connection
  to the standby and resumes the suspended transactions.

- down
  When that situation occurs, the master shuts down immediately.
  Though this is unsafe for a system requiring high availability,
  as far as I recall, some people wanted this mode in the previous
  discussion.
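
As a rough illustration of the two reactions described above, here is a minimal sketch. It is not code from any patch: `Reaction`, `on_standby_timeout` and the returned strings are hypothetical names chosen only for this example.

```python
from enum import Enum


class Reaction(Enum):
    STANDALONE = "standalone"  # drop the standby, keep serving commits
    DOWN = "down"              # shut the master down


def on_standby_timeout(waited_ms: int, timeout_ms: int, reaction: Reaction) -> str:
    """Decide what the master does with commits waiting on a silent standby."""
    if waited_ms <= timeout_ms:
        # The standby may still answer in time.
        return "keep waiting"
    if reaction is Reaction.STANDALONE:
        # Close the replication connection and release the waiting commits.
        return "disconnect standby and resume transactions"
    # Reaction.DOWN: refuse to continue without the standby.
    return "shut down master"
```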

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#12

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Fujii Masao (#11)

Re: Synchronous replication - patch status inquiry

On 01/09/10 10:53, Fujii Masao wrote:

Before discussing that, we should determine whether registering
standbys in the master is really required. It affects configuration a lot.
Heikki thinks that it's required, but I'm still unclear about why and
how.

Why do standbys need to be registered in the master? What information
should be registered?

That requirement falls out from the handling of disconnected standbys.
If a standby is not connected, what does the master do with commits? If
the answer is anything else than acknowledge them to the client
immediately, as if the standby never existed, the master needs to know
what standby servers exist. Otherwise it can't know if all the standbys
are connected or not.

What does synchronous replication mean, when is a transaction
acknowledged as committed?

I proposed four synchronization levels:

1. async
   doesn't make transaction commit wait for replication, i.e.,
   asynchronous replication. This mode is already supported in
   9.0.

2. recv
   makes transaction commit wait until the standby has received WAL
   records.

3. fsync
   makes transaction commit wait until the standby has received and
   flushed WAL records to disk.

4. replay
   makes transaction commit wait until the standby has replayed WAL
   records after receiving and flushing them to disk.

OTOH, Simon proposed the quorum commit feature. I think that both
are required for our various use cases. Thoughts?

I'd like to keep this as simple as possible, yet flexible so that with
enough scripting and extensions, you can get all sorts of behavior. I
think quorum commit falls into the "extension" category; if your setup
is complex enough, it's going to be impossible to represent that in our
config files no matter what. But if you write a little proxy, you can
implement arbitrary rules there.

I think recv/fsync/replay should be specified in the standby. It has no
direct effect on the master, the master would just relay the setting to
the standby when it connects, or the standby would send multiple
XLogRecPtrs and let the master decide when the WAL is persistent enough.
And what if you write a proxy that has some other meaning of "persistent
enough"? Like when it has been written to the OS buffers but not yet
fsync'd, or when it has been fsync'd to at least one standby and
received by at least three others. recv/fsync/replay is not going to
represent that behavior well.

"sync vs async" on the other hand should be specified in the master,
because it has a direct impact on the behavior of commits in the master.

I propose a configuration file standbys.conf, in the master:

# STANDBY NAME      SYNCHRONOUS   TIMEOUT
importantreplica    yes           100ms
tempcopy            no            10s

Or perhaps this should be stored in a system catalog.
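
To show how small such a file format would be to handle, here is a sketch of a parser for the proposed layout. The file is only a proposal in this thread, so everything here is an assumption: the `StandbyEntry` type, the function names, and the rule that timeouts end in `ms` or `s`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StandbyEntry:
    name: str
    synchronous: bool
    timeout_ms: int


def parse_timeout_ms(text: str) -> int:
    """Convert '100ms' or '10s' to milliseconds (units assumed from the example)."""
    if text.endswith("ms"):
        return int(text[:-2])
    if text.endswith("s"):
        return int(text[:-1]) * 1000
    raise ValueError(f"unknown timeout unit: {text!r}")


def parse_standbys_conf(text: str) -> List[StandbyEntry]:
    """Parse 'name yes|no timeout' lines, skipping comments and blank lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sync, timeout = line.split()
        if sync not in ("yes", "no"):
            raise ValueError(f"SYNCHRONOUS must be yes or no, got {sync!r}")
        entries.append(StandbyEntry(name, sync == "yes", parse_timeout_ms(timeout)))
    return entries
```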

What to do if a standby server dies and never
acknowledges a commit?

The master's reaction to that situation should be configurable. So
I'd propose a new configuration parameter specifying the reaction.
Valid values are:

- standalone
  When the master has waited for the ACK much longer than the timeout
  (or detected the failure of the standby), it closes the connection
  to the standby and resumes the suspended transactions.

- down
  When that situation occurs, the master shuts down immediately.
  Though this is unsafe for a system requiring high availability,
  as far as I recall, some people wanted this mode in the previous
  discussion.

Yeah, though of course you might want to set that per-standby too..

Let's step back a bit and ask what would be the simplest thing that you
could call "synchronous replication" in good conscience, and also be
useful at least to some people. Let's leave out the "down" mode, because
that requires registration. We'll probably have to do registration at
some point, but let's take as small steps as possible.

Without the "down" mode in the master, frankly I don't see the point of
the "recv" and "fsync" levels in the standby. Either way, when the
master acknowledges a commit to the client, you don't know if it has
made it to the standby yet because the replication connection might be
down for some reason.

That leaves us the 'replay' mode, which *is* useful, because it gives
you the guarantee that when the master acknowledges a commit, it will
appear committed in all hot standby servers that are currently
connected. With that guarantee you can build a reliable cluster with
something like pgpool-II where all writes go to one node, and reads are
distributed to multiple nodes.

I'm not sure what we should aim for in the first phase. But if you want
as little code as possible yet have something useful, I think 'replay'
mode with no standby registration is the way to go.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Heikki Linnakangas (#10)

Re: Synchronous replication - patch status inquiry

On Wed, 2010-09-01 at 08:33 +0300, Heikki Linnakangas wrote:

On 01/09/10 04:02, Robert Haas wrote:

See the thread on interruptible sleeps. The problem
right now is that there are some polling loops that act to throttle
the maximum rate at which a node doing sync rep can make forward
progress, independent of the capabilities of the hardware.

To be precise, the polling doesn't affect the "bandwidth" the
replication can handle, but it introduces a delay wh

We're sending the WAL data in batches. We can't really escape from the
fact that we're effectively using group commit when we use synch rep.
That will necessarily increase delay and require more sessions to get
the same throughput.

Those need
to be replaced with a system that doesn't inject unnecessary delays
into the process, which is what Heikki is working on.

Right.

Once we're done with that, all the big questions are still left. How to
configure it? What does synchronous replication mean, when is a
transaction acknowledged as committed? What to do if a standby server
dies and never acknowledges a commit? All these issues have been
discussed, but there is no consensus yet.

That sounds an awful lot like performance tuning first and the feature
additions last.

And if you're in the middle of performance tuning, surely some objective
performance tests would help us, no?

IMHO we should be concentrating on how to add the next features because
it's clear to me that if you do things in the wrong order you'll be
wasting time. And we don't have much of that, ever.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#14

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Heikki Linnakangas (#12)

Re: Synchronous replication - patch status inquiry

On Wed, Sep 1, 2010 at 6:23 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I'm not sure what we should aim for in the first phase. But if you want as
little code as possible yet have something useful, I think 'replay' mode
with no standby registration is the way to go.

IMHO, less is more. Trying to do too much at once can cause us to
miss the release window (and can also create more bugs). We just need
to leave the door open to adding later whatever we leave out now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#15

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Heikki Linnakangas (#12)

Re: Synchronous replication - patch status inquiry

On Wed, 2010-09-01 at 13:23 +0300, Heikki Linnakangas wrote:

On 01/09/10 10:53, Fujii Masao wrote:

Before discussing that, we should determine whether registering
standbys in the master is really required. It affects configuration a lot.
Heikki thinks that it's required, but I'm still unclear about why and
how.

Why do standbys need to be registered in the master? What information
should be registered?

That requirement falls out from the handling of disconnected standbys.
If a standby is not connected, what does the master do with commits? If
the answer is anything else than acknowledge them to the client
immediately, as if the standby never existed, the master needs to know
what standby servers exist. Otherwise it can't know if all the standbys
are connected or not.

"All the standbys" presupposes that we know what they are, i.e. we have
registered them, so I see that argument as circular. Quorum commit does
not need registration, so quorum commit is the "easy to implement"
option and registration is the more complex later feature. I don't have
a problem with adding registration later and believe it can be done
later without issues.
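
One possible reading of quorum commit, sketched below, is that a commit is released once some number of the currently connected standbys have acknowledged it, without the master keeping a list of which standbys ought to exist. This is an illustration under that assumption only; `quorum_reached` and its parameters are hypothetical, and LSNs are modelled as plain integers.

```python
from typing import List


def quorum_reached(commit_lsn: int, acked_lsns: List[int], quorum: int) -> bool:
    """True once at least `quorum` connected standbys have acknowledged
    WAL up to (or past) the commit's LSN."""
    acks = sum(1 for lsn in acked_lsns if lsn >= commit_lsn)
    return acks >= quorum


# Three standbys happen to be connected; two have acknowledged LSN 100.
print(quorum_reached(100, [120, 100, 80], quorum=2))
```

Note that the function only looks at whoever is connected right now, which is why no registration is needed in this reading.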

What does synchronous replication mean, when is a transaction
acknowledged as committed?

I proposed four synchronization levels:

1. async
   doesn't make transaction commit wait for replication, i.e.,
   asynchronous replication. This mode is already supported in
   9.0.

2. recv
   makes transaction commit wait until the standby has received WAL
   records.

3. fsync
   makes transaction commit wait until the standby has received and
   flushed WAL records to disk.

4. replay
   makes transaction commit wait until the standby has replayed WAL
   records after receiving and flushing them to disk.

OTOH, Simon proposed the quorum commit feature. I think that both
are required for our various use cases. Thoughts?

I'd like to keep this as simple as possible, yet flexible so that with
enough scripting and extensions, you can get all sorts of behavior. I
think quorum commit falls into the "extension" category; if your setup
is complex enough, it's going to be impossible to represent that in our
config files no matter what. But if you write a little proxy, you can
implement arbitrary rules there.

I think recv/fsync/replay should be specified in the standby.

I think the wait mode (i.e. recv/fsync/replay or others) should be
specified in the master. This allows the application to specify whatever
level of protection it requires, and also allows the behaviour to be
different for user-specifiable parts of the application. As soon as you
set this on the standby then you have the one-size fits all approach to
synchronisation.

We already know performance of synchronous rep is poor, which is exactly
why I want to be able to control it at the application level. Fine
grained control is important, otherwise we may as well just use DRBD and
skip this project completely, since we already have that. It will also
be a feature that no other database has, taking us truly beyond what has
gone before.

The master/standby decision is not something that is easily changed.
Whichever we decide now will be the thing we stick with.

It has no
direct effect on the master, the master would just relay the setting to
the standby when it connects, or the standby would send multiple
XLogRecPtrs and let the master decide when the WAL is persistent enough.
And what if you write a proxy that has some other meaning of "persistent
enough"? Like when it has been written to the OS buffers but not yet
fsync'd, or when it has been fsync'd to at least one standby and
received by at least three others. recv/fsync/replay is not going to
represent that behavior well.

"sync vs async" on the other hand should be specified in the master,
because it has a direct impact on the behavior of commits in the master.

I propose a configuration file standbys.conf, in the master:

# STANDBY NAME      SYNCHRONOUS   TIMEOUT
importantreplica    yes           100ms
tempcopy            no            10s

Or perhaps this should be stored in a system catalog.

That part sounds like complexity that can wait until later. I would not
object if you really want this, but would prefer it to look like this:

# STANDBY NAME      DEFAULT_WAIT_MODE   TIMEOUT
importantreplica    sync                100ms
tempcopy            async               10s

You don't *have* to use the application level control if you don't want
it. But it's an important capability for real-world apps, since the
alternative is deliberately splitting an application across two database
servers each with different wait modes.

What to do if a standby server dies and never
acknowledges a commit?

The master's reaction to that situation should be configurable. So
I'd propose a new configuration parameter specifying the reaction.
Valid values are:

- standalone
  When the master has waited for the ACK much longer than the timeout
  (or detected the failure of the standby), it closes the connection
  to the standby and resumes the suspended transactions.

- down
  When that situation occurs, the master shuts down immediately.
  Though this is unsafe for a system requiring high availability,
  as far as I recall, some people wanted this mode in the previous
  discussion.

Yeah, though of course you might want to set that per-standby too..

Let's step back a bit and ask what would be the simplest thing that you
could call "synchronous replication" in good conscience, and also be
useful at least to some people. Let's leave out the "down" mode, because
that requires registration. We'll probably have to do registration at
some point, but let's take as small steps as possible.

Without the "down" mode in the master, frankly I don't see the point of
the "recv" and "fsync" levels in the standby. Either way, when the
master acknowledges a commit to the client, you don't know if it has
made it to the standby yet because the replication connection might be
down for some reason.

That leaves us the 'replay' mode, which *is* useful, because it gives
you the guarantee that when the master acknowledges a commit, it will
appear committed in all hot standby servers that are currently
connected. With that guarantee you can build a reliable cluster with
something like pgpool-II where all writes go to one node, and reads are
distributed to multiple nodes.

I'm not sure what we should aim for in the first phase. But if you want
as little code as possible yet have something useful, I think 'replay'
mode with no standby registration is the way to go.

I don't see it as any more code to implement.

When the standby replies, it can return
* latest LSN received
* latest LSN fsynced
* latest LSN replayed
etc

We then release waiting committers on the master according to which of
the above they said they want to wait for. The standby does *not* need
to know the wishes of transactions on the master.

Note that means that receiving, fsyncing and replaying can all progress
as an asynchronous pipeline, giving great overall throughput.

Once you accept that there are multiple modes, then the actual number of
wait modes is unimportant. It's just an array of [NUM_WAIT_MODES], so
the project need not be delayed just because we have 2, 3 or 4 wait
modes.
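
The scheme described above can be sketched roughly as follows. This is an illustration, not code from any patch: `WaitMode`, `Master`, `wait_for` and `on_standby_reply` are made-up names, LSNs are modelled as plain integers, and sessions are just strings.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Tuple


class WaitMode(IntEnum):
    RECV = 0
    FSYNC = 1
    REPLAY = 2


def _empty_queues() -> Dict[WaitMode, List[Tuple[int, str]]]:
    # One waiter queue per mode: the "array of [NUM_WAIT_MODES]".
    return {mode: [] for mode in WaitMode}


@dataclass
class Master:
    waiters: Dict[WaitMode, List[Tuple[int, str]]] = field(default_factory=_empty_queues)

    def wait_for(self, session: str, commit_lsn: int, mode: WaitMode) -> None:
        """A committing session registers the LSN and the mode it wants to wait for."""
        self.waiters[mode].append((commit_lsn, session))

    def on_standby_reply(self, received: int, fsynced: int, replayed: int) -> List[str]:
        """Release every waiter whose chosen LSN has reached its commit LSN.
        The standby reports all three positions and knows nothing about waiters."""
        progress = {WaitMode.RECV: received, WaitMode.FSYNC: fsynced, WaitMode.REPLAY: replayed}
        released = []
        for mode, lsn in progress.items():
            still_waiting = []
            for commit_lsn, session in self.waiters[mode]:
                if commit_lsn <= lsn:
                    released.append(session)
                else:
                    still_waiting.append((commit_lsn, session))
            self.waiters[mode] = still_waiting
        return released
```

Adding a fourth wait mode in this sketch means adding one enum member and one more position in the standby's reply; the release loop does not change.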

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#16

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Heikki Linnakangas (#12)

Re: Synchronous replication - patch status inquiry

On Wed, Sep 1, 2010 at 7:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

That requirement falls out from the handling of disconnected standbys. If a
standby is not connected, what does the master do with commits? If the
answer is anything else than acknowledge them to the client immediately, as
if the standby never existed, the master needs to know what standby servers
exist. Otherwise it can't know if all the standbys are connected or not.

Thanks. I understood why the registration is required.

I'd like to keep this as simple as possible, yet flexible so that with
enough scripting and extensions, you can get all sorts of behavior. I think
quorum commit falls into the "extension" category; if your setup is
complex enough, it's going to be impossible to represent that in our config
files no matter what. But if you write a little proxy, you can implement
arbitrary rules there.

Agreed.

I think recv/fsync/replay should be specified in the standby. It has no
direct effect on the master, the master would just relay the setting to the
standby when it connects, or the standby would send multiple XLogRecPtrs and
let the master decide when the WAL is persistent enough.

The latter seems wasteful since the master uses only one XLogRecPtr even if
the standby sends multiple ones. So I prefer the former design, which also
makes the code and design very simple, and we can easily write the proxy.

"sync vs async" on the other hand should be specified in the master, because
it has a direct impact on the behavior of commits in the master.

I propose a configuration file standbys.conf, in the master:

# STANDBY NAME      SYNCHRONOUS   TIMEOUT
importantreplica    yes           100ms
tempcopy            no            10s

Seems good. In fact, could async/recv/fsync/replay be specified in the
SYNCHRONOUS field instead of yes/no?

OTOH, something like a standby_name parameter should be introduced in
recovery.conf.

Should we allow multiple standbys with the same name? Probably yes.
We might need to add a NUMBER field to standbys.conf in the future.

Yeah, though of course you might want to set that per-standby too..

Yep.

Let's step back a bit and ask what would be the simplest thing that you
could call "synchronous replication" in good conscience, and also be useful
at least to some people. Let's leave out the "down" mode, because that
requires registration. We'll probably have to do registration at some point,
but let's take as small steps as possible.

Agreed.

Without the "down" mode in the master, frankly I don't see the point of the
"recv" and "fsync" levels in the standby. Either way, when the master
acknowledges a commit to the client, you don't know if it has made it to the
standby yet because the replication connection might be down for some
reason.

True. We cannot know whether the standby can be brought up as the master
without any data loss when the master crashes, because the standby might
have been disconnected earlier for some reason and lack the latest data.

But the situation would be the same even when 'replay' mode is chosen.
Though we might be able to check whether the latest transaction has
been replicated to the standby by running a read-only query on the
standby, it's actually difficult to do that. How can we know the content
of the latest transaction?

Also, even when 'recv' or 'fsync' is chosen, we might be able to check
that by calling pg_last_xlog_receive_location() on the standby. But a
similar question occurs to me: How can we know the LSN of the latest
transaction?

I'm thinking of introducing a new parameter specifying a command to be
executed when the standby is disconnected. The walsender would run this
command before resuming the transaction processing that was suspended
by the disconnection. For example, if a STONITH against the standby is
supplied as the command, we can prevent a standby that lacks the latest
data from becoming the master by forcibly shutting such a delayed
standby down. Thoughts?

That leaves us the 'replay' mode, which *is* useful, because it gives you
the guarantee that when the master acknowledges a commit, it will appear
committed in all hot standby servers that are currently connected. With that
guarantee you can build a reliable cluster with something like pgpool-II where
all writes go to one node, and reads are distributed to multiple nodes.

I'm concerned that conflicts between read-only queries and recovery might
harm performance on the master in 'replay' mode. If a conflict occurs,
all running transactions on the master have to wait for it to disappear,
which can take very long. Of course, even without a conflict, waiting
until the standby has received, fsync'd, read and replayed WAL would
take long. So I'd like to also support 'recv' and 'fsync'. I believe
that those two modes are not complicated or difficult to implement.

I'm not sure what we should aim for in the first phase. But if you want as
little code as possible yet have something useful, I think 'replay' mode
with no standby registration is the way to go.

What about recv/fsync/replay mode with no standby registration?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#17

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Fujii Masao (#16)

Re: Synchronous replication - patch status inquiry

On Thu, 2010-09-02 at 19:24 +0900, Fujii Masao wrote:

On Wed, Sep 1, 2010 at 7:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

That requirement falls out from the handling of disconnected standbys. If a
standby is not connected, what does the master do with commits? If the
answer is anything else than acknowledge them to the client immediately, as
if the standby never existed, the master needs to know what standby servers
exist. Otherwise it can't know if all the standbys are connected or not.

Thanks. I understood why the registration is required.

I don't. There is a simpler design that does not require registration.

Please explain why we need registration, with an explanation that does
not presume it as a requirement.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#18

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Simon Riggs (#17)

Re: Synchronous replication - patch status inquiry

On 02/09/10 15:03, Simon Riggs wrote:

On Thu, 2010-09-02 at 19:24 +0900, Fujii Masao wrote:

On Wed, Sep 1, 2010 at 7:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

That requirement falls out from the handling of disconnected standbys. If a
standby is not connected, what does the master do with commits? If the
answer is anything else than acknowledge them to the client immediately, as
if the standby never existed, the master needs to know what standby servers
exist. Otherwise it can't know if all the standbys are connected or not.

Thanks. I understood why the registration is required.

I don't. There is a simpler design that does not require registration.

Please explain why we need registration, with an explanation that does
not presume it as a requirement.

Please explain how you would implement "don't acknowledge commits until
they're replicated to all standbys" without standby registration.
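
As a rough illustration of the question, here is a minimal sketch of the check the master would have to make. The function names and the idea of tracking standbys by name in sets are assumptions made for the example, not anything from a patch. It shows why a master that only sees current connections cannot tell "all standbys have acknowledged" from "the only standby still connected has acknowledged".

```python
def all_registered_acked(registered, acked):
    """True only if every standby the master was told about has acknowledged.

    registered: set of standby names configured on the master (hypothetical)
    acked:      set of standby names that acknowledged the commit
    """
    return registered <= acked


def all_connected_acked(connected, acked):
    """The best a master can do without registration: it only knows
    which standbys happen to be connected right now."""
    return connected <= acked
```

With two standbys of which one has dropped off the network, the second check passes as soon as the remaining standby acknowledges, while the first keeps the commit waiting.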

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Heikki Linnakangas (#18)

Re: Synchronous replication - patch status inquiry

On Thu, 2010-09-02 at 15:15 +0300, Heikki Linnakangas wrote:

On 02/09/10 15:03, Simon Riggs wrote:

On Thu, 2010-09-02 at 19:24 +0900, Fujii Masao wrote:

On Wed, Sep 1, 2010 at 7:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

That requirement falls out from the handling of disconnected standbys. If a
standby is not connected, what does the master do with commits? If the
answer is anything else than acknowledge them to the client immediately, as
if the standby never existed, the master needs to know what standby servers
exist. Otherwise it can't know if all the standbys are connected or not.

Thanks. I understood why the registration is required.

I don't. There is a simpler design that does not require registration.

Please explain why we need registration, with an explanation that does
not presume it as a requirement.

Please explain how you would implement "don't acknowledge commits until
they're replicated to all standbys" without standby registration.

"All standbys" has no meaning without registration. It is not a question
that needs an answer.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#20

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Simon Riggs (#19)

Re: Synchronous replication - patch status inquiry

On Thu, Sep 2, 2010 at 8:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

"All standbys" has no meaning without registration. It is not a question
that needs an answer.

Tell that to the DBA. I bet s/he knows what "all standbys" means.
The fact that the system doesn't know something doesn't make it
unimportant.

I agree that we don't absolutely need standby registration for some
really basic version of synchronous replication. But I think we'd be
better off biting the bullet and adding it. I think that without it
we're going to resort to a series of increasingly grotty and
user-unfriendly hacks to make this work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#21

Dimitri Fontaine

dfontaine@hi-media.com

over 15 years ago

In reply to: Robert Haas (#20)

Re: Synchronous replication - patch status inquiry

Robert Haas <robertmhaas@gmail.com> writes:

Tell that to the DBA. I bet s/he knows what "all standbys" means.
The fact that the system doesn't know something doesn't make it
unimportant.

Well as a DBA I think I'd much prefer to attribute "votes" to each
standby so that each ack is weighted. Let me explain in more detail the
setup I'm thinking about.

The transaction on the master wants a certain "service level" (async,
recv, fsync, replay) and a certain number of votes. As proposed earlier,
the standby would feed back the last XID known locally in each state
(received, synced, replayed) and its current weight, and the master
would arbitrate given that information.

That's highly flexible: you can have slaves join the party at any point
in time, and change 2 user GUCs (set by session, transaction, function,
database, role, in postgresql.conf) to set up the service level target
you want to ensure, from the master.

(We could go as far as wanting fsync:2,replay:1 as a service level.)
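
A minimal sketch of how the master could arbitrate, under the assumption that each standby reports a weight plus one position per state and that a service level is a mapping from state to required votes. The class and function names are made up for the illustration, and a plain integer stands in for the XID or WAL position.

```python
from dataclasses import dataclass


@dataclass
class StandbyFeedback:
    weight: int   # votes this standby contributes (assumed setting)
    recv: int     # last position received
    fsync: int    # last position synced to disk
    replay: int   # last position replayed


def votes_for(feedbacks, level, commit_pos):
    """Sum the weights of standbys that have reached commit_pos at `level`."""
    return sum(f.weight for f in feedbacks if getattr(f, level) >= commit_pos)


def commit_satisfied(feedbacks, requirements, commit_pos):
    """requirements maps a level to the votes needed, e.g. fsync:2,replay:1
    becomes {"fsync": 2, "replay": 1}."""
    return all(
        votes_for(feedbacks, level, commit_pos) >= needed
        for level, needed in requirements.items()
    )
```

Nothing in this check needs a list of expected standbys: a new slave simply starts contributing votes once its feedback arrives.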

From that you have either the "fail when a slave disappears" or the
"please don't shut the service down if a slave disappears" setting, per
transaction, and per slave too (that depends on its weight, remember).

(You can setup the slave weights as powers of 2 and have the service
level be masks to allow you to choose precisely which slave will ack
your fsync service level, and you can switch this slave at run time
easily — sounds cleverer, but sounds also easier to implement given
the flexibility it gives — precedents in PostgreSQL? the PITR and WAL
Shipping facilities are hard to use, full of traps, but very
flexible).
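
One possible reading of the powers-of-2 idea, sketched under the assumption that each slave's weight is a distinct bit and that the service level is a mask naming the slaves whose ack is required. The function names are hypothetical.

```python
def acked_mask(weights_of_acked):
    """OR together the weights (distinct powers of 2) of slaves that acked."""
    mask = 0
    for weight in weights_of_acked:
        mask |= weight
    return mask


def mask_satisfied(required_mask, weights_of_acked):
    """True if every slave selected by required_mask has acknowledged."""
    return (acked_mask(weights_of_acked) & required_mask) == required_mask
```

With slaves weighted 1, 2 and 4, a mask of 0b101 asks for acks from the first and third slave only; switching to another slave at run time is then just a change of mask.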

You can even give some more weight to one slave while you're maintaining
another so that the master just doesn't complain.

I see a need for a very dynamic *and decentralized* replication topology
setup; I fail to see a need for a centralized, registration-based setup.

I agree that we don't absolutely need standby registration for some
really basic version of synchronous replication. But I think we'd be
better off biting the bullet and adding it.

What does that mechanism allow us to implement that we can't do without it?
--
dim

#22

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Robert Haas (#20)

Re: Synchronous replication - patch status inquiry

On Thu, 2010-09-02 at 08:59 -0400, Robert Haas wrote:

On Thu, Sep 2, 2010 at 8:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

"All standbys" has no meaning without registration. It is not a question
that needs an answer.

Tell that to the DBA. I bet s/he knows what "all standbys" means.
The fact that the system doesn't know something doesn't make it
unimportant.

I agree that we don't absolutely need standby registration for some
really basic version of synchronous replication. But I think we'd be
better off biting the bullet and adding it. I think that without it
we're going to resort to a series of increasingly grotty and
user-unfriendly hacks to make this work.

I'm personally quite happy to have server registration.

My interest is in ensuring we have master-controlled robustness, which
is so far being ignored because "we need simple". Referring to above, we
are clearly quite willing to go beyond the most basic implementation, so
there's no further argument to exclude it for that reason.

The implementation of master-controlled robustness is no more difficult
than the alternative.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#23

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Simon Riggs (#22)

Re: Synchronous replication - patch status inquiry

On Thu, Sep 2, 2010 at 10:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-09-02 at 08:59 -0400, Robert Haas wrote:

On Thu, Sep 2, 2010 at 8:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

"All standbys" has no meaning without registration. It is not a question
that needs an answer.

Tell that to the DBA. I bet s/he knows what "all standbys" means.
The fact that the system doesn't know something doesn't make it
unimportant.

I agree that we don't absolutely need standby registration for some
really basic version of synchronous replication. But I think we'd be
better off biting the bullet and adding it. I think that without it
we're going to resort to a series of increasingly grotty and
user-unfriendly hacks to make this work.

I'm personally quite happy to have server registration.

OK, thanks for clarifying.

My interest is in ensuring we have master-controlled robustness, which
is so far being ignored because "we need simple". Referring to above, we
are clearly quite willing to go beyond the most basic implementation, so
there's no further argument to exclude it for that reason.

The implementation of master-controlled robustness is no more difficult
than the alternative.

But I'm not sure I quite follow this part. I don't think I know what
you mean by "master-controlled robustness".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#24

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Simon Riggs (#22)

Re: Synchronous replication - patch status inquiry

On 02/09/10 17:06, Simon Riggs wrote:

On Thu, 2010-09-02 at 08:59 -0400, Robert Haas wrote:

On Thu, Sep 2, 2010 at 8:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

"All standbys" has no meaning without registration. It is not a question
that needs an answer.

Tell that to the DBA. I bet s/he knows what "all standbys" means.
The fact that the system doesn't know something doesn't make it
unimportant.

I agree that we don't absolutely need standby registration for some
really basic version of synchronous replication. But I think we'd be
better off biting the bullet and adding it. I think that without it
we're going to resort to a series of increasingly grotty and
user-unfriendly hacks to make this work.

I'm personally quite happy to have server registration.

My interest is in ensuring we have master-controlled robustness, which
is so far being ignored because "we need simple". Referring to above, we
are clearly quite willing to go beyond the most basic implementation, so
there's no further argument to exclude it for that reason.

The implementation of master-controlled robustness is no more difficult
than the alternative.

I understand what you're after, the idea of being able to set
synchronization level on a per-transaction basis is cool. But I haven't
seen a satisfactory design for it. I don't understand how it would work
in practice. Even though it's cool, having different kinds of standbys
connected is a more common scenario, and the design needs to accommodate
that too. I'm all ears if you can sketch a design that can do that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#25

Joshua Tolley

eggyknap@gmail.com

over 15 years ago

In reply to: Fujii Masao (#11)

Re: Synchronous replication - patch status inquiry

On Wed, Sep 01, 2010 at 04:53:38PM +0900, Fujii Masao wrote:

- down
When that situation occurs, the master shuts down immediately.
Though this is unsafe for the system requiring high availability,
as far as I recall, some people wanted this mode in the previous
discussion.

Oracle provides this, among other possible configurations; perhaps that's why
it came up earlier.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

#26

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Heikki Linnakangas (#24)

Re: Synchronous replication - patch status inquiry

On Thu, Sep 2, 2010 at 11:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I understand what you're after, the idea of being able to set
synchronization level on a per-transaction basis is cool. But I haven't seen
a satisfactory design for it. I don't understand how it would work in
practice. Even though it's cool, having different kinds of standbys
connected is a more common scenario, and the design needs to accommodate
that too. I'm all ears if you can sketch a design that can do that.

That design would affect what the standby should reply. If we choose
async/recv/fsync/replay on a per-transaction basis, the standby
should send multiple LSNs and the master needs to decide when
replication has been completed. OTOH, if we choose just sync/async,
the standby has only to send one LSN.

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#27

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Fujii Masao (#26)

Re: Synchronous replication - patch status inquiry

On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:

On Thu, Sep 2, 2010 at 11:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I understand what you're after, the idea of being able to set
synchronization level on a per-transaction basis is cool. But I haven't seen
a satisfactory design for it. I don't understand how it would work in
practice. Even though it's cool, having different kinds of standbys
connected is a more common scenario, and the design needs to accommodate
that too. I'm all ears if you can sketch a design that can do that.

That design would affect what the standby should reply. If we choose
async/recv/fsync/replay on a per-transaction basis, the standby
should send multiple LSNs and the master needs to decide when
replication has been completed. OTOH, if we choose just sync/async,
the standby has only to send one LSN.

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

No, it doesn't. There is no requirement for additional messages. It just
adds 16 bytes onto the reply message, maybe 24. If there is a noticeable
overhead from that, shoot me.
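
The byte count can be checked with a small sketch, assuming each WAL position is encoded as 8 bytes and that the reply carries nothing else. The wire layout here is an assumption for the illustration, not the actual protocol message.

```python
import struct

# Network byte order, no padding: one 8-byte position per field.
ONE_LSN = struct.Struct("!Q")        # reply with a single position
THREE_LSNS = struct.Struct("!QQQ")   # received, fsynced, replayed


def encode_reply(received, fsynced, replayed):
    """Pack all three positions into one reply message."""
    return THREE_LSNS.pack(received, fsynced, replayed)


def decode_reply(payload):
    """Return (received, fsynced, replayed) from a reply message."""
    return THREE_LSNS.unpack(payload)


EXTRA_BYTES = THREE_LSNS.size - ONE_LSN.size  # 24 - 8 = 16
```

Under that encoding, carrying three positions instead of one makes each reply 16 bytes larger; whether the number of replies also grows is a separate question.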

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#28

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Fujii Masao (#16)

Re: Synchronous replication - patch status inquiry

On Thu, Sep 2, 2010 at 7:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I propose a configuration file standbys.conf, in the master:

# STANDBY NAME      SYNCHRONOUS   TIMEOUT
importantreplica    yes           100ms
tempcopy            no            10s

Seems good. In fact, could async/recv/fsync/replay be specified in the
SYNCHRONOUS field instead of yes/no?

OTOH, something like a standby_name parameter should be introduced in
recovery.conf.

Should we allow multiple standbys with the same name? Probably yes.
We might need to add a NUMBER field to standbys.conf in the future.

Here is the proposed detailed design:

standbys.conf
=============
# This is not initialized by initdb, so users need to create it under $PGDATA.
* The template is located in the PREFIX/share directory.

# This is read by postmaster at startup, just as pg_hba.conf is.
* In an EXEC_BACKEND environment, each walsender must read it at startup.
* This is ignored when max_wal_senders is zero.
* FATAL is emitted when standbys.conf doesn't exist even if max_wal_senders
is positive.

# SIGHUP makes only the postmaster re-read standbys.conf.
* New configuration doesn't affect the existing connections to the standbys,
i.e., it's used only for subsequent connections.
* XXX: Should the existing connections react to new configuration? What if
new standbys.conf doesn't have the standby_name of the existing
connection?

# The connection from the standby is rejected if its standby_name is not listed
in standbys.conf.
* Multiple standbys with the same name are allowed.

# The valid values of SYNCHRONOUS field are async, recv, fsync and replay.

standby_name
============
# This is a new string-typed parameter in recovery.conf.
* XXX: Should standby_name and standby_mode be merged?

# Walreceiver sends this to the master when establishing the connection.
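
To make the proposed file format concrete, here is a minimal parsing sketch. It assumes three whitespace-separated fields per line, `#` starting a comment as in pg_hba.conf, and timeouts written with an `ms` or `s` suffix as in the example above; the function names are invented for the illustration.

```python
import re

VALID_LEVELS = {"async", "recv", "fsync", "replay"}
_TIMEOUT = re.compile(r"^(\d+)(ms|s)$")


def parse_timeout_ms(text):
    """Convert '100ms' or '10s' to milliseconds."""
    match = _TIMEOUT.match(text)
    if match is None:
        raise ValueError(f"invalid timeout: {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value if unit == "ms" else value * 1000


def parse_standbys_conf(text):
    """Return a list of (standby_name, level, timeout_ms) tuples."""
    entries = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields")
        name, level, timeout = fields
        if level not in VALID_LEVELS:
            raise ValueError(f"line {lineno}: invalid level {level!r}")
        entries.append((name, level, parse_timeout_ms(timeout)))
    return entries


def lookup(entries, standby_name):
    """Return the entry for standby_name, or None to reject the connection."""
    for entry in entries:
        if entry[0] == standby_name:
            return entry
    return None
```

Because the lookup is by name only, any number of standbys presenting the same standby_name would match the same entry.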

Comments? Is the above too complicated for the first step? If so, I'd
propose to just introduce a new recovery.conf parameter like replication_mode
specifying the synchronization level, instead.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#29

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Simon Riggs (#27)

Re: Synchronous replication - patch status inquiry

On 03/09/10 09:36, Simon Riggs wrote:

On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:

That design would affect what the standby should reply. If we choose
async/recv/fsync/replay on a per-transaction basis, the standby
should send multiple LSNs and the master needs to decide when
replication has been completed. OTOH, if we choose just sync/async,
the standby has only to send one LSN.

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

No, it doesn't. There is no requirement for additional messages.

Please explain how you do it then. When a commit record is sent to the
standby, it needs to acknowledge it 1) when it has received it, 2) when
it fsyncs it to disk and 3) when it's replayed. I don't see how you can
get around that.

Perhaps you can save a bit by combining multiple messages together, like
in Nagle's algorithm, but then you introduce extra delays which is
exactly what you don't want.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#30

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Simon Riggs (#27)

Re: Synchronous replication - patch status inquiry

On Fri, Sep 3, 2010 at 3:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

No, it doesn't. There is no requirement for additional messages. It just
adds 16 bytes onto the reply message, maybe 24. If there is a noticeable
overhead from that, shoot me.

The reply message would be sent at least three times for every WAL chunk,
i.e., when the standby has received, synced and replayed it. So ISTM
that additional messages would be sent. Though I'm not sure whether this
really harms performance...

You'd like to choose async/recv/fsync/replay on a per-transaction basis
rather than async/sync?

Even when async is chosen as the synchronization level in standbys.conf,
can it be changed to another level within a transaction? If so, the standby
has to send the reply even if async is chosen, and most replies might be
ignored on the master.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#31

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Heikki Linnakangas (#29)

Re: Synchronous replication - patch status inquiry

On Fri, 2010-09-03 at 09:55 +0300, Heikki Linnakangas wrote:

On 03/09/10 09:36, Simon Riggs wrote:

On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:

That design would affect what the standby should reply. If we choose
async/recv/fsync/replay on a per-transaction basis, the standby
should send multiple LSNs and the master needs to decide when
replication has been completed. OTOH, if we choose just sync/async,
the standby has only to send one LSN.

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

No, it doesn't. There is no requirement for additional messages.

Please explain how you do it then. When a commit record is sent to the
standby, it needs to acknowledge it 1) when it has received it, 2) when
it fsyncs it to disk and 3) when it's replayed. I don't see how you can
get around that.

Perhaps you can save a bit by combining multiple messages together, like
in Nagle's algorithm, but then you introduce extra delays which is
exactly what you don't want.

From my perspective, you seem to be struggling to find reasons why this
should not happen, rather than seeing the alternatives that would
obviously present themselves if your attitude was a positive one. We
won't make any progress with this style of discussion.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#32

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Simon Riggs (#31)

Re: Synchronous replication - patch status inquiry

On 03/09/10 10:45, Simon Riggs wrote:

On Fri, 2010-09-03 at 09:55 +0300, Heikki Linnakangas wrote:

On 03/09/10 09:36, Simon Riggs wrote:

On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:

That design would affect what the standby should reply. If we choose
async/recv/fsync/replay on a per-transaction basis, the standby
should send multiple LSNs and the master needs to decide when
replication has been completed. OTOH, if we choose just sync/async,
the standby has only to send one LSN.

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

No, it doesn't. There is no requirement for additional messages.

Please explain how you do it then. When a commit record is sent to the
standby, it needs to acknowledge it 1) when it has received it, 2) when
it fsyncs it to disk and 3) when it's replayed. I don't see how you can
get around that.

Perhaps you can save a bit by combining multiple messages together, like
in Nagle's algorithm, but then you introduce extra delays which is
exactly what you don't want.

From my perspective, you seem to be struggling to find reasons why this
should not happen, rather than seeing the alternatives that would
obviously present themselves if your attitude was a positive one. We
won't make any progress with this style of discussion.

Huh? You made a very clear claim above that you don't need additional
messages. I explained why I don't think that's true, and asked you to
explain why you think it is true. Whether the claim is true or not does
not depend on my attitude.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#33

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Heikki Linnakangas (#32)

Re: Synchronous replication - patch status inquiry

On Fri, 2010-09-03 at 12:33 +0300, Heikki Linnakangas wrote:

On 03/09/10 10:45, Simon Riggs wrote:

On Fri, 2010-09-03 at 09:55 +0300, Heikki Linnakangas wrote:

On 03/09/10 09:36, Simon Riggs wrote:

On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:

That design would affect what the standby should reply. If we choose
async/recv/fsync/replay on a per-transaction basis, the standby
should send multiple LSNs and the master needs to decide when
replication has been completed. OTOH, if we choose just sync/async,
the standby has only to send one LSN.

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

No, it doesn't. There is no requirement for additional messages.

Please explain how you do it then. When a commit record is sent to the
standby, it needs to acknowledge it 1) when it has received it, 2) when
it fsyncs it to disk and 3) when it's replayed. I don't see how you can
get around that.

Perhaps you can save a bit by combining multiple messages together, like
in Nagle's algorithm, but then you introduce extra delays which is
exactly what you don't want.

From my perspective, you seem to be struggling to find reasons why this
should not happen, rather than seeing the alternatives that would
obviously present themselves if your attitude was a positive one. We
won't make any progress with this style of discussion.

Huh? You made a very clear claim above that you don't need additional
messages. I explained why I don't think that's true, and asked you to
explain why you think it is true. Whether the claim is true or not does
not depend on my attitude.

Why exactly would we need to send 3 messages when we could send 1?
Replace your statements of "it needs to" with "why would it" instead.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#34

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Simon Riggs (#33)

Re: Synchronous replication - patch status inquiry

On 03/09/10 13:20, Simon Riggs wrote:

On Fri, 2010-09-03 at 12:33 +0300, Heikki Linnakangas wrote:

On 03/09/10 10:45, Simon Riggs wrote:

On Fri, 2010-09-03 at 09:55 +0300, Heikki Linnakangas wrote:

On 03/09/10 09:36, Simon Riggs wrote:

On Fri, 2010-09-03 at 12:50 +0900, Fujii Masao wrote:

That design would affect what the standby should reply. If we choose
async/recv/fsync/replay on a per-transaction basis, the standby
should send multiple LSNs and the master needs to decide when
replication has been completed. OTOH, if we choose just sync/async,
the standby has only to send one LSN.

The former seems to be more useful, but triples the number of ACK
from the standby. I'm not sure whether its overhead is ignorable,
especially when the distance between the master and the standby is
very long.

No, it doesn't. There is no requirement for additional messages.

Please explain how you do it then. When a commit record is sent to the
standby, it needs to acknowledge it 1) when it has received it, 2) when
it fsyncs it to disk and 3) when it's replayed. I don't see how you can
get around that.

Perhaps you can save a bit by combining multiple messages together, like
in Nagle's algorithm, but then you introduce extra delays which is
exactly what you don't want.

From my perspective, you seem to be struggling to find reasons why this
should not happen, rather than seeing the alternatives that would
obviously present themselves if your attitude was a positive one. We
won't make any progress with this style of discussion.

Huh? You made a very clear claim above that you don't need additional
messages. I explained why I don't think that's true, and asked you to
explain why you think it is true. Whether the claim is true or not does
not depend on my attitude.

Why exactly would we need to send 3 messages when we could send 1?
Replace your statements of "it needs to" with "why would it" instead.

(scratches head..) What's the point of differentiating
received/fsynced/replayed, if the master receives the ack for all of
them at the same time?

Let's try this with an example: In the master, I do stuff and commit a
transaction. I want to know when the transaction is fsynced in the
standby. The WAL is sent to the standby, up to the commit record.

Upthread you said that:

The standby does *not* need
to know the wishes of transactions on the master.

So, when does standby send the single message back to the master?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#35

Dimitri Fontaine

dfontaine@hi-media.com

over 15 years ago

In reply to: Heikki Linnakangas (#34)

Re: Synchronous replication - patch status inquiry

Disclaimer: I have understood things in a way that allows me to answer
here; I don't know at all if that's the way it's meant to be understood.

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

(scratches head..) What's the point of differentiating
received/fsynced/replayed, if the master receives the ack for all of them at
the same time?

It wouldn't, the way I understand Simon's proposal.

What's happening is that the feedback channel is periodically sending an
array of 3 LSNs: the currently last received, fsync()ed and applied ones.

Now what you're saying is that we should feed back this information
after each recovery step forward, what Simon is saying is that we could
have a looser coupling between the slave activity and the feedback
channel to the master.

That means the master will not see all the slave's restoring activity,
but as LSNs form a monotonic sequence that's not a problem: we can use
<= rather than = in the wait-and-wakeup loop on the master.

Let's try this with an example: In the master, I do stuff and commit a
transaction. I want to know when the transaction is fsynced in the
standby. The WAL is sent to the standby, up to the commit record.

[...]

So, when does standby send the single message back to the master?

The standby is sending a stream of messages to the master with current
LSN positions at the time the message is sent. Given a synchronous
transaction, the master would wait until the feedback stream reports
that the current transaction is in the past compared to the streamed
last known synced one (or the same).
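
A minimal sketch of that wait-and-wakeup logic, assuming positions are plain integers and backends are identified by an arbitrary id. The class is hypothetical and ignores locking; it only shows that comparing with <= lets a single feedback message release every commit at or below the reported position, even if intermediate positions were never reported.

```python
import heapq


class CommitWaiters:
    """Tracks backends waiting for the standby to sync past their commit."""

    def __init__(self):
        self._heap = []        # (commit_lsn, backend_id), smallest first
        self.synced_lsn = 0    # last synced position reported by the standby

    def wait_for(self, commit_lsn, backend_id):
        """Register a waiter; return True if no wait is needed."""
        if commit_lsn <= self.synced_lsn:
            return True
        heapq.heappush(self._heap, (commit_lsn, backend_id))
        return False

    def on_feedback(self, synced_lsn):
        """Apply one feedback message and return the backends to wake."""
        self.synced_lsn = max(self.synced_lsn, synced_lsn)
        woken = []
        while self._heap and self._heap[0][0] <= self.synced_lsn:
            woken.append(heapq.heappop(self._heap)[1])
        return woken
```

How often the standby sends that feedback is left open here; the sketch works the same whether messages arrive after every step or at looser intervals.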

Hope this helps, regards,
--
dim

#36

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Dimitri Fontaine (#35)

Re: Synchronous replication - patch status inquiry

On 06/09/10 16:03, Dimitri Fontaine wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

(scratches head..) What's the point of differentiating
received/fsynced/replayed, if the master receives the ack for all of them at
the same time?

It wouldn't, the way I understand Simon's proposal.

What's happening is that the feedback channel is periodically sending an
array of 3 LSNs: the currently last received, fsync()ed and applied ones.

"Periodically" is a performance problem. The bottleneck in synchronous
replication is typically the extra round-trip between master and
standby, as the master needs to wait for the acknowledgment. Any delays
in sending that acknowledgment lead directly to a decrease in
performance. That's also why we need to eliminate the polling loops in
walsender and walreceiver, and make them react immediately when there's
work to do.

Let's try this with an example: In the master, I do stuff and commit a
transaction. I want to know when the transaction is fsynced in the
standby. The WAL is sent to the standby, up to the commit record.

[...]

So, when does standby send the single message back to the master?

The standby is sending a stream of messages to the master with current
LSN positions at the time the message is sent. Given a synchronous
transaction, the master would wait until the feedback stream reports
that the current transaction is in the past compared to the streamed
last known synced one (or the same).

That doesn't really answer the question: *when* does standby send back
the acknowledgment?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#37

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Heikki Linnakangas (#36)

Re: Synchronous replication - patch status inquiry

On Mon, 2010-09-06 at 16:14 +0300, Heikki Linnakangas wrote:

The standby is sending a stream of messages to the master with current
LSN positions at the time the message is sent. Given a synchronous
transaction, the master would wait until the feedback stream reports
that the current transaction is in the past compared to the streamed
last known synced one (or the same).

That doesn't really answer the question: *when* does standby send back
the acknowledgment?

I think you should explain when you think this happens in your proposal.

Are you saying that you think the standby should send back one message
for every transaction? That you do not think we should buffer the return
messages?

You seem to be proposing a design for responsiveness to a single
transaction, not for overall throughput. That's certainly a design
choice, but it wouldn't be my recommendation that we did that.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#38

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Simon Riggs (#37)

Re: Synchronous replication - patch status inquiry

On Mon, Sep 6, 2010 at 10:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

That doesn't really answer the question: *when* does standby send back
the acknowledgment?

I think you should explain when you think this happens in your proposal.

Are you saying that you think the standby should send back one message
for every transaction? That you do not think we should buffer the return
messages?

That's certainly what I was assuming - I can't speak for anyone else, of course.

You seem to be proposing a design for responsiveness to a single
transaction, not for overall throughput. That's certainly a design
choice, but it wouldn't be my recommendation that we did that.

Gee, I thought that if we tried to buffer the messages, you'd end up
*reducing* overall throughput. Suppose we have a busy system. The
number of simultaneous transactions in flight is limited by
max_connections. So it seems to me that if each transaction takes X%
longer to commit, then throughput will be reduced by X%. And as
you've said, batching responses will make individual transactions less
responsive. The corresponding advantage of batching the responses is
that you reduce consumption of network bandwidth, but I don't think
that's normally where the bottleneck will be.

Of course, you might be able to opportunistically combine messages, if
additional transactions become ready to acknowledge after the first
one has become ready but before the acknowledgement has actually been
sent. But waiting to try to increase the batch size doesn't seem
right.
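
The throughput argument above can be put into a back-of-the-envelope model. This is only a sketch under stated assumptions: a closed system in which every connection always has exactly one transaction committing, and commit latency is the only cost. The function names are hypothetical and come from no patch. Under that model a commit that takes X% longer costs X/(100+X) of the throughput, which is close to X% when X is small.

```c
#include <assert.h>

/*
 * Upper bound on commits per second in a closed system: n_clients
 * connections, each committing one transaction at a time, and each
 * commit taking latency_ms milliseconds in total.
 */
static double
max_commits_per_sec(int n_clients, double latency_ms)
{
	assert(n_clients > 0 && latency_ms > 0.0);
	return n_clients * 1000.0 / latency_ms;
}

/*
 * Fraction of throughput lost when every commit takes extra_pct percent
 * longer, under the same closed-system assumption.
 */
static double
throughput_loss_fraction(double extra_pct)
{
	assert(extra_pct >= 0.0);
	return extra_pct / (100.0 + extra_pct);
}
```

With 100 connections and a 10 ms commit the bound is 10000 commits per second; stretching the commit to 12 ms (20% longer) lowers it to about 8333 per second, a loss of roughly 17%.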

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#39

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Simon Riggs (#37)

Re: Synchronous replication - patch status inquiry

On 06/09/10 17:14, Simon Riggs wrote:

On Mon, 2010-09-06 at 16:14 +0300, Heikki Linnakangas wrote:

The standby is sending a stream of messages to the master with current
LSN positions at the time the message is sent. Given a synchronous
transaction, the master would wait until the feedback stream reports
that the current transaction is in the past compared to the streamed
last known synced one (or the same).

That doesn't really answer the question: *when* does standby send back
the acknowledgment?

I think you should explain when you think this happens in your proposal.

Are you saying that you think the standby should send back one message
for every transaction? That you do not think we should buffer the return
messages?

For the sake of argument, yes that's what I was thinking. Now please
explain how *you're* thinking it should work.

You seem to be proposing a design for responsiveness to a single
transaction, not for overall throughput. That's certainly a design
choice, but it wouldn't be my recommendation that we did that.

Sure, if there's more traffic, you can combine things. For example, if
one fsync in the standby flushes more than one commit record, you only
need one acknowledgment for all of them.

But don't dodge the question!
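
To illustrate the point about combining acknowledgments: if the standby reports a single flushed position, every commit whose record ends at or before that position is covered by the one message. A minimal sketch, assuming WAL positions can be compared as plain 64-bit offsets (the real XLogRecPtr is a two-field struct) and using a hypothetical helper name that is not taken from any patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position: a plain 64-bit byte offset. */
typedef uint64_t Lsn;

/*
 * Count how many waiting commits are released by one acknowledgment.
 * commit_lsns[] holds the end positions of the commit records that
 * backends are waiting on; acked_lsn is the position the standby
 * reported as flushed.
 */
static size_t
commits_released_by_ack(const Lsn *commit_lsns, size_t n, Lsn acked_lsn)
{
	size_t		released = 0;

	assert(commit_lsns != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
	{
		/* A commit is covered once the standby is at or past its record. */
		if (commit_lsns[i] <= acked_lsn)
			released++;
	}
	return released;
}
```

For example, with commits waiting at positions 100, 180, 250 and 400, one acknowledgment for position 250 releases the first three.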

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#40

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Heikki Linnakangas (#39)

Re: Synchronous replication - patch status inquiry

On Tue, 2010-09-07 at 09:27 +0300, Heikki Linnakangas wrote:

On 06/09/10 17:14, Simon Riggs wrote:

On Mon, 2010-09-06 at 16:14 +0300, Heikki Linnakangas wrote:

The standby is sending a stream of messages to the master with current
LSN positions at the time the message is sent. Given a synchronous
transaction, the master would wait until the feedback stream reports
that the current transaction is in the past compared to the streamed
last known synced one (or the same).

That doesn't really answer the question: *when* does standby send back
the acknowledgment?

I think you should explain when you think this happens in your proposal.

Are you saying that you think the standby should send back one message
for every transaction? That you do not think we should buffer the return
messages?

For the sake of argument, yes that's what I was thinking. Now please
explain how *you're* thinking it should work.

The WAL is sent from master to standby in 8192 byte chunks, frequently
including multiple commits. From standby, one reply per chunk. If we
need to wait for apply while nothing else is received, we do.

You seem to be proposing a design for responsiveness to a single
transaction, not for overall throughput. That's certainly a design
choice, but it wouldn't be my recommendation that we did that.

Sure, if there's more traffic, you can combine things. For example, if
one fsync in the standby flushes more than one commit record, you only
need one acknowledgment for all of them.

But don't dodge the question!

Given that I've previously outlined the size and contents of request
packets, their role and frequency I don't think I've dodged anything; in
fact, I've almost outlined the whole design for you.

I am coding something to demonstrate the important aspects I've
espoused, just as you have done in the past when I didn't appreciate
and/or understand your ideas. That seems like the best way forwards
rather than wrangle through all the "that can't work" responses, which
actually takes longer.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#41

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 15 years ago

In reply to: Simon Riggs (#40)

Re: Synchronous replication - patch status inquiry

On 07/09/10 12:47, Simon Riggs wrote:

The WAL is sent from master to standby in 8192 byte chunks, frequently
including multiple commits. From standby, one reply per chunk. If we
need to wait for apply while nothing else is received, we do.

Ok, thank you. The obvious performance problem is that even if you
define a transaction to use synchronization level 'recv', and there's no
other concurrent transactions running, you actually need to wait until
it's applied. If you have only one client, there is no difference
between the levels, you always get the same performance hit you get with
'apply'. With more clients, you get some benefit, but there's still
plenty of delays compared to the optimum.

Also remember that there can be a very big gap between when a record is
fsync'd and when it's applied, if the recovery needs to wait for a hot
standby transaction to finish.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#42

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Heikki Linnakangas (#41)

Re: Synchronous replication - patch status inquiry

On Tue, 2010-09-07 at 13:11 +0300, Heikki Linnakangas wrote:

The obvious performance problem

Is not obvious at all, and you misunderstand again. This emphasises the
need for me to show code.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#43

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Simon Riggs (#40)

Re: Synchronous replication - patch status inquiry

Simon Riggs <simon@2ndQuadrant.com> writes:

On Tue, 2010-09-07 at 09:27 +0300, Heikki Linnakangas wrote:

For the sake of argument, yes that's what I was thinking. Now please
explain how *you're* thinking it should work.

The WAL is sent from master to standby in 8192 byte chunks, frequently
including multiple commits. From standby, one reply per chunk. If we
need to wait for apply while nothing else is received, we do.

That premise is completely false. SR does not send WAL in page units.
If it did, it would have the same performance problems as the old
WAL-file-at-a-time implementation, just with slightly smaller
granularity.

regards, tom lane

#44

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Tom Lane (#43)

Re: Synchronous replication - patch status inquiry

On Tue, 2010-09-07 at 10:47 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Tue, 2010-09-07 at 09:27 +0300, Heikki Linnakangas wrote:

For the sake of argument, yes that's what I was thinking. Now please
explain how *you're* thinking it should work.

The WAL is sent from master to standby in 8192 byte chunks, frequently
including multiple commits. From standby, one reply per chunk. If we
need to wait for apply while nothing else is received, we do.

That premise is completely false. SR does not send WAL in page units.
If it did, it would have the same performance problems as the old
WAL-file-at-a-time implementation, just with slightly smaller
granularity.

There's no dependence on pages in that proposal, so don't understand.

What aspect of the above would you change? and to what?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#45

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Simon Riggs (#44)

Re: Synchronous replication - patch status inquiry

Simon Riggs <simon@2ndQuadrant.com> writes:

On Tue, 2010-09-07 at 10:47 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

The WAL is sent from master to standby in 8192 byte chunks, frequently
including multiple commits. From standby, one reply per chunk. If we
need to wait for apply while nothing else is received, we do.

That premise is completely false. SR does not send WAL in page units.
If it did, it would have the same performance problems as the old
WAL-file-at-a-time implementation, just with slightly smaller
granularity.

There's no dependence on pages in that proposal, so don't understand.

Oh, well you certainly didn't explain it well then.

What I *think* you're saying is that the slave doesn't send per-commit
messages, but instead processes the WAL as it's received and then sends
a heres-where-I-am status message back upstream immediately before going
to sleep waiting for the next chunk. That's fine as far as the protocol
goes, but I'm not convinced that it really does all that much in terms
of improving performance. You still have the problem that the master
has to fsync its WAL before it can send it to the slave. Also, the
slave won't know whether it ought to fsync its own WAL before replying.

regards, tom lane

#46

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Tom Lane (#45)

Re: Synchronous replication - patch status inquiry

On Tue, Sep 7, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh, well you certainly didn't explain it well then.

What I *think* you're saying is that the slave doesn't send per-commit
messages, but instead processes the WAL as it's received and then sends
a heres-where-I-am status message back upstream immediately before going
to sleep waiting for the next chunk. That's fine as far as the protocol
goes, but I'm not convinced that it really does all that much in terms
of improving performance. You still have the problem that the master
has to fsync its WAL before it can send it to the slave.

We have that problem in all of these proposals, don't we? We
certainly have no infrastructure to handle the slave getting ahead of
the master in the WAL stream.

Also, the
slave won't know whether it ought to fsync its own WAL before replying.

Right. And whether it ought to replay it before replying.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#47

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Tom Lane (#45)

Re: Synchronous replication - patch status inquiry

On Tue, 2010-09-07 at 11:41 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Tue, 2010-09-07 at 10:47 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

The WAL is sent from master to standby in 8192 byte chunks, frequently
including multiple commits. From standby, one reply per chunk. If we
need to wait for apply while nothing else is received, we do.

That premise is completely false. SR does not send WAL in page units.
If it did, it would have the same performance problems as the old
WAL-file-at-a-time implementation, just with slightly smaller
granularity.

There's no dependence on pages in that proposal, so don't understand.

Oh, well you certainly didn't explain it well then.

What I *think* you're saying is that the slave doesn't send per-commit
messages, but instead processes the WAL as it's received and then sends
a heres-where-I-am status message back upstream immediately before going
to sleep waiting for the next chunk. That's fine as far as the protocol
goes, but I'm not convinced that it really does all that much in terms
of improving performance. You still have the problem that the master
has to fsync its WAL before it can send it to the slave. Also, the
slave won't know whether it ought to fsync its own WAL before replying.

Yes, apart from last sentence. Please wait for the code.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#48

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Simon Riggs (#47)

Re: Synchronous replication - patch status inquiry

On Tue, Sep 7, 2010 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

What I *think* you're saying is that the slave doesn't send per-commit
messages, but instead processes the WAL as it's received and then sends
a heres-where-I-am status message back upstream immediately before going
to sleep waiting for the next chunk. That's fine as far as the protocol
goes, but I'm not convinced that it really does all that much in terms
of improving performance. You still have the problem that the master
has to fsync its WAL before it can send it to the slave. Also, the
slave won't know whether it ought to fsync its own WAL before replying.

Yes, apart from last sentence. Please wait for the code.

So, we're going around and around in circles here because you're
repeatedly refusing to explain how the slave will know WHEN to send
acknowledgments back to the master without knowing which sync rep
level is in use. It seems to be perfectly evident to everyone else
here that there are only two ways for this to work: either the value
is configured on the standby, or there's a registration system on the
master and the master tells the standby its wishes. Instead of asking
the entire community to wait for an unspecified period of time for you
to write code that will handle this in an unspecified way, how about
answering the question? We've wasted far too much time arguing about
this already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#49

Simon Riggs

simon@2ndQuadrant.com

over 15 years ago

In reply to: Robert Haas (#48)

Re: Synchronous replication - patch status inquiry

On Tue, 2010-09-07 at 12:07 -0400, Robert Haas wrote:

On Tue, Sep 7, 2010 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

What I *think* you're saying is that the slave doesn't send per-commit
messages, but instead processes the WAL as it's received and then sends
a heres-where-I-am status message back upstream immediately before going
to sleep waiting for the next chunk. That's fine as far as the protocol
goes, but I'm not convinced that it really does all that much in terms
of improving performance. You still have the problem that the master
has to fsync its WAL before it can send it to the slave. Also, the
slave won't know whether it ought to fsync its own WAL before replying.

Yes, apart from last sentence. Please wait for the code.

So, we're going around and around in circles here because you're
repeatedly refusing to explain how the slave will know WHEN to send
acknowledgments back to the master without knowing which sync rep
level is in use. It seems to be perfectly evident to everyone else
here that there are only two ways for this to work: either the value
is configured on the standby, or there's a registration system on the
master and the master tells the standby its wishes. Instead of asking
the entire community to wait for an unspecified period of time for you
to write code that will handle this in an unspecified way, how about
answering the question? We've wasted far too much time arguing about
this already.

Every time I explain anything, I get someone run around shouting "but
that can't work!". I'm sorry, but again your logic is poor and the bias
against properly considering viable alternatives is the only thing
perfectly evident. So yes, I agree, it is a waste of time discussing it
until I show working code.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#50

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Simon Riggs (#49)

Re: Synchronous replication - patch status inquiry

On Tue, Sep 7, 2010 at 2:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Every time I explain anything, I get someone run around shouting "but
that can't work!". I'm sorry, but again your logic is poor and the bias
against properly considering viable alternatives is the only thing
perfectly evident. So yes, I agree, it is a waste of time discussing it
until I show working code.

Obviously you don't "agree", because that's the exact opposite of what
I just said.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#51

Bruce Momjian

bruce@momjian.us

over 15 years ago

In reply to: Robert Haas (#48)

Re: Synchronous replication - patch status inquiry

Robert Haas wrote:

On Tue, Sep 7, 2010 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

What I *think* you're saying is that the slave doesn't send per-commit
messages, but instead processes the WAL as it's received and then sends
a heres-where-I-am status message back upstream immediately before going
to sleep waiting for the next chunk. That's fine as far as the protocol
goes, but I'm not convinced that it really does all that much in terms
of improving performance. You still have the problem that the master
has to fsync its WAL before it can send it to the slave. Also, the
slave won't know whether it ought to fsync its own WAL before replying.

Yes, apart from last sentence. Please wait for the code.

So, we're going around and around in circles here because you're
repeatedly refusing to explain how the slave will know WHEN to send
acknowledgments back to the master without knowing which sync rep
level is in use. It seems to be perfectly evident to everyone else
here that there are only two ways for this to work: either the value
is configured on the standby, or there's a registration system on the
master and the master tells the standby its wishes. Instead of asking
the entire community to wait for an unspecified period of time for you
to write code that will handle this in an unspecified way, how about
answering the question? We've wasted far too much time arguing about
this already.

Ideally I would like the sync method to be set on each slave, and have
some method for the master to query the sync mode of all the slaves, e.g.
appname.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#52

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Fujii Masao (#28)

1 attachment(s)

Re: Synchronous replication - patch status inquiry

On Fri, Sep 3, 2010 at 3:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Here is the proposed detailed design:

standbys.conf
=============
# This is not initialized by initdb, so users need to create it under $PGDATA.
* The template is located in the PREFIX/share directory.

# This is read by postmaster at startup, just as pg_hba.conf is.
* In an EXEC_BACKEND environment, each walsender must read it at startup.
* This is ignored when max_wal_senders is zero.
* FATAL is emitted if standbys.conf doesn't exist while max_wal_senders
is positive.

# SIGHUP makes only the postmaster re-read standbys.conf.
* New configuration doesn't affect the existing connections to the standbys,
i.e., it's used only for subsequent connections.
* XXX: Should the existing connections react to new configuration? What if
new standbys.conf doesn't have the standby_name of the existing
connection?

# The connection from the standby is rejected if its standby_name is not listed
in standbys.conf.
* Multiple standbys with the same name are allowed.

# The valid values of SYNCHRONOUS field are async, recv, fsync and replay.

standby_name
============
# This is a new string-typed parameter in recovery.conf.
* XXX: Should standby_name and standby_mode be merged?

# Walreceiver sends this to the master when establishing the connection.

The attached patch implements the above and a simple synchronous replication
feature, which doesn't include quorum commit capability. The replication
mode (async, recv, fsync, replay) can be specified on a per-standby basis,
in standbys.conf.
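
As a rough illustration of what the four replication modes could mean on the standby side, here is a sketch; it is not code from the patch. The enum, struct and function names are hypothetical, and the mapping of each mode to a WAL position (recv = received, fsync = flushed to disk, replay = applied) is one plausible reading of the mode names rather than a confirmed description of the patch's behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Lsn;			/* simplified WAL position (assumption) */

typedef enum
{
	MODE_ASYNC,
	MODE_RECV,
	MODE_FSYNC,
	MODE_REPLAY,
	MODE_INVALID
} SyncMode;

/* Names in the same order as the enum values above. */
static const char *const mode_names[] = {"async", "recv", "fsync", "replay"};

/* Look up a mode by its configuration-file name. */
static SyncMode
mode_from_name(const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < sizeof(mode_names) / sizeof(mode_names[0]); i++)
	{
		if (strcmp(mode_names[i], name) == 0)
			return (SyncMode) i;
	}
	return MODE_INVALID;
}

/* How far the standby has progressed through the WAL stream. */
typedef struct
{
	Lsn			received;
	Lsn			flushed;
	Lsn			replayed;
} StandbyProgress;

/*
 * Pick the position to acknowledge for the given mode. Returns false
 * when the mode needs no acknowledgment (async or invalid).
 */
static bool
ack_position(SyncMode mode, const StandbyProgress *p, Lsn *ack)
{
	assert(p != NULL && ack != NULL);
	switch (mode)
	{
		case MODE_RECV:
			*ack = p->received;
			return true;
		case MODE_FSYNC:
			*ack = p->flushed;
			return true;
		case MODE_REPLAY:
			*ack = p->replayed;
			return true;
		default:
			return false;
	}
}
```

The sketch shows why the standby has to learn the mode from somewhere: the same progress state yields a different acknowledgment position for each mode.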

The patch still uses a poll loop in the backend, walsender, startup process
and walreceiver. Once the latch feature Heikki proposed has been committed,
I'll replace the poll loops with latches.

The documentation has not been fully updated yet. I'll work on it until
the deadline of the next CF.
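
The protocol.sgml hunk in the attached patch documents the acknowledgment message as Byte1('l') followed by a Byte8 position in XLogRecPtr format. The following sketch shows how such a 9-byte message could be built and parsed. The function and type names are hypothetical, and the byte order (each 32-bit field in network order) is an assumption; the patch may simply copy the struct in native byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the two 32-bit fields of XLogRecPtr (xlogid, xrecoff). */
typedef struct
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} WalPos;

#define ACK_MSG_LEN 9			/* Byte1('l') + Byte8 position */

/* Store a 32-bit value in big-endian (network) order. */
static void
put_u32_be(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char) (v >> 24);
	out[1] = (unsigned char) (v >> 16);
	out[2] = (unsigned char) (v >> 8);
	out[3] = (unsigned char) v;
}

/* Read a 32-bit big-endian value. */
static uint32_t
get_u32_be(const unsigned char *in)
{
	return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16) |
		((uint32_t) in[2] << 8) | (uint32_t) in[3];
}

/* Build the message; returns bytes written, or 0 if buf is too small. */
static size_t
encode_ack(unsigned char *buf, size_t buflen, WalPos pos)
{
	assert(buf != NULL);
	if (buflen < ACK_MSG_LEN)
		return 0;
	buf[0] = 'l';				/* message type: replication acknowledgment */
	put_u32_be(buf + 1, pos.xlogid);
	put_u32_be(buf + 5, pos.xrecoff);
	return ACK_MSG_LEN;
}

/* Parse the message; returns 1 on success, 0 on wrong length or type. */
static int
decode_ack(const unsigned char *buf, size_t len, WalPos *pos)
{
	assert(buf != NULL && pos != NULL);
	if (len != ACK_MSG_LEN || buf[0] != 'l')
		return 0;
	pos->xlogid = get_u32_be(buf + 1);
	pos->xrecoff = get_u32_be(buf + 5);
	return 1;
}
```

A fixed byte order keeps the message independent of the CPU architecture on either side, at the cost of a few shifts per acknowledgment.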

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

synchrep_0910.patch (application/octet-stream)

*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
***************
*** 1360,1365 **** The commands accepted in walsender mode are:
--- 1360,1400 ----
        <variablelist>
        <varlistentry>
        <term>
+           XLogRecPtr (F)
+       </term>
+       <listitem>
+       <para>
+       <variablelist>
+       <varlistentry>
+       <term>
+           Byte1('l')
+       </term>
+       <listitem>
+       <para>
+           Identifies the message as an acknowledgment of replication.
+       </para>
+       </listitem>
+       </varlistentry>
+       <varlistentry>
+       <term>
+           Byte8
+       </term>
+       <listitem>
+       <para>
+           The end of the WAL data replicated to the standby, given in
+           XLogRecPtr format.
+       </para>
+       </listitem>
+       </varlistentry>
+       </variablelist>
+       </para>
+       </listitem>
+       </varlistentry>
+       </variablelist>
+ 
+       <variablelist>
+       <varlistentry>
+       <term>
            XLogData (B)
        </term>
        <listitem>
*** a/src/backend/Makefile
--- b/src/backend/Makefile
***************
*** 208,213 **** endif
--- 208,214 ----
  	$(INSTALL_DATA) $(srcdir)/libpq/pg_ident.conf.sample '$(DESTDIR)$(datadir)/pg_ident.conf.sample'
  	$(INSTALL_DATA) $(srcdir)/utils/misc/postgresql.conf.sample '$(DESTDIR)$(datadir)/postgresql.conf.sample'
  	$(INSTALL_DATA) $(srcdir)/access/transam/recovery.conf.sample '$(DESTDIR)$(datadir)/recovery.conf.sample'
+ 	$(INSTALL_DATA) $(srcdir)/replication/standbys.conf.sample '$(DESTDIR)$(datadir)/standbys.conf.sample'
  
  install-bin: postgres $(POSTGRES_IMP) installdirs
  	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postgres$(X)'
***************
*** 262,268 **** endif
  	rm -f '$(DESTDIR)$(datadir)/pg_hba.conf.sample' \
  	      '$(DESTDIR)$(datadir)/pg_ident.conf.sample' \
                '$(DESTDIR)$(datadir)/postgresql.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/recovery.conf.sample'
  
  
  ##########################################################################
--- 263,270 ----
  	rm -f '$(DESTDIR)$(datadir)/pg_hba.conf.sample' \
  	      '$(DESTDIR)$(datadir)/pg_ident.conf.sample' \
                '$(DESTDIR)$(datadir)/postgresql.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/recovery.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/standbys.conf.sample'
  
  
  ##########################################################################
*** a/src/backend/access/transam/recovery.conf.sample
--- b/src/backend/access/transam/recovery.conf.sample
***************
*** 91,102 ****
  #---------------------------------------------------------------------------
  #
  # When standby_mode is enabled, the PostgreSQL server will work as
! # a standby. It tries to connect to the primary according to the
! # connection settings primary_conninfo, and receives XLOG records
! # continuously.
  #
  #standby_mode = 'off'
  #
  #primary_conninfo = ''		# e.g. 'host=localhost port=5432'
  #
  #
--- 91,104 ----
  #---------------------------------------------------------------------------
  #
  # When standby_mode is enabled, the PostgreSQL server will work as
! # a standby under the name of standby_name. It tries to connect to
! # the primary according to the connection settings primary_conninfo,
! # and receives XLOG records continuously.
  #
  #standby_mode = 'off'
  #
+ #standby_name = ''
+ #
  #primary_conninfo = ''		# e.g. 'host=localhost port=5432'
  #
  #
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 55,60 ****
--- 55,61 ----
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "pgstat.h"
+ #include "replication/walsender.h"
  #include "storage/fd.h"
  #include "storage/procarray.h"
  #include "storage/sinvaladt.h"
***************
*** 1062,1067 **** EndPrepare(GlobalTransaction gxact)
--- 1063,1080 ----
  
  	END_CRIT_SECTION();
  
+ 	/*
+ 	 * Wait for WAL to be replicated up to the PREPARE record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the PREPARE record is generated and before other
+ 	 * transactions know that this one has already been prepared.
+ 	 *
+ 	 * XXX: Since the caller prevents cancel/die interrupt, we cannot
+ 	 * process that while waiting. Should we remove this restriction?
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(gxact->prepare_lsn);
+ 
  	records.tail = records.head = NULL;
  }
  
***************
*** 2012,2017 **** RecordTransactionCommitPrepared(TransactionId xid,
--- 2025,2039 ----
  	MyProc->inCommit = false;
  
  	END_CRIT_SECTION();
+ 
+ 	/*
+ 	 * Wait for WAL to be replicated up to the COMMIT PREPARED record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the COMMIT PREPARED record is generated and before other
+ 	 * transactions know that this one has already been committed.
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(recptr);
  }
  
  /*
***************
*** 2084,2087 **** RecordTransactionAbortPrepared(TransactionId xid,
--- 2106,2118 ----
  	TransactionIdAbortTree(xid, nchildren, children);
  
  	END_CRIT_SECTION();
+ 
+ 	/*
+ 	 * Wait for WAL to be replicated up to the ABORT PREPARED record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the ABORT PREPARED record is generated and before other
+ 	 * transactions know that this one has already been aborted.
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(recptr);
  }
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 36,41 ****
--- 36,42 ----
  #include "libpq/be-fsstubs.h"
  #include "miscadmin.h"
  #include "pgstat.h"
+ #include "replication/walsender.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/lmgr.h"
***************
*** 1110,1115 **** RecordTransactionCommit(void)
--- 1111,1128 ----
  	/* Compute latestXid while we have the child XIDs handy */
  	latestXid = TransactionIdLatest(xid, nchildren, children);
  
+ 	/*
+ 	 * Wait for WAL to be replicated up to the COMMIT record if replication
+ 	 * is enabled. This operation has to be performed after the COMMIT record
+ 	 * is generated and before other transactions know that this one has
+ 	 * already been committed.
+ 	 *
+ 	 * XXX: Since the caller prevents cancel/die interrupt, we cannot
+ 	 * process that while waiting. Should we remove this restriction?
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(XactLastRecEnd);
+ 
  	/* Reset XactLastRecEnd until the next transaction writes something */
  	XactLastRecEnd.xrecoff = 0;
  
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 189,194 **** static TimestampTz recoveryTargetTime;
--- 189,195 ----
  static bool StandbyMode = false;
  static char *PrimaryConnInfo = NULL;
  static char *TriggerFile = NULL;
+ static char *StandbyName = NULL;
  
  /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
  static TransactionId recoveryStopXid;
***************
*** 532,537 **** typedef struct xl_parameter_change
--- 533,548 ----
  	int			wal_level;
  } xl_parameter_change;
  
+ /* Replication mode names */
+ const char *ReplicationModeNames[] = {
+ 	"async",				/* REPLICATION_MODE_ASYNC */
+ 	"recv",				/* REPLICATION_MODE_RECV */
+ 	"fsync",				/* REPLICATION_MODE_FSYNC */
+ 	"replay"				/* REPLICATION_MODE_REPLAY */
+ };
+ 
+ ReplicationMode		rplMode = InvalidReplicationMode;
+ 
  /*
   * Flags set by interrupt handlers for later service in the redo loop.
   */
***************
*** 5258,5263 **** readRecoveryCommandFile(void)
--- 5269,5281 ----
  					(errmsg("trigger_file = '%s'",
  							TriggerFile)));
  		}
+ 		else if (strcmp(tok1, "standby_name") == 0)
+ 		{
+ 			StandbyName = pstrdup(tok2);
+ 			ereport(DEBUG2,
+ 					(errmsg("standby_name = '%s'",
+ 							StandbyName)));
+ 		}
  		else
  			ereport(FATAL,
  					(errmsg("unrecognized recovery parameter \"%s\"",
***************
*** 6867,6872 **** GetFlushRecPtr(void)
--- 6885,6907 ----
  }
  
  /*
+  * GetReplayRecPtr -- Returns the last replay position.
+  */
+ XLogRecPtr
+ GetReplayRecPtr(void)
+ {
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogRecPtr	recptr;
+ 
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	recptr = xlogctl->recoveryLastRecPtr;
+ 	SpinLockRelease(&xlogctl->info_lck);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Get the time of the last xlog segment switch
   */
  pg_time_t
***************
*** 8828,8842 **** pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
  Datum
  pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
  {
- 	/* use volatile pointer to prevent code rearrangement */
- 	volatile XLogCtlData *xlogctl = XLogCtl;
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
! 	SpinLockAcquire(&xlogctl->info_lck);
! 	recptr = xlogctl->recoveryLastRecPtr;
! 	SpinLockRelease(&xlogctl->info_lck);
! 
  	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  		PG_RETURN_NULL();
  
--- 8863,8872 ----
  Datum
  pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
  {
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
! 	recptr = GetReplayRecPtr();
  	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  		PG_RETURN_NULL();
  
***************
*** 9467,9473 **** retry:
  						{
  							RequestXLogStreaming(
  									  fetching_ckpt ? RedoStartLSN : *RecPtr,
! 												 PrimaryConnInfo);
  							continue;
  						}
  					}
--- 9497,9503 ----
  						{
  							RequestXLogStreaming(
  									  fetching_ckpt ? RedoStartLSN : *RecPtr,
! 												 PrimaryConnInfo, StandbyName);
  							continue;
  						}
  					}
***************
*** 9681,9683 **** CheckForStandbyTrigger(void)
--- 9711,9727 ----
  	}
  	return false;
  }
+ 
+ /*
+  * Look up replication mode value by name.
+  */
+ ReplicationMode
+ ReplicationModeNameGetValue(char *name)
+ {
+ 	ReplicationMode	mode;
+ 
+ 	for (mode = 0; mode <= MAXREPLICATIONMODE; mode++)
+ 		if (strcmp(ReplicationModeNames[mode], name) == 0)
+ 			return mode;
+ 	return InvalidReplicationMode;
+ }
*** a/src/backend/libpq/hba.c
--- b/src/backend/libpq/hba.c
***************
*** 38,46 ****
  #define atooid(x)  ((Oid) strtoul((x), NULL, 10))
  #define atoxid(x)  ((TransactionId) strtoul((x), NULL, 10))
  
- /* This is used to separate values in multi-valued column strings */
- #define MULTI_VALUE_SEP "\001"
- 
  #define MAX_TOKEN	256
  
  /* callback data for check_network_callback */
--- 38,43 ----
***************
*** 54,59 **** typedef struct check_network_data
--- 51,59 ----
  /* pre-parsed content of HBA config file: list of HbaLine structs */
  static List *parsed_hba_lines = NIL;
  
+ static const char *hba_keywords[] = {"all", "sameuser", "samegroup", "samerole",
+ 									 "replication", NULL};
+ 
  /*
   * These variables hold the pre-parsed contents of the ident usermap
   * configuration file.	ident_lines is a list of sublists, one sublist for
***************
*** 67,76 **** static List *ident_lines = NIL;
  static List *ident_line_nums = NIL;
  
  
- static void tokenize_file(const char *filename, FILE *file,
- 			  List **lines, List **line_nums);
  static char *tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename);
  
  /*
   * isblank() exists in the ISO C99 spec, but it's not very portable yet,
--- 67,74 ----
  static List *ident_line_nums = NIL;
  
  
  static char *tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename, const char **keywords);
  
  /*
   * isblank() exists in the ISO C99 spec, but it's not very portable yet,
***************
*** 108,114 **** pg_isblank(const char c)
   * token.
   */
  static bool
! next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  {
  	int			c;
  	char	   *start_buf = buf;
--- 106,113 ----
   * token.
   */
  static bool
! next_token(const char *filename, FILE *fp, char *buf, int bufsz,
! 		   bool *initial_quote, const char **keywords)
  {
  	int			c;
  	char	   *start_buf = buf;
***************
*** 155,162 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  			*buf = '\0';
  			ereport(LOG,
  					(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 			   errmsg("authentication file token too long, skipping: \"%s\"",
! 					  start_buf)));
  			/* Discard remainder of line */
  			while ((c = getc(fp)) != EOF && c != '\n')
  				;
--- 154,161 ----
  			*buf = '\0';
  			ereport(LOG,
  					(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 			   errmsg("configuration file \"%s\" token too long, skipping: \"%s\"",
! 					  filename, start_buf)));
  			/* Discard remainder of line */
  			while ((c = getc(fp)) != EOF && c != '\n')
  				;
***************
*** 196,211 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  
  	*buf = '\0';
  
! 	if (!saw_quote &&
! 		(strcmp(start_buf, "all") == 0 ||
! 		 strcmp(start_buf, "sameuser") == 0 ||
! 		 strcmp(start_buf, "samegroup") == 0 ||
! 		 strcmp(start_buf, "samerole") == 0 ||
! 		 strcmp(start_buf, "replication") == 0))
  	{
! 		/* append newline to a magical keyword */
! 		*buf++ = '\n';
! 		*buf = '\0';
  	}
  
  	return (saw_quote || buf > start_buf);
--- 195,214 ----
  
  	*buf = '\0';
  
! 	if (!saw_quote)
  	{
! 		const char	**entry;
! 
! 		for (entry = keywords; *entry != NULL; entry++)
! 		{
! 			if (strcmp(start_buf, *entry) == 0)
! 			{
! 				/* append newline to a magical keyword */
! 				*buf++ = '\n';
! 				*buf = '\0';
! 				break;
! 			}
! 		}
  	}
  
  	return (saw_quote || buf > start_buf);
***************
*** 219,225 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
   * The result is a palloc'd string, or NULL if we have reached EOL.
   */
  static char *
! next_token_expand(const char *filename, FILE *file)
  {
  	char		buf[MAX_TOKEN];
  	char	   *comma_str = pstrdup("");
--- 222,228 ----
   * The result is a palloc'd string, or NULL if we have reached EOL.
   */
  static char *
! next_token_expand(const char *filename, FILE *file, const char **keywords)
  {
  	char		buf[MAX_TOKEN];
  	char	   *comma_str = pstrdup("");
***************
*** 231,237 **** next_token_expand(const char *filename, FILE *file)
  
  	do
  	{
! 		if (!next_token(file, buf, sizeof(buf), &initial_quote))
  			break;
  
  		got_something = true;
--- 234,241 ----
  
  	do
  	{
! 		if (!next_token(filename, file, buf, sizeof(buf), &initial_quote,
! 						keywords))
  			break;
  
  		got_something = true;
***************
*** 246,252 **** next_token_expand(const char *filename, FILE *file)
  
  		/* Is this referencing a file? */
  		if (!initial_quote && buf[0] == '@' && buf[1] != '\0')
! 			incbuf = tokenize_inc_file(filename, buf + 1);
  		else
  			incbuf = pstrdup(buf);
  
--- 250,256 ----
  
  		/* Is this referencing a file? */
  		if (!initial_quote && buf[0] == '@' && buf[1] != '\0')
! 			incbuf = tokenize_inc_file(filename, buf + 1, keywords);
  		else
  			incbuf = pstrdup(buf);
  
***************
*** 273,279 **** next_token_expand(const char *filename, FILE *file)
  /*
   * Free memory used by lines/tokens (i.e., structure built by tokenize_file)
   */
! static void
  free_lines(List **lines, List **line_nums)
  {
  	/*
--- 277,283 ----
  /*
   * Free memory used by lines/tokens (i.e., structure built by tokenize_file)
   */
! void
  free_lines(List **lines, List **line_nums)
  {
  	/*
***************
*** 318,324 **** free_lines(List **lines, List **line_nums)
  
  static char *
  tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename)
  {
  	char	   *inc_fullname;
  	FILE	   *inc_file;
--- 322,328 ----
  
  static char *
  tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename, const char **keywords)
  {
  	char	   *inc_fullname;
  	FILE	   *inc_file;
***************
*** 348,354 **** tokenize_inc_file(const char *outer_filename,
  	{
  		ereport(LOG,
  				(errcode_for_file_access(),
! 				 errmsg("could not open secondary authentication file \"@%s\" as \"%s\": %m",
  						inc_filename, inc_fullname)));
  		pfree(inc_fullname);
  
--- 352,358 ----
  	{
  		ereport(LOG,
  				(errcode_for_file_access(),
! 				 errmsg("could not open secondary configuration file \"@%s\" as \"%s\": %m",
  						inc_filename, inc_fullname)));
  		pfree(inc_fullname);
  
***************
*** 357,363 **** tokenize_inc_file(const char *outer_filename,
  	}
  
  	/* There is possible recursion here if the file contains @ */
! 	tokenize_file(inc_fullname, inc_file, &inc_lines, &inc_line_nums);
  
  	FreeFile(inc_file);
  	pfree(inc_fullname);
--- 361,368 ----
  	}
  
  	/* There is possible recursion here if the file contains @ */
! 	tokenize_file(inc_fullname, inc_file, &inc_lines, &inc_line_nums,
! 				  keywords);
  
  	FreeFile(inc_file);
  	pfree(inc_fullname);
***************
*** 404,412 **** tokenize_inc_file(const char *outer_filename,
   *
   * filename must be the absolute path to the target file.
   */
! static void
  tokenize_file(const char *filename, FILE *file,
! 			  List **lines, List **line_nums)
  {
  	List	   *current_line = NIL;
  	int			line_number = 1;
--- 409,417 ----
   *
   * filename must be the absolute path to the target file.
   */
! void
  tokenize_file(const char *filename, FILE *file,
! 			  List **lines, List **line_nums, const char **keywords)
  {
  	List	   *current_line = NIL;
  	int			line_number = 1;
***************
*** 416,422 **** tokenize_file(const char *filename, FILE *file,
  
  	while (!feof(file) && !ferror(file))
  	{
! 		buf = next_token_expand(filename, file);
  
  		/* add token to list, unless we are at EOL or comment start */
  		if (buf)
--- 421,427 ----
  
  	while (!feof(file) && !ferror(file))
  	{
! 		buf = next_token_expand(filename, file, keywords);
  
  		/* add token to list, unless we are at EOL or comment start */
  		if (buf)
***************
*** 1490,1496 **** load_hba(void)
  		return false;
  	}
  
! 	tokenize_file(HbaFileName, file, &hba_lines, &hba_line_nums);
  	FreeFile(file);
  
  	/* Now parse all the lines */
--- 1495,1501 ----
  		return false;
  	}
  
! 	tokenize_file(HbaFileName, file, &hba_lines, &hba_line_nums, hba_keywords);
  	FreeFile(file);
  
  	/* Now parse all the lines */
***************
*** 1809,1815 **** load_ident(void)
  	}
  	else
  	{
! 		tokenize_file(IdentFileName, file, &ident_lines, &ident_line_nums);
  		FreeFile(file);
  	}
  }
--- 1814,1821 ----
  	}
  	else
  	{
! 		tokenize_file(IdentFileName, file, &ident_lines, &ident_line_nums,
! 					  hba_keywords);
  		FreeFile(file);
  	}
  }
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 1063,1069 **** PostmasterMain(int argc, char *argv[])
  	autovac_init();
  
  	/*
! 	 * Load configuration files for client authentication.
  	 */
  	if (!load_hba())
  	{
--- 1063,1069 ----
  	autovac_init();
  
  	/*
! 	 * Load configuration files for client authentication and replication.
  	 */
  	if (!load_hba())
  	{
***************
*** 1075,1080 **** PostmasterMain(int argc, char *argv[])
--- 1075,1085 ----
  				(errmsg("could not load pg_hba.conf")));
  	}
  	load_ident();
+ 	if (max_wal_senders > 0 && !load_standbys())
+ 	{
+ 		ereport(FATAL,
+ 				(errmsg("could not load standbys.conf")));
+ 	}
  
  	/*
  	 * Remember postmaster startup time
***************
*** 1713,1718 **** retry1:
--- 1718,1725 ----
  							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  							 errmsg("invalid value for boolean option \"replication\"")));
  			}
+ 			else if (strcmp(nameptr, "standby_name") == 0)
+ 				standby_name = pstrdup(valptr);
  			else
  			{
  				/* Assume it's a generic GUC option */
***************
*** 2129,2134 **** SIGHUP_handler(SIGNAL_ARGS)
--- 2136,2146 ----
  
  		load_ident();
  
+ 		/* Reload standbys configuration file too */
+ 		if (max_wal_senders > 0 && !load_standbys())
+ 			ereport(WARNING,
+ 					(errmsg("standbys.conf not reloaded")));
+ 
  #ifdef EXEC_BACKEND
  		/* Update the starting-point file for future children */
  		write_nondefault_variables(PGC_SIGHUP);
*** a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
--- b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
***************
*** 47,55 **** static bool justconnected = false;
  static char *recvBuf = NULL;
  
  /* Prototypes for interface functions */
! static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
  				 char **buffer, int *len);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
--- 47,57 ----
  static char *recvBuf = NULL;
  
  /* Prototypes for interface functions */
! static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint,
! 							 char *standbyName);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
  				 char **buffer, int *len);
+ static void libpqrcv_send(const char *buffer, int nbytes);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
***************
*** 64,73 **** _PG_init(void)
  {
  	/* Tell walreceiver how to reach us */
  	if (walrcv_connect != NULL || walrcv_receive != NULL ||
! 		walrcv_disconnect != NULL)
  		elog(ERROR, "libpqwalreceiver already loaded");
  	walrcv_connect = libpqrcv_connect;
  	walrcv_receive = libpqrcv_receive;
  	walrcv_disconnect = libpqrcv_disconnect;
  }
  
--- 66,76 ----
  {
  	/* Tell walreceiver how to reach us */
  	if (walrcv_connect != NULL || walrcv_receive != NULL ||
! 		walrcv_send != NULL || walrcv_disconnect != NULL)
  		elog(ERROR, "libpqwalreceiver already loaded");
  	walrcv_connect = libpqrcv_connect;
  	walrcv_receive = libpqrcv_receive;
+ 	walrcv_send = libpqrcv_send;
  	walrcv_disconnect = libpqrcv_disconnect;
  }
  
***************
*** 75,98 **** _PG_init(void)
   * Establish the connection to the primary server for XLOG streaming
   */
  static bool
! libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  {
! 	char		conninfo_repl[MAXCONNINFO + 37];
  	char	   *primary_sysid;
  	char		standby_sysid[32];
  	TimeLineID	primary_tli;
  	TimeLineID	standby_tli;
  	PGresult   *res;
  	char		cmd[64];
  
  	/*
! 	 * Connect using deliberately undocumented parameter: replication. The
! 	 * database name is ignored by the server in replication mode, but specify
! 	 * "replication" for .pgpass lookup.
  	 */
! 	snprintf(conninfo_repl, sizeof(conninfo_repl),
! 			 "%s dbname=replication replication=true",
! 			 conninfo);
  
  	streamConn = PQconnectdb(conninfo_repl);
  	if (PQstatus(streamConn) != CONNECTION_OK)
--- 78,107 ----
   * Establish the connection to the primary server for XLOG streaming
   */
  static bool
! libpqrcv_connect(char *conninfo, XLogRecPtr startpoint, char *standbyName)
  {
! 	char		conninfo_repl[MAXCONNINFO + MAXSTANDBYNAME + 51];
  	char	   *primary_sysid;
  	char		standby_sysid[32];
  	TimeLineID	primary_tli;
  	TimeLineID	standby_tli;
+ 	char	   *primary_rplMode;
  	PGresult   *res;
  	char		cmd[64];
  
  	/*
! 	 * Connect using deliberately undocumented parameters: replication
! 	 * and standby_name. The database name is ignored by the server in
! 	 * replication mode, but specify "replication" for .pgpass lookup.
  	 */
! 	if (standbyName)
! 		snprintf(conninfo_repl, sizeof(conninfo_repl),
! 				 "%s dbname=replication replication=true standby_name=%s",
! 				 conninfo, standbyName);
! 	else
! 		snprintf(conninfo_repl, sizeof(conninfo_repl),
! 				 "%s dbname=replication replication=true",
! 				 conninfo);
  
  	streamConn = PQconnectdb(conninfo_repl);
  	if (PQstatus(streamConn) != CONNECTION_OK)
***************
*** 109,119 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  	{
  		PQclear(res);
  		ereport(ERROR,
! 				(errmsg("could not receive database system identifier and timeline ID from "
! 						"the primary server: %s",
  						PQerrorMessage(streamConn))));
  	}
! 	if (PQnfields(res) != 2 || PQntuples(res) != 1)
  	{
  		int			ntuples = PQntuples(res);
  		int			nfields = PQnfields(res);
--- 118,128 ----
  	{
  		PQclear(res);
  		ereport(ERROR,
! 				(errmsg("could not receive database system identifier, timeline ID and "
! 						"replication mode from the primary server: %s",
  						PQerrorMessage(streamConn))));
  	}
! 	if (PQnfields(res) != 3 || PQntuples(res) != 1)
  	{
  		int			ntuples = PQntuples(res);
  		int			nfields = PQnfields(res);
***************
*** 121,131 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  		PQclear(res);
  		ereport(ERROR,
  				(errmsg("invalid response from primary server"),
! 				 errdetail("Expected 1 tuple with 2 fields, got %d tuples with %d fields.",
  						   ntuples, nfields)));
  	}
  	primary_sysid = PQgetvalue(res, 0, 0);
  	primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);
  
  	/*
  	 * Confirm that the system identifier of the primary is the same as ours.
--- 130,141 ----
  		PQclear(res);
  		ereport(ERROR,
  				(errmsg("invalid response from primary server"),
! 				 errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
  						   ntuples, nfields)));
  	}
  	primary_sysid = PQgetvalue(res, 0, 0);
  	primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);
+ 	primary_rplMode = PQgetvalue(res, 0, 2);
  
  	/*
  	 * Confirm that the system identifier of the primary is the same as ours.
***************
*** 146,158 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  	 * recovery target timeline.
  	 */
  	standby_tli = GetRecoveryTargetTLI();
- 	PQclear(res);
  	if (primary_tli != standby_tli)
  		ereport(ERROR,
  				(errmsg("timeline %u of the primary does not match recovery target timeline %u",
  						primary_tli, standby_tli)));
  	ThisTimeLineID = primary_tli;
  
  	/* Start streaming from the point requested by startup process */
  	snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
--- 156,180 ----
  	 * recovery target timeline.
  	 */
  	standby_tli = GetRecoveryTargetTLI();
  	if (primary_tli != standby_tli)
+ 	{
+ 		PQclear(res);
  		ereport(ERROR,
  				(errmsg("timeline %u of the primary does not match recovery target timeline %u",
  						primary_tli, standby_tli)));
+ 	}
  	ThisTimeLineID = primary_tli;
  
+ 	/*
+ 	 * Confirm that the passed replication mode is valid.
+ 	 */
+ 	rplMode = ReplicationModeNameGetValue(primary_rplMode);
+ 	if (rplMode == InvalidReplicationMode)
+ 		ereport(ERROR,
+ 				(errmsg("invalid replication mode \"%s\"",
+ 						primary_rplMode)));
+ 	PQclear(res);	/* primary_rplMode points into res, so free it last */
+ 
  	/* Start streaming from the point requested by startup process */
  	snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
***************
*** 398,400 **** libpqrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
--- 420,437 ----
  
  	return true;
  }
+ 
+ /*
+  * Send a message to the XLOG stream.
+  *
+  * ereports on error.
+  */
+ static void
+ libpqrcv_send(const char *buffer, int nbytes)
+ {
+ 	if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
+ 		PQflush(streamConn))
+ 		ereport(ERROR,
+ 				(errmsg("could not send data to WAL stream: %s",
+ 						PQerrorMessage(streamConn))));
+ }
*** /dev/null
--- b/src/backend/replication/standbys.conf.sample
***************
*** 0 ****
--- 1,35 ----
+ # PostgreSQL Standbys Configuration File
+ # ===================================================
+ #
+ # Refer to the "Streaming Replication" section in the PostgreSQL
+ # documentation for a complete description of this file.  A short
+ # synopsis follows.
+ #
+ # This file controls which replication mode each standby uses.
+ # Records are of the form:
+ #
+ # STANDBY-NAME  REPLICATION-MODE
+ #
+ # (The uppercase items must be replaced by actual values.)
+ #
+ # STANDBY-NAME can be "all", a standby name, or a comma-separated list
+ # thereof.
+ #
+ # REPLICATION-MODE specifies how long transaction commit waits for
+ # replication before the commit command returns a "success" to a
+ # client. The valid modes are "async", "recv", "fsync" and "replay".
+ #
+ # Standby names containing spaces, commas, quotes and other special
+ # characters must be quoted.  Quoting the keyword "all" makes it lose
+ # its special meaning, so that it matches only a standby with that
+ # name.
+ #
+ # This file is read on server startup and when the postmaster receives
+ # a SIGHUP signal.  If you edit the file on a running system, you have
+ # to SIGHUP the postmaster for the changes to take effect.  You can
+ # use "pg_ctl reload" to do that.
+ 
+ # Put your actual configuration here
+ # ----------------------------------
+ 
+ # STANDBY-NAME       REPLICATION-MODE
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 57,62 **** bool		am_walreceiver;
--- 57,63 ----
  /* libpqreceiver hooks to these when loaded */
  walrcv_connect_type walrcv_connect = NULL;
  walrcv_receive_type walrcv_receive = NULL;
+ walrcv_send_type walrcv_send = NULL;
  walrcv_disconnect_type walrcv_disconnect = NULL;
  
  #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
***************
*** 113,118 **** static void WalRcvDie(int code, Datum arg);
--- 114,120 ----
  static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
  static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
  static void XLogWalRcvFlush(void);
+ static void XLogWalRcvSendRecPtr(XLogRecPtr recptr);
  
  /* Signal handlers */
  static void WalRcvSigHupHandler(SIGNAL_ARGS);
***************
*** 158,164 **** void
--- 160,168 ----
  WalReceiverMain(void)
  {
  	char		conninfo[MAXCONNINFO];
+ 	char		standbyName[MAXSTANDBYNAME];
  	XLogRecPtr	startpoint;
+ 	XLogRecPtr	ackedpoint = {0, 0};
  
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
***************
*** 206,211 **** WalReceiverMain(void)
--- 210,216 ----
  
  	/* Fetch information required to start streaming */
  	strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
+ 	strlcpy(standbyName, (char *) walrcv->standbyName, MAXSTANDBYNAME);
  	startpoint = walrcv->receivedUpto;
  	SpinLockRelease(&walrcv->mutex);
  
***************
*** 247,253 **** WalReceiverMain(void)
  	/* Load the libpq-specific functions */
  	load_file("libpqwalreceiver", false);
  	if (walrcv_connect == NULL || walrcv_receive == NULL ||
! 		walrcv_disconnect == NULL)
  		elog(ERROR, "libpqwalreceiver didn't initialize correctly");
  
  	/*
--- 252,258 ----
  	/* Load the libpq-specific functions */
  	load_file("libpqwalreceiver", false);
  	if (walrcv_connect == NULL || walrcv_receive == NULL ||
! 		walrcv_send == NULL || walrcv_disconnect == NULL)
  		elog(ERROR, "libpqwalreceiver didn't initialize correctly");
  
  	/*
***************
*** 261,267 **** WalReceiverMain(void)
  
  	/* Establish the connection to the primary for XLOG streaming */
  	EnableWalRcvImmediateExit();
! 	walrcv_connect(conninfo, startpoint);
  	DisableWalRcvImmediateExit();
  
  	/* Loop until end-of-streaming or error */
--- 266,272 ----
  
  	/* Establish the connection to the primary for XLOG streaming */
  	EnableWalRcvImmediateExit();
! 	walrcv_connect(conninfo, startpoint, standbyName);
  	DisableWalRcvImmediateExit();
  
  	/* Loop until end-of-streaming or error */
***************
*** 311,316 **** WalReceiverMain(void)
--- 316,340 ----
  			 */
  			XLogWalRcvFlush();
  		}
+ 
+ 		/*
+ 		 * If replication_mode is "replay", send the last WAL replay location
+ 		 * to the primary, to acknowledge that replication has been completed
+ 		 * up to that. This occurs only when WAL records were replayed since
+ 		 * the last acknowledgement.
+ 		 */
+ 		if (rplMode == REPLICATION_MODE_REPLAY &&
+ 			XLByteLT(ackedpoint, LogstreamResult.Flush))
+ 		{
+ 			XLogRecPtr	recptr;
+ 
+ 			recptr = GetReplayRecPtr();
+ 			if (XLByteLT(ackedpoint, recptr))
+ 			{
+ 				XLogWalRcvSendRecPtr(recptr);
+ 				ackedpoint = recptr;
+ 			}
+ 		}
  	}
  }
  
***************
*** 406,411 **** XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
--- 430,448 ----
  				buf += sizeof(WalDataMessageHeader);
  				len -= sizeof(WalDataMessageHeader);
  
+ 				/*
+ 				 * If replication_mode is "recv", send the last WAL receive
+ 				 * location to the primary, to acknowledge that replication
+ 				 * has been completed up to that.
+ 				 */
+ 				if (rplMode == REPLICATION_MODE_RECV)
+ 				{
+ 					XLogRecPtr	endptr = msghdr.dataStart;
+ 
+ 					XLByteAdvance(endptr, len);
+ 					XLogWalRcvSendRecPtr(endptr);
+ 				}
+ 
  				XLogWalRcvWrite(buf, len, msghdr.dataStart);
  				break;
  			}
***************
*** 523,528 **** XLogWalRcvFlush(void)
--- 560,573 ----
  
  		LogstreamResult.Flush = LogstreamResult.Write;
  
+ 		/*
+ 		 * If replication_mode is "fsync", send the last WAL flush
+ 		 * location to the primary, to acknowledge that replication
+ 		 * has been completed up to that.
+ 		 */
+ 		if (rplMode == REPLICATION_MODE_FSYNC)
+ 			XLogWalRcvSendRecPtr(LogstreamResult.Flush);
+ 
  		/* Update shared-memory status */
  		SpinLockAcquire(&walrcv->mutex);
  		walrcv->latestChunkStart = walrcv->receivedUpto;
***************
*** 541,543 **** XLogWalRcvFlush(void)
--- 586,609 ----
  		}
  	}
  }
+ 
+ /* Send the lsn to the primary server */
+ static void
+ XLogWalRcvSendRecPtr(XLogRecPtr recptr)
+ {
+ 	static char	   *msgbuf = NULL;
+ 	WalAckMessageData	msgdata;
+ 
+ 	/*
+ 	 * Allocate buffer that will be used for each output message if first
+ 	 * time through.  We do this just once to reduce palloc overhead.
+ 	 * The buffer must be made large enough for maximum-sized messages.
+ 	 */
+ 	if (msgbuf == NULL)
+ 		msgbuf = palloc(1 + sizeof(WalAckMessageData));
+ 
+ 	msgbuf[0] = 'l';
+ 	msgdata.ackEnd = recptr;
+ 	memcpy(msgbuf + 1, &msgdata, sizeof(WalAckMessageData));
+ 	walrcv_send(msgbuf, 1 + sizeof(WalAckMessageData));
+ }
*** a/src/backend/replication/walreceiverfuncs.c
--- b/src/backend/replication/walreceiverfuncs.c
***************
*** 168,178 **** ShutdownWalRcv(void)
  /*
   * Request postmaster to start walreceiver.
   *
!  * recptr indicates the position where streaming should begin, and conninfo
!  * is a libpq connection string to use.
   */
  void
! RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
--- 168,180 ----
  /*
   * Request postmaster to start walreceiver.
   *
!  * recptr indicates the position where streaming should begin, conninfo
!  * is a libpq connection string to use, and standbyName is the name of
!  * this standby.
   */
  void
! RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo,
! 					 const char *standbyName)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
***************
*** 196,201 **** RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
--- 198,207 ----
  		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
  	else
  		walrcv->conninfo[0] = '\0';
+ 	if (standbyName != NULL)
+ 		strlcpy((char *) walrcv->standbyName, standbyName, MAXSTANDBYNAME);
+ 	else
+ 		walrcv->standbyName[0] = '\0';
  	walrcv->walRcvState = WALRCV_STARTING;
  	walrcv->startTime = now;
  
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 38,43 ****
--- 38,44 ----
  
  #include "access/xlog_internal.h"
  #include "catalog/pg_type.h"
+ #include "libpq/hba.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "libpq/pqsignal.h"
***************
*** 61,66 **** static WalSnd *MyWalSnd = NULL;
--- 62,68 ----
  
  /* Global state */
  bool		am_walsender = false;		/* Am I a walsender process ? */
+ char	   *standby_name = NULL;		/* Name of connected standby */
  
  /* User-settable parameters for walsender */
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
***************
*** 84,94 **** static uint32 sendOff = 0;
--- 86,114 ----
   */
  static XLogRecPtr sentPtr = {0, 0};
  
+ /*
+  * How far have we completed replication already? This is also
+  * advertised in MyWalSnd->ackdPtr. This is not used in the asynchronous
+  * replication case.
+  */
+ static XLogRecPtr ackdPtr = {0, 0};
+ 
  /* Flags set by signal handlers for later service in main loop */
  static volatile sig_atomic_t got_SIGHUP = false;
  static volatile sig_atomic_t shutdown_requested = false;
  static volatile sig_atomic_t ready_to_stop = false;
  
+ /*
+  * pre-parsed content of standbys configuration file: list of
+  * StandbysLine structs
+  */
+ static List *parsed_standbys_lines = NIL;
+ 
+ static const char *standbys_keywords[] = {"all", NULL};
+ 
+ /* Path of standbys configuration file (relative to $PGDATA) */
+ #define STANDBYS_FILE		"standbys.conf"
+ 
  /* Signal handlers */
  static void WalSndSigHupHandler(SIGNAL_ARGS);
  static void WalSndShutdownHandler(SIGNAL_ARGS);
***************
*** 102,108 **** static void WalSndHandshake(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
  static bool XLogSend(char *msgbuf, bool *caughtup);
! static void CheckClosedConnection(void);
  
  
  /* Main entry point for walsender process */
--- 122,132 ----
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
  static bool XLogSend(char *msgbuf, bool *caughtup);
! static void ProcessStreamMsgs(StringInfo inMsg);
! 
! static bool parse_standbys_line(List *line, int line_num, StandbysLine *parsedline);
! static void free_standbys_record(StandbysLine *record);
! static void clean_standbys_list(List *lines);
  
  
  /* Main entry point for walsender process */
***************
*** 209,227 **** WalSndHandshake(void)
  						StringInfoData buf;
  						char		sysid[32];
  						char		tli[11];
  
  						/*
! 						 * Reply with a result set with one row, two columns.
! 						 * First col is system ID, and second is timeline ID
  						 */
  
  						snprintf(sysid, sizeof(sysid), UINT64_FORMAT,
  								 GetSystemIdentifier());
  						snprintf(tli, sizeof(tli), "%u", ThisTimeLineID);
  
  						/* Send a RowDescription message */
  						pq_beginmessage(&buf, 'T');
! 						pq_sendint(&buf, 2, 2); /* 2 fields */
  
  						/* first field */
  						pq_sendstring(&buf, "systemid");		/* col name */
--- 233,254 ----
  						StringInfoData buf;
  						char		sysid[32];
  						char		tli[11];
+ 						char		mode[8];
  
  						/*
! 						 * Reply with a result set with one row, three columns.
! 						 * First col is system ID, second is timeline ID, and
! 						 * third is replication mode.
  						 */
  
  						snprintf(sysid, sizeof(sysid), UINT64_FORMAT,
  								 GetSystemIdentifier());
  						snprintf(tli, sizeof(tli), "%u", ThisTimeLineID);
+ 						snprintf(mode, sizeof(mode), "%s", ReplicationModeNames[rplMode]);
  
  						/* Send a RowDescription message */
  						pq_beginmessage(&buf, 'T');
! 						pq_sendint(&buf, 3, 2); /* 3 fields */
  
  						/* first field */
  						pq_sendstring(&buf, "systemid");		/* col name */
***************
*** 240,254 **** WalSndHandshake(void)
  						pq_sendint(&buf, 4, 2); /* typlen */
  						pq_sendint(&buf, 0, 4); /* typmod */
  						pq_sendint(&buf, 0, 2); /* format code */
  						pq_endmessage(&buf);
  
  						/* Send a DataRow message */
  						pq_beginmessage(&buf, 'D');
! 						pq_sendint(&buf, 2, 2); /* # of columns */
  						pq_sendint(&buf, strlen(sysid), 4);		/* col1 len */
  						pq_sendbytes(&buf, (char *) &sysid, strlen(sysid));
  						pq_sendint(&buf, strlen(tli), 4);		/* col2 len */
  						pq_sendbytes(&buf, (char *) tli, strlen(tli));
  						pq_endmessage(&buf);
  
  						/* Send CommandComplete and ReadyForQuery messages */
--- 267,292 ----
  						pq_sendint(&buf, 4, 2); /* typlen */
  						pq_sendint(&buf, 0, 4); /* typmod */
  						pq_sendint(&buf, 0, 2); /* format code */
+ 
+ 						/* third field */
+ 						pq_sendstring(&buf, "replication_mode");	/* col name */
+ 						pq_sendint(&buf, 0, 4); /* table oid */
+ 						pq_sendint(&buf, 0, 2); /* attnum */
+ 						pq_sendint(&buf, TEXTOID, 4);	/* type oid */
+ 						pq_sendint(&buf, -1, 2);		/* typlen */
+ 						pq_sendint(&buf, 0, 4); /* typmod */
+ 						pq_sendint(&buf, 0, 2); /* format code */
  						pq_endmessage(&buf);
  
  						/* Send a DataRow message */
  						pq_beginmessage(&buf, 'D');
! 						pq_sendint(&buf, 3, 2); /* # of columns */
  						pq_sendint(&buf, strlen(sysid), 4);		/* col1 len */
  						pq_sendbytes(&buf, (char *) &sysid, strlen(sysid));
  						pq_sendint(&buf, strlen(tli), 4);		/* col2 len */
  						pq_sendbytes(&buf, (char *) tli, strlen(tli));
+ 						pq_sendint(&buf, strlen(mode), 4);	/* col3 len */
+ 						pq_sendbytes(&buf, (char *) &mode, strlen(mode));
  						pq_endmessage(&buf);
  
  						/* Send CommandComplete and ReadyForQuery messages */
***************
*** 286,295 **** WalSndHandshake(void)
  						pq_flush();
  
  						/*
! 						 * Initialize position to the received one, then the
  						 * xlog records begin to be shipped from that position
  						 */
! 						sentPtr = recptr;
  
  						/* break out of the loop */
  						replication_started = true;
--- 324,333 ----
  						pq_flush();
  
  						/*
! 						 * Initialize positions to the received one, then the
  						 * xlog records begin to be shipped from that position
  						 */
! 						sentPtr = ackdPtr = recptr;
  
  						/* break out of the loop */
  						replication_started = true;
***************
*** 323,375 **** WalSndHandshake(void)
  }
  
  /*
!  * Check if the remote end has closed the connection.
   */
  static void
! CheckClosedConnection(void)
  {
! 	unsigned char firstchar;
! 	int			r;
  
! 	r = pq_getbyte_if_available(&firstchar);
! 	if (r < 0)
! 	{
! 		/* unexpected error or EOF */
! 		ereport(COMMERROR,
! 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 				 errmsg("unexpected EOF on standby connection")));
! 		proc_exit(0);
! 	}
! 	if (r == 0)
  	{
! 		/* no data available without blocking */
! 		return;
! 	}
  
- 	/* Handle the very limited subset of commands expected in this phase */
- 	switch (firstchar)
- 	{
  			/*
  			 * 'X' means that the standby is closing down the socket.
  			 */
! 		case 'X':
! 			proc_exit(0);
  
! 		default:
! 			ereport(FATAL,
! 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 					 errmsg("invalid standby closing message type %d",
! 							firstchar)));
  	}
  }
  
  /* Main loop of walsender process */
  static int
  WalSndLoop(void)
  {
  	char	   *output_message;
  	bool		caughtup = false;
  
  	/*
  	 * Allocate buffer that will be used for each output message.  We do this
  	 * just once to reduce palloc overhead.  The buffer must be made large
--- 361,482 ----
  }
  
  /*
!  * Process messages received from the standby.
!  *
!  * ereports on error.
   */
  static void
! ProcessStreamMsgs(StringInfo inMsg)
  {
! 	bool	acked = false;
  
! 	/* Loop to process successive complete messages available */
! 	for (;;)
  	{
! 		unsigned char firstchar;
! 		int			r;
! 
! 		r = pq_getbyte_if_available(&firstchar);
! 		if (r < 0)
! 		{
! 			/* unexpected error or EOF */
! 			ereport(COMMERROR,
! 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 					 errmsg("unexpected EOF on standby connection")));
! 			proc_exit(0);
! 		}
! 		if (r == 0)
! 		{
! 			/* no data available without blocking */
! 			break;
! 		}
! 
! 		/* Handle the very limited subset of commands expected in this phase */
! 		switch (firstchar)
! 		{
! 			case 'd':       /* CopyData message */
! 			{
! 				unsigned char	rpltype;
! 
! 				/*
! 				 * Read the message contents. This is expected to be done without
! 				 * blocking because we've been able to get message type code.
! 				 */
! 				if (pq_getmessage(inMsg, 0))
! 					proc_exit(0);		/* suitable message already logged */
! 
! 				/* Read the replication message type from CopyData message */
! 				rpltype = pq_getmsgbyte(inMsg);
! 				switch (rpltype)
! 				{
! 					case 'l':
! 					{
! 						WalAckMessageData  *msgdata;
! 
! 						msgdata = (WalAckMessageData *) pq_getmsgbytes(inMsg, sizeof(WalAckMessageData));
! 
! 						/*
! 						 * Update local status.
! 						 *
! 						 * The ackd ptr received from standby should not
! 						 * go backwards.
! 						 */
! 						if (XLByteLE(ackdPtr, msgdata->ackEnd))
! 							ackdPtr = msgdata->ackEnd;
! 						else
! 							ereport(FATAL,
! 									(errmsg("replication completion location went back from "
! 											"%X/%X to %X/%X",
! 											ackdPtr.xlogid, ackdPtr.xrecoff,
! 											msgdata->ackEnd.xlogid, msgdata->ackEnd.xrecoff)));
! 
! 						acked = true;	/* also need to update shared position */
! 						break;
! 					}
! 					default:
! 						ereport(FATAL,
! 								(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 								 errmsg("invalid replication message type %d",
! 										rpltype)));
! 				}
! 				break;
! 			}
  
  			/*
  			 * 'X' means that the standby is closing down the socket.
  			 */
! 			case 'X':
! 				proc_exit(0);
  
! 			default:
! 				ereport(FATAL,
! 						(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 						 errmsg("invalid standby closing message type %d",
! 								firstchar)));
! 		}
  	}
+ 
+ 	if (acked)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile WalSnd *walsnd = MyWalSnd;
+ 
+ 		SpinLockAcquire(&walsnd->mutex);
+ 		walsnd->ackdPtr = ackdPtr;
+ 		SpinLockRelease(&walsnd->mutex);
+  	}
  }
  
  /* Main loop of walsender process */
  static int
  WalSndLoop(void)
  {
+ 	StringInfoData	input_message;
  	char	   *output_message;
  	bool		caughtup = false;
  
+ 	initStringInfo(&input_message);
+ 
  	/*
  	 * Allocate buffer that will be used for each output message.  We do this
  	 * just once to reduce palloc overhead.  The buffer must be made large
***************
*** 438,444 **** WalSndLoop(void)
  
  				/* Sleep and check that the connection is still alive */
  				pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
! 				CheckClosedConnection();
  
  				remain -= NAPTIME_PER_CYCLE;
  			}
--- 545,551 ----
  
  				/* Sleep and check that the connection is still alive */
  				pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
! 				ProcessStreamMsgs(&input_message);
  
  				remain -= NAPTIME_PER_CYCLE;
  			}
***************
*** 497,502 **** InitWalSnd(void)
--- 604,611 ----
  			MyWalSnd = (WalSnd *) walsnd;
  			walsnd->pid = MyProcPid;
  			MemSet(&MyWalSnd->sentPtr, 0, sizeof(XLogRecPtr));
+ 			MemSet(&MyWalSnd->ackdPtr, 0, sizeof(XLogRecPtr));
+ 			walsnd->rplMode = rplMode;
  			SpinLockRelease(&walsnd->mutex);
  			break;
  		}
***************
*** 523,528 **** WalSndKill(int code, Datum arg)
--- 632,638 ----
  	 * for this.
  	 */
  	MyWalSnd->pid = 0;
+ 	MyWalSnd->rplMode = InvalidReplicationMode;
  
  	/* WalSnd struct isn't mine anymore */
  	MyWalSnd = NULL;
***************
*** 896,938 **** WalSndShmemInit(void)
  }
  
  /*
!  * This isn't currently used for anything. Monitoring tools might be
!  * interested in the future, and we'll need something like this in the
!  * future for synchronous replication.
   */
! #ifdef NOT_USED
  /*
!  * Returns the oldest Send position among walsenders. Or InvalidXLogRecPtr
!  * if none.
   */
! XLogRecPtr
! GetOldestWALSendPointer(void)
  {
! 	XLogRecPtr	oldest = {0, 0};
! 	int			i;
! 	bool		found = false;
  
! 	for (i = 0; i < max_wal_senders; i++)
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! 		XLogRecPtr	recptr;
  
! 		if (walsnd->pid == 0)
! 			continue;
  
! 		SpinLockAcquire(&walsnd->mutex);
! 		recptr = walsnd->sentPtr;
! 		SpinLockRelease(&walsnd->mutex);
  
! 		if (recptr.xlogid == 0 && recptr.xrecoff == 0)
! 			continue;
  
! 		if (!found || XLByteLT(recptr, oldest))
! 			oldest = recptr;
! 		found = true;
  	}
! 	return oldest;
  }
  
! #endif
--- 1006,1253 ----
  }
  
  /*
!  * Ensure that all xlog records through the given position are
!  * replicated to the standby servers.
!  *
!  * XXX: We should replace the poll loop in this function with a latch.
!  */
! void
! WaitXLogSend(XLogRecPtr record)
! {
! 	Assert(max_wal_senders > 0);
! 
! 	/*
! 	 * XXX: We should track the number of currently connected standbys
! 	 * and skip waiting if it's zero.
! 	 */
! 
! 	for (;;)
! 	{
! 		int		i;
! 		bool	ackd = true;
! 
! 		for (i = 0; i < max_wal_senders; i++)
! 		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! 			XLogRecPtr		recptr;
! 
! 			/* Don't wait for unconnected and asynchronous standbys */
! 			if (walsnd->pid == 0 || walsnd->rplMode <= REPLICATION_MODE_ASYNC)
! 				continue;
! 
! 			SpinLockAcquire(&walsnd->mutex);
! 			recptr = walsnd->ackdPtr;
! 			SpinLockRelease(&walsnd->mutex);
! 
! 			if (XLByteLT(recptr, record))
! 			{
! 				ackd = false;
! 				break;
! 			}
! 		}
! 
! 		if (ackd)
! 			return;
! 
! 		pg_usleep(100000L);     /* 100ms */
! 	}
! }
! 
! 
! /* ----------
!  * Routines to handle standbys configuration file
!  * ----------
!  */
! 
! /*
!  * Scan the (pre-parsed) standbys configuration file line by line,
!  * looking for a match to the standby name passed from the standby.
   */
! bool
! check_standbys(void)
! {
! 	ListCell   *line;
! 	StandbysLine *standbys;
! 
! 	foreach(line, parsed_standbys_lines)
! 	{
! 		char	   *tok;
! 
! 		standbys = (StandbysLine *) lfirst(line);
! 
! 		/* Check standby name */
! 		for (tok = strtok(standbys->standbyName, MULTI_VALUE_SEP);
! 			 tok != NULL;
! 			 tok = strtok(NULL, MULTI_VALUE_SEP))
! 		{
! 			if (strcmp(tok, "all\n") == 0 ||
! 				(standby_name != NULL &&
! 				 strcmp(tok, standby_name) == 0))
! 			{
! 				rplMode = standbys->rplMode;
! 				return true;
! 			}
! 		}
! 	}
! 	return false;
! }
! 
  /*
!  * Parse one line in the standbys configuration file and store
!  * the result in a StandbysLine structure.
   */
! static bool
! parse_standbys_line(List *line, int line_num, StandbysLine *parsedline)
  {
! 	char	   *token;
! 	ListCell   *line_item;
  
! 	line_item = list_head(line);
! 
! 	parsedline->linenumber = line_num;
! 
! 	/* Get the standby name. */
! 	parsedline->standbyName = pstrdup(lfirst(line_item));
! 
! 	/* Get the mode. */
! 	line_item = lnext(line_item);
! 	if (!line_item)
  	{
! 		ereport(LOG,
! 				(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 				 errmsg("end-of-line before mode specification"),
! 				 errcontext("line %d of configuration file \"%s\"",
! 							line_num, STANDBYS_FILE)));
! 		return false;
! 	}
! 	token = lfirst(line_item);
  
! 	parsedline->rplMode = ReplicationModeNameGetValue(token);
! 	if (parsedline->rplMode == InvalidReplicationMode)
! 	{
! 		ereport(LOG,
! 				(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 				 errmsg("invalid replication mode \"%s\"",
! 						token),
! 				 errcontext("line %d of configuration file \"%s\"",
! 							line_num, STANDBYS_FILE)));
! 		return false;
! 	}
  
! 	/* Ignore remaining tokens */
  
! 	return true;
! }
! 
! /*
!  * Free a StandbysLine structure
!  */
! static void
! free_standbys_record(StandbysLine *record)
! {
! 	if (record->standbyName)
! 		pfree(record->standbyName);
! 	pfree(record);
! }
! 
! /*
!  * Free all records on the parsed Standbys list
!  */
! static void
! clean_standbys_list(List *lines)
! {
! 	ListCell   *line;
! 
! 	foreach(line, lines)
! 	{
! 		StandbysLine    *parsed = (StandbysLine *) lfirst(line);
  
! 		if (parsed)
! 			free_standbys_record(parsed);
  	}
! 	list_free(lines);
  }
  
! /*
!  * Read the config file and create a List of StandbysLine records for the contents.
!  *
!  * The configuration is read into a temporary list, and if any parse error occurs
!  * the old list is kept in place and false is returned. Only if the whole file
!  * parses Ok is the list replaced, and the function returns true.
!  */
! bool
! load_standbys(void)
! {
! 	FILE	   *file;
! 	List	   *standbys_lines = NIL;
! 	List	   *standbys_line_nums = NIL;
! 	ListCell   *line,
! 			   *line_num;
! 	List	   *new_parsed_lines = NIL;
! 	bool		ok = true;
! 
! 	/* Ignore standbys.conf if replication is not enabled */
! 	if (max_wal_senders <= 0)
! 		return true;
! 
! 	file = AllocateFile(STANDBYS_FILE, "r");
! 	if (file == NULL)
! 	{
! 		ereport(LOG,
! 				(errcode_for_file_access(),
! 				 errmsg("could not open configuration file \"%s\": %m",
! 						STANDBYS_FILE)));
! 
! 		/*
! 		 * Caller will take care of making this a FATAL error in case this is
! 		 * the initial startup. If it happens on reload, we just keep the old
! 		 * version around.
! 		 */
! 		return false;
! 	}
! 
! 	tokenize_file(STANDBYS_FILE, file, &standbys_lines, &standbys_line_nums,
! 				  standbys_keywords);
! 	FreeFile(file);
! 
! 	/* Now parse all the lines */
! 	forboth(line, standbys_lines, line_num, standbys_line_nums)
! 	{
! 		StandbysLine    *newline;
! 
! 		newline = palloc0(sizeof(StandbysLine));
! 
! 		if (!parse_standbys_line(lfirst(line), lfirst_int(line_num), newline))
! 		{
! 			/* Parse error in the file, so indicate there's a problem */
! 			free_standbys_record(newline);
! 			ok = false;
! 
! 			/*
! 			 * Keep parsing the rest of the file so we can report errors on
! 			 * more than the first row. Error has already been reported in the
! 			 * parsing function, so no need to log it here.
! 			 */
! 			continue;
! 		}
! 
! 		new_parsed_lines = lappend(new_parsed_lines, newline);
! 	}
! 
! 	/* Free the temporary lists */
! 	free_lines(&standbys_lines, &standbys_line_nums);
! 
! 	if (!ok)
! 	{
! 		/* Parsing failed at one or more rows, so bail out */
! 		clean_standbys_list(new_parsed_lines);
! 		return false;
! 	}
! 
! 	/* Loaded new file successfully, replace the one we use */
! 	clean_standbys_list(parsed_standbys_lines);
! 	parsed_standbys_lines = new_parsed_lines;
! 
! 	return true;
! }
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***************
*** 661,666 **** InitPostgres(const char *in_dbname, Oid dboid, const char *username,
--- 661,688 ----
  			ereport(FATAL,
  					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  					 errmsg("must be superuser to start walsender")));
+ 
+ 		/*
+ 		 * In EXEC_BACKEND case, we didn't inherit the contents of standbys.conf
+ 		 * etcetera from the postmaster, and have to load them ourselves.  Note we
+ 		 * are loading them into the startup transaction's memory context, not
+ 		 * PostmasterContext, but that shouldn't matter.
+ 		 *
+ 		 * FIXME: [fork/exec] Ugh.	Is there a way around this overhead?
+ 		 */
+ #ifdef EXEC_BACKEND
+ 		if (!load_standbys())
+ 		{
+ 			ereport(FATAL,
+ 					(errmsg("could not load standbys.conf")));
+ 		}
+ #endif
+ 
+ 		if (!check_standbys())
+ 			ereport(FATAL,
+ 					(errmsg("no standbys.conf entry for standby name \"%s\"",
+ 							standby_name)));
+ 
  		/* report this backend in the PgBackendStatus array */
  		pgstat_bestart();
  		/* close the transaction we started above */
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 189,194 **** typedef enum
--- 189,229 ----
  
  extern XLogRecPtr XactLastRecEnd;
  
+ /*
+  * Replication mode. This is used to identify how long transaction
+  * commit should wait for replication.
+  *
+  * REPLICATION_MODE_ASYNC doesn't make transaction commit wait for
+  * replication, i.e., asynchronous replication.
+  *
+  * REPLICATION_MODE_RECV makes transaction commit wait for XLOG
+  * records to be received on the standby.
+  *
+  * REPLICATION_MODE_FSYNC makes transaction commit wait for XLOG
+  * records to be received and fsync'd on the standby.
+  *
+  * REPLICATION_MODE_REPLAY makes transaction commit wait for XLOG
+  * records to be received, fsync'd and replayed on the standby.
+  */
+ typedef enum ReplicationMode
+ {
+ 	InvalidReplicationMode = -1,
+ 	REPLICATION_MODE_ASYNC = 0,
+ 	REPLICATION_MODE_RECV,
+ 	REPLICATION_MODE_FSYNC,
+ 	REPLICATION_MODE_REPLAY
+ 
+ 	/*
+ 	 * NOTE: if you add a new mode, change MAXREPLICATIONMODE below
+ 	 * and update the ReplicationModeNames array in xlog.c
+ 	 */
+ } ReplicationMode;
+ 
+ #define MAXREPLICATIONMODE		REPLICATION_MODE_REPLAY
+ 
+ extern const char *ReplicationModeNames[];
+ extern ReplicationMode	rplMode;
+ 
  /* these variables are GUC parameters related to XLOG */
  extern int	CheckPointSegments;
  extern int	wal_keep_segments;
***************
*** 298,307 **** extern void XLogPutNextOid(Oid nextOid);
--- 333,345 ----
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);
+ extern XLogRecPtr GetReplayRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  extern TimeLineID GetRecoveryTargetTLI(void);
  
  extern void HandleStartupProcInterrupts(void);
  extern void StartupProcessMain(void);
  
+ extern ReplicationMode ReplicationModeNameGetValue(char *name);
+ 
  #endif   /* XLOG_H */
*** a/src/include/libpq/hba.h
--- b/src/include/libpq/hba.h
***************
*** 15,20 ****
--- 15,24 ----
  #include "libpq/pqcomm.h"
  
  
+ /* This is used to separate values in multi-valued column strings */
+ #define MULTI_VALUE_SEP "\001"
+ 
+ 
  typedef enum UserAuth
  {
  	uaReject,
***************
*** 89,93 **** extern int check_usermap(const char *usermap_name,
--- 93,100 ----
  			  const char *pg_role, const char *auth_user,
  			  bool case_sensitive);
  extern bool pg_isblank(const char c);
+ extern void tokenize_file(const char *filename, FILE *file,
+ 			  List **lines, List **line_nums, const char **keywords);
+ extern void free_lines(List **lines, List **line_nums);
  
  #endif   /* HBA_H */
*** a/src/include/replication/walprotocol.h
--- b/src/include/replication/walprotocol.h
***************
*** 50,53 **** typedef struct
--- 50,63 ----
   */
  #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
  
+ /*
+  * Body for a WAL acknowledgment message (message type 'l'). This is wrapped
+  * within a CopyData message at the FE/BE protocol level.
+  */
+ typedef struct
+ {
+ 	/* End of WAL replicated to the standby */
+ 	XLogRecPtr	ackEnd;
+ } WalAckMessageData;
+ 
  #endif   /* _WALPROTOCOL_H */
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 26,31 **** extern bool am_walreceiver;
--- 26,38 ----
  #define MAXCONNINFO		1024
  
  /*
+  * MAXSTANDBYNAME: maximum size of standby name.
+  *
+  * XXX: Should this move to pg_config_manual.h?
+  */
+ #define MAXSTANDBYNAME	64
+ 
+ /*
   * Values for WalRcv->walRcvState.
   */
  typedef enum
***************
*** 71,89 **** typedef struct
  	 */
  	char		conninfo[MAXCONNINFO];
  
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
  
  extern WalRcvData *WalRcv;
  
  /* libpqwalreceiver hooks */
! typedef bool (*walrcv_connect_type) (char *conninfo, XLogRecPtr startpoint);
  extern PGDLLIMPORT walrcv_connect_type walrcv_connect;
  
  typedef bool (*walrcv_receive_type) (int timeout, unsigned char *type,
  												 char **buffer, int *len);
  extern PGDLLIMPORT walrcv_receive_type walrcv_receive;
  
  typedef void (*walrcv_disconnect_type) (void);
  extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
  
--- 78,106 ----
  	 */
  	char		conninfo[MAXCONNINFO];
  
+ 	/*
+ 	 * standby name; is used for the master to determine replication mode
+ 	 * from standbys configuration file.
+ 	 */
+ 	char		standbyName[MAXSTANDBYNAME];
+ 
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
  
  extern WalRcvData *WalRcv;
  
  /* libpqwalreceiver hooks */
! typedef bool (*walrcv_connect_type) (char *conninfo, XLogRecPtr startpoint,
! 									 char *standbyName);
  extern PGDLLIMPORT walrcv_connect_type walrcv_connect;
  
  typedef bool (*walrcv_receive_type) (int timeout, unsigned char *type,
  												 char **buffer, int *len);
  extern PGDLLIMPORT walrcv_receive_type walrcv_receive;
  
+ typedef void (*walrcv_send_type) (const char *buffer, int nbytes);
+ extern PGDLLIMPORT walrcv_send_type walrcv_send;
+ 
  typedef void (*walrcv_disconnect_type) (void);
  extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
  
***************
*** 93,99 **** extern void WalRcvShmemInit(void);
  extern void ShutdownWalRcv(void);
  extern bool WalRcvInProgress(void);
  extern XLogRecPtr WaitNextXLogAvailable(XLogRecPtr recptr, bool *finished);
! extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo);
  extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
  
  #endif   /* _WALRECEIVER_H */
--- 110,117 ----
  extern void ShutdownWalRcv(void);
  extern bool WalRcvInProgress(void);
  extern XLogRecPtr WaitNextXLogAvailable(XLogRecPtr recptr, bool *finished);
! extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo,
! 								 const char *standbyName);
  extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
  
  #endif   /* _WALRECEIVER_H */
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 22,27 **** typedef struct WalSnd
--- 22,30 ----
  {
  	pid_t		pid;			/* this walsender's process id, or 0 */
  	XLogRecPtr	sentPtr;		/* WAL has been sent up to this point */
+ 	XLogRecPtr	ackdPtr;		/* WAL has been replicated up to this point */
+ 
+ 	ReplicationMode	rplMode;	/* replication mode */
  
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalSnd;
***************
*** 36,49 **** extern WalSndCtlData *WalSndCtl;
--- 39,65 ----
  
  /* global state */
  extern bool am_walsender;
+ extern char *standby_name;
  
  /* user-settable parameters */
  extern int	WalSndDelay;
  extern int	max_wal_senders;
  
+ /* struct definition for standbys configuration file */
+ typedef struct
+ {
+ 	int			linenumber;
+ 	char	   *standbyName;
+ 	ReplicationMode	rplMode;
+ } StandbysLine;
+ 
  extern int	WalSenderMain(void);
  extern void WalSndSignals(void);
  extern Size WalSndShmemSize(void);
  extern void WalSndShmemInit(void);
+ extern void WaitXLogSend(XLogRecPtr record);
+ 
+ extern bool check_standbys(void);
+ extern bool load_standbys(void);
  
  #endif   /* _WALSENDER_H */
*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
***************
*** 254,259 **** static const PQconninfoOption PQconninfoOptions[] = {
--- 254,262 ----
  	{"replication", NULL, NULL, NULL,
  	"Replication", "D", 5},
  
+ 	{"standby_name", NULL, NULL, NULL,
+ 	"Standby-Name", "D", 64},
+ 
  	/* Terminating entry --- MUST BE LAST */
  	{NULL, NULL, NULL, NULL,
  	NULL, NULL, 0}
***************
*** 613,618 **** fillPGconn(PGconn *conn, PQconninfoOption *connOptions)
--- 616,623 ----
  #endif
  	tmp = conninfo_getval(connOptions, "replication");
  	conn->replication = tmp ? strdup(tmp) : NULL;
+ 	tmp = conninfo_getval(connOptions, "standby_name");
+ 	conn->standbyName = tmp ? strdup(tmp) : NULL;
  }
  
  /*
***************
*** 2622,2627 **** freePGconn(PGconn *conn)
--- 2627,2634 ----
  		free(conn->dbName);
  	if (conn->replication)
  		free(conn->replication);
+ 	if (conn->standbyName)
+ 		free(conn->standbyName);
  	if (conn->pguser)
  		free(conn->pguser);
  	if (conn->pgpass)
*** a/src/interfaces/libpq/fe-exec.c
--- b/src/interfaces/libpq/fe-exec.c
***************
*** 2002,2007 **** PQnotifies(PGconn *conn)
--- 2002,2010 ----
  /*
   * PQputCopyData - send some data to the backend during COPY IN
   *
+  * This function can be called by walreceiver even during COPY OUT
+  * to send a message to the master.
+  *
   * Returns 1 if successful, 0 if data could not be sent (only possible
   * in nonblock mode), or -1 if an error occurs.
   */
***************
*** 2010,2016 **** PQputCopyData(PGconn *conn, const char *buffer, int nbytes)
  {
  	if (!conn)
  		return -1;
! 	if (conn->asyncStatus != PGASYNC_COPY_IN)
  	{
  		printfPQExpBuffer(&conn->errorMessage,
  						  libpq_gettext("no COPY in progress\n"));
--- 2013,2020 ----
  {
  	if (!conn)
  		return -1;
! 	if (conn->asyncStatus != PGASYNC_COPY_IN &&
! 		conn->asyncStatus != PGASYNC_COPY_OUT)
  	{
  		printfPQExpBuffer(&conn->errorMessage,
  						  libpq_gettext("no COPY in progress\n"));
*** a/src/interfaces/libpq/fe-protocol3.c
--- b/src/interfaces/libpq/fe-protocol3.c
***************
*** 1911,1916 **** build_startup_packet(const PGconn *conn, char *packet,
--- 1911,1918 ----
  		ADD_STARTUP_OPTION("database", conn->dbName);
  	if (conn->replication && conn->replication[0])
  		ADD_STARTUP_OPTION("replication", conn->replication);
+ 	if (conn->standbyName && conn->standbyName[0])
+ 		ADD_STARTUP_OPTION("standby_name", conn->standbyName);
  	if (conn->pgoptions && conn->pgoptions[0])
  		ADD_STARTUP_OPTION("options", conn->pgoptions);
  	if (conn->send_appname)
*** a/src/interfaces/libpq/libpq-int.h
--- b/src/interfaces/libpq/libpq-int.h
***************
*** 297,302 **** struct pg_conn
--- 297,303 ----
  	char	   *fbappname;		/* fallback application name */
  	char	   *dbName;			/* database name */
  	char	   *replication;	/* connect as the replication standby? */
+ 	char	   *standbyName;	/* standby name */
  	char	   *pguser;			/* Postgres username and password, if any */
  	char	   *pgpass;
  	char	   *keepalives;		/* use TCP keepalives? */

#53

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Fujii Masao (#52)

Re: Synchronous replication - patch status inquiry

On Fri, Sep 10, 2010 at 11:52 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

The attached patch implements the above and simple synchronous replication
feature, which doesn't include quorum commit capability. The replication
mode (async, recv, fsync, replay) can be specified on a per-standby basis,
in standbys.conf.
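
The four modes can be pictured as an ordered enumeration, much like the ReplicationMode enum the patch adds to xlog.h. What follows is only a minimal standalone sketch: every Sketch*/sketch_* name is hypothetical, and the exact, case-sensitive name match is an assumption rather than a statement about what ReplicationModeNameGetValue() does in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the patch's ReplicationMode ordering. */
typedef enum SketchMode
{
	SKETCH_MODE_INVALID = -1,
	SKETCH_MODE_ASYNC = 0,		/* commit does not wait */
	SKETCH_MODE_RECV,			/* wait until WAL is received */
	SKETCH_MODE_FSYNC,			/* wait until WAL is received and fsync'd */
	SKETCH_MODE_REPLAY			/* wait until WAL is also replayed */
} SketchMode;

static const char *const sketch_mode_names[] = {
	"async", "recv", "fsync", "replay"
};

#define SKETCH_NUM_MODES \
	(sizeof(sketch_mode_names) / sizeof(sketch_mode_names[0]))

/* Map a mode name to its value (assumption: exact, case-sensitive match). */
static SketchMode
sketch_mode_from_name(const char *name)
{
	size_t		i;

	if (name == NULL)
		return SKETCH_MODE_INVALID;
	for (i = 0; i < SKETCH_NUM_MODES; i++)
	{
		if (strcmp(name, sketch_mode_names[i]) == 0)
			return (SketchMode) i;
	}
	return SKETCH_MODE_INVALID;
}

/* Map a valid mode back to its name. */
static const char *
sketch_mode_name(SketchMode mode)
{
	assert(mode >= SKETCH_MODE_ASYNC && mode <= SKETCH_MODE_REPLAY);
	return sketch_mode_names[mode];
}

/* Commit waits only for standbys whose mode is stricter than async. */
static int
sketch_commit_waits(SketchMode mode)
{
	return mode > SKETCH_MODE_ASYNC;
}
```

Keeping the modes ordered from weakest to strongest lets a single comparison separate asynchronous standbys from the synchronous ones, which is the same test WaitXLogSend() applies to rplMode in the patch.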

The patch still uses a poll loop in the backend, walsender, startup process
and walreceiver. Once the latch feature Heikki proposed has been committed,
I'll replace the poll loops with latches.
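
The condition that the backend's poll loop re-checks on each iteration can be sketched roughly as below. This is a simplified standalone illustration of the test in WaitXLogSend(), not the patch code itself: the Sketch* types are hypothetical stand-ins for WalSnd and XLogRecPtr, and the shared-memory spinlock is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two-part XLogRecPtr of that era. */
typedef struct SketchLsn
{
	uint32_t	xlogid;			/* log file number */
	uint32_t	xrecoff;		/* byte offset within the log file */
} SketchLsn;

/* Hypothetical, simplified view of one walsender slot. */
typedef struct SketchStandby
{
	int			connected;		/* nonzero if a walsender owns the slot */
	int			mode;			/* 0 = async, > 0 = some synchronous mode */
	SketchLsn	acked;			/* last position acknowledged by standby */
} SketchStandby;

/* Returns nonzero if a is strictly before b. */
static int
sketch_lsn_lt(SketchLsn a, SketchLsn b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff < b.xrecoff);
}

/*
 * Returns nonzero once every connected synchronous standby has
 * acknowledged WAL up to 'target'.  Unconnected and asynchronous
 * standbys are skipped, so the result is also nonzero when no
 * synchronous standby is connected at all.
 */
static int
sketch_all_acked(const SketchStandby *standbys, size_t n, SketchLsn target)
{
	size_t		i;

	assert(standbys != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		if (!standbys[i].connected || standbys[i].mode <= 0)
			continue;
		if (sketch_lsn_lt(standbys[i].acked, target))
			return 0;
	}
	return 1;
}
```

With a poll loop the backend sleeps and re-evaluates this condition periodically; with a latch it would instead be woken when a walsender advances its acknowledged position.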

The documentation has not been fully updated yet. I'll work on it until
the deadline of the next CF.

BTW, the latest code is available in my git repository too:

git://git.postgresql.org/git/users/fujii/postgres.git
branch: synchrep

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#54

David Fetter

david@fetter.org

over 15 years ago

In reply to: Fujii Masao (#52)

Re: Synchronous replication - patch status inquiry

On Fri, Sep 10, 2010 at 11:52:20AM +0900, Fujii Masao wrote:

On Fri, Sep 3, 2010 at 3:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Here is the proposed detailed design:

standbys.conf
=============
# This is not initialized by initdb, so users need to create it under $PGDATA.
    * The template is located in the PREFIX/share directory.

# This is read by postmaster at startup, just as pg_hba.conf is.
    * In EXEC_BACKEND environment, each walsender must read it at startup.
    * This is ignored when max_wal_senders is zero.
    * FATAL is emitted when standbys.conf doesn't exist while max_wal_senders
      is positive.

# SIGHUP makes only postmaster re-read the standbys.conf.
    * New configuration doesn't affect the existing connections to the standbys,
      i.e., it's used only for subsequent connections.
    * XXX: Should the existing connections react to new configuration? What if
      new standbys.conf doesn't have the standby_name of the existing
      connection?

# The connection from the standby is rejected if its standby_name is not listed
  in standbys.conf.
    * Multiple standbys with the same name are allowed.

# The valid values of SYNCHRONOUS field are async, recv, fsync and replay.
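
As a rough illustration of the rules above, a single standbys.conf line could be validated along these lines. This is a simplified standalone sketch, not the patch's parser (which reuses the pg_hba.conf tokenizer): the function and type names are hypothetical, and the plain "name mode" layout with '#' comment lines is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_NAME_MAX 64		/* mirrors MAXSTANDBYNAME in the patch */

typedef struct SketchStandbyEntry
{
	char		name[SKETCH_NAME_MAX];
	char		mode[8];
} SketchStandbyEntry;

/*
 * Parse one line of the form "<standby_name> <mode>".
 * Returns 1 if an entry was parsed, 0 for a blank or comment line,
 * and -1 if the line is malformed or the mode is not one of
 * async, recv, fsync or replay.  Tokens after the mode are ignored.
 */
static int
sketch_parse_standbys_line(const char *line, SketchStandbyEntry *out)
{
	static const char *const valid[] = {"async", "recv", "fsync", "replay"};
	char		name[SKETCH_NAME_MAX];
	char		mode[8];
	size_t		i;

	assert(line != NULL && out != NULL);

	/* Skip leading whitespace, then ignore blank and comment lines. */
	while (*line == ' ' || *line == '\t')
		line++;
	if (*line == '\0' || *line == '\n' || *line == '#')
		return 0;

	/* Field widths leave room for the terminating NUL. */
	if (sscanf(line, "%63s %7s", name, mode) != 2)
		return -1;

	for (i = 0; i < sizeof(valid) / sizeof(valid[0]); i++)
	{
		if (strcmp(mode, valid[i]) == 0)
		{
			strcpy(out->name, name);
			strcpy(out->mode, mode);
			return 1;
		}
	}
	return -1;
}
```

For example, the line "standby1 fsync" yields an entry, while "standby1 sync" is rejected because "sync" is not one of the four valid values.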

standby_name
============
# This is a new string-typed parameter in recovery.conf.
    * XXX: Should standby_name and standby_mode be merged?

# Walreceiver sends this to the master when establishing the connection.
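
One way to picture how the name could reach the master is as an extra keyword in the libpq connection string, next to the existing replication option. The sketch below is an assumption about the mechanics, not the patch's actual code: sketch_build_conninfo is a hypothetical helper, and only the standby_name and replication keywords come from the patch and libpq.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<base> replication=true standby_name='<name>'" into out.
 * The name is single-quoted, with ' and \ backslash-escaped, following
 * libpq's conninfo quoting rules.  If standby_name is NULL or empty,
 * the keyword is omitted.  Returns 0 on success, or -1 if out is too
 * small (its contents are then unspecified).
 */
static int
sketch_build_conninfo(char *out, size_t outlen,
					  const char *base, const char *standby_name)
{
	size_t		pos;
	int			n;
	const char *p;

	assert(out != NULL && base != NULL);

	n = snprintf(out, outlen, "%s replication=true", base);
	if (n < 0 || (size_t) n >= outlen)
		return -1;
	pos = (size_t) n;

	if (standby_name == NULL || standby_name[0] == '\0')
		return 0;

	n = snprintf(out + pos, outlen - pos, " standby_name='");
	if (n < 0 || (size_t) n >= outlen - pos)
		return -1;
	pos += (size_t) n;

	for (p = standby_name; *p != '\0'; p++)
	{
		/* Escape quote and backslash characters inside the value. */
		if (*p == '\'' || *p == '\\')
		{
			if (pos + 1 >= outlen)
				return -1;
			out[pos++] = '\\';
		}
		if (pos + 1 >= outlen)
			return -1;
		out[pos++] = *p;
	}

	/* Closing quote plus terminating NUL. */
	if (pos + 1 >= outlen)
		return -1;
	out[pos++] = '\'';
	out[pos] = '\0';
	return 0;
}
```

Quoting the value keeps names that contain spaces intact when libpq parses the connection string.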

The attached patch implements the above and simple synchronous replication
feature, which doesn't include quorum commit capability. The replication
mode (async, recv, fsync, replay) can be specified on a per-standby basis,
in standbys.conf.

The patch still uses a poll loop in the backend, walsender, startup process
and walreceiver. Once the latch feature Heikki proposed has been committed,
I'll replace the poll loops with latches.

Now that the latch patch is in, when do you think you'll be able to use it
instead of the poll loop?


#55

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: David Fetter (#54)

1 attachment(s)

Re: Synchronous replication - patch status inquiry

On Wed, Sep 15, 2010 at 6:38 AM, David Fetter <david@fetter.org> wrote:

Now that the latch patch is in, when do you think you'll be able to use it
instead of the poll loop?

Here is the updated version, which uses a latch in communication from
walsender to backend. I've not changed the others, because walsender
already uses a latch in HEAD, and Heikki has already proposed a patch that
replaces the poll loop between walreceiver and the startup process with
a latch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

synchrep_0915.patch (application/octet-stream)

*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
***************
*** 1291,1298 ****
  To initiate streaming replication, the frontend sends the
  <literal>replication</> parameter in the startup message. This tells the
  backend to go into walsender mode, wherein a small set of replication commands
! can be issued instead of SQL statements. Only the simple query protocol can be
! used in walsender mode.
  
  The commands accepted in walsender mode are:
  
--- 1291,1299 ----
  To initiate streaming replication, the frontend sends the
  <literal>replication</> parameter in the startup message. This tells the
  backend to go into walsender mode, wherein a small set of replication commands
! can be issued instead of SQL statements. The startup message also includes the
! <literal>standby_name</> parameter if it is supplied in <filename>recovery.conf</>.
! Only the simple query protocol can be used in walsender mode.
  
  The commands accepted in walsender mode are:
  
***************
*** 1360,1365 **** The commands accepted in walsender mode are:
--- 1361,1401 ----
        <variablelist>
        <varlistentry>
        <term>
+           XLogRecPtr (F)
+       </term>
+       <listitem>
+       <para>
+       <variablelist>
+       <varlistentry>
+       <term>
+           Byte1('l')
+       </term>
+       <listitem>
+       <para>
+           Identifies the message as an acknowledgment of replication.
+       </para>
+       </listitem>
+       </varlistentry>
+       <varlistentry>
+       <term>
+           Byte8
+       </term>
+       <listitem>
+       <para>
+           The end of the WAL data replicated to the standby, given in
+           XLogRecPtr format.
+       </para>
+       </listitem>
+       </varlistentry>
+       </variablelist>
+       </para>
+       </listitem>
+       </varlistentry>
+       </variablelist>
+ 
+       <variablelist>
+       <varlistentry>
+       <term>
            XLogData (B)
        </term>
        <listitem>
*** a/doc/src/sgml/recovery-config.sgml
--- b/doc/src/sgml/recovery-config.sgml
***************
*** 243,248 **** restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
--- 243,259 ----
           </para>
          </listitem>
         </varlistentry>
+        <varlistentry id="standby-name" xreflabel="standby_name">
+         <term><varname>standby_name</varname> (<type>string</type>)</term>
+         <indexterm>
+           <primary><varname>standby_name</> recovery parameter</primary>
+         </indexterm>
+         <listitem>
+          <para>
+           Specifies the name of the standby server.
+          </para>
+         </listitem>
+        </varlistentry>
         <varlistentry id="primary-conninfo" xreflabel="primary_conninfo">
          <term><varname>primary_conninfo</varname> (<type>string</type>)</term>
          <indexterm>
*** a/src/backend/Makefile
--- b/src/backend/Makefile
***************
*** 208,213 **** endif
--- 208,214 ----
  	$(INSTALL_DATA) $(srcdir)/libpq/pg_ident.conf.sample '$(DESTDIR)$(datadir)/pg_ident.conf.sample'
  	$(INSTALL_DATA) $(srcdir)/utils/misc/postgresql.conf.sample '$(DESTDIR)$(datadir)/postgresql.conf.sample'
  	$(INSTALL_DATA) $(srcdir)/access/transam/recovery.conf.sample '$(DESTDIR)$(datadir)/recovery.conf.sample'
+ 	$(INSTALL_DATA) $(srcdir)/replication/standbys.conf.sample '$(DESTDIR)$(datadir)/standbys.conf.sample'
  
  install-bin: postgres $(POSTGRES_IMP) installdirs
  	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postgres$(X)'
***************
*** 262,268 **** endif
  	rm -f '$(DESTDIR)$(datadir)/pg_hba.conf.sample' \
  	      '$(DESTDIR)$(datadir)/pg_ident.conf.sample' \
                '$(DESTDIR)$(datadir)/postgresql.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/recovery.conf.sample'
  
  
  ##########################################################################
--- 263,270 ----
  	rm -f '$(DESTDIR)$(datadir)/pg_hba.conf.sample' \
  	      '$(DESTDIR)$(datadir)/pg_ident.conf.sample' \
                '$(DESTDIR)$(datadir)/postgresql.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/recovery.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/standbys.conf.sample'
  
  
  ##########################################################################
*** a/src/backend/access/transam/recovery.conf.sample
--- b/src/backend/access/transam/recovery.conf.sample
***************
*** 91,102 ****
  #---------------------------------------------------------------------------
  #
  # When standby_mode is enabled, the PostgreSQL server will work as
! # a standby. It tries to connect to the primary according to the
! # connection settings primary_conninfo, and receives XLOG records
! # continuously.
  #
  #standby_mode = 'off'
  #
  #primary_conninfo = ''		# e.g. 'host=localhost port=5432'
  #
  #
--- 91,104 ----
  #---------------------------------------------------------------------------
  #
  # When standby_mode is enabled, the PostgreSQL server will work as
! # a standby under the name of standby_name. It tries to connect to
! # the primary according to the connection settings primary_conninfo,
! # and receives XLOG records continuously.
  #
  #standby_mode = 'off'
  #
+ #standby_name = ''
+ #
  #primary_conninfo = ''		# e.g. 'host=localhost port=5432'
  #
  #
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1070,1075 **** EndPrepare(GlobalTransaction gxact)
--- 1070,1087 ----
  
  	END_CRIT_SECTION();
  
+ 	/*
+ 	 * Wait for WAL to be replicated up to the PREPARE record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the PREPARE record is generated and before other
+ 	 * transactions know that this one has already been prepared.
+ 	 *
+ 	 * XXX: Since the caller prevents cancel/die interrupts, we cannot
+ 	 * process them while waiting. Should we remove this restriction?
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(gxact->prepare_lsn);
+ 
  	records.tail = records.head = NULL;
  }
  
***************
*** 2027,2032 **** RecordTransactionCommitPrepared(TransactionId xid,
--- 2039,2053 ----
  	MyProc->inCommit = false;
  
  	END_CRIT_SECTION();
+ 
+ 	/*
+ 	 * Wait for WAL to be replicated up to the COMMIT PREPARED record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the COMMIT PREPARED record is generated and before other
+ 	 * transactions know that this one has already been committed.
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(recptr);
  }
  
  /*
***************
*** 2106,2109 **** RecordTransactionAbortPrepared(TransactionId xid,
--- 2127,2139 ----
  	TransactionIdAbortTree(xid, nchildren, children);
  
  	END_CRIT_SECTION();
+ 
+ 	/*
+ 	 * Wait for WAL to be replicated up to the ABORT PREPARED record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the ABORT PREPARED record is generated and before other
+ 	 * transactions know that this one has already been aborted.
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(recptr);
  }
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1118,1123 **** RecordTransactionCommit(void)
--- 1118,1135 ----
  	/* Compute latestXid while we have the child XIDs handy */
  	latestXid = TransactionIdLatest(xid, nchildren, children);
  
+ 	/*
+ 	 * Wait for WAL to be replicated up to the COMMIT record if replication
+ 	 * is enabled. This operation has to be performed after the COMMIT record
+ 	 * is generated and before other transactions know that this one has
+ 	 * already been committed.
+ 	 *
+ 	 * XXX: Since the caller prevents cancel/die interrupts, we cannot
+ 	 * process them while waiting. Should we remove this restriction?
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(XactLastRecEnd);
+ 
  	/* Reset XactLastRecEnd until the next transaction writes something */
  	XactLastRecEnd.xrecoff = 0;
  
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 189,194 **** static TimestampTz recoveryTargetTime;
--- 189,195 ----
  static bool StandbyMode = false;
  static char *PrimaryConnInfo = NULL;
  static char *TriggerFile = NULL;
+ static char *StandbyName = NULL;
  
  /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
  static TransactionId recoveryStopXid;
***************
*** 532,537 **** typedef struct xl_parameter_change
--- 533,548 ----
  	int			wal_level;
  } xl_parameter_change;
  
+ /* Replication mode names */
+ const char *ReplicationModeNames[] = {
+ 	"async",				/* REPLICATION_MODE_ASYNC */
+ 	"recv",				/* REPLICATION_MODE_RECV */
+ 	"fsync",				/* REPLICATION_MODE_FSYNC */
+ 	"replay"				/* REPLICATION_MODE_REPLAY */
+ };
+ 
+ ReplicationMode		rplMode = InvalidReplicationMode;
+ 
  /*
   * Flags set by interrupt handlers for later service in the redo loop.
   */
***************
*** 5258,5263 **** readRecoveryCommandFile(void)
--- 5269,5281 ----
  					(errmsg("trigger_file = '%s'",
  							TriggerFile)));
  		}
+ 		else if (strcmp(tok1, "standby_name") == 0)
+ 		{
+ 			StandbyName = pstrdup(tok2);
+ 			ereport(DEBUG2,
+ 					(errmsg("standby_name = '%s'",
+ 							StandbyName)));
+ 		}
  		else
  			ereport(FATAL,
  					(errmsg("unrecognized recovery parameter \"%s\"",
***************
*** 6867,6872 **** GetFlushRecPtr(void)
--- 6885,6907 ----
  }
  
  /*
+  * GetReplayRecPtr -- Returns the last replay position.
+  */
+ XLogRecPtr
+ GetReplayRecPtr(void)
+ {
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogRecPtr	recptr;
+ 
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	recptr = xlogctl->recoveryLastRecPtr;
+ 	SpinLockRelease(&xlogctl->info_lck);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Get the time of the last xlog segment switch
   */
  pg_time_t
***************
*** 8828,8842 **** pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
  Datum
  pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
  {
- 	/* use volatile pointer to prevent code rearrangement */
- 	volatile XLogCtlData *xlogctl = XLogCtl;
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
! 	SpinLockAcquire(&xlogctl->info_lck);
! 	recptr = xlogctl->recoveryLastRecPtr;
! 	SpinLockRelease(&xlogctl->info_lck);
! 
  	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  		PG_RETURN_NULL();
  
--- 8863,8872 ----
  Datum
  pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
  {
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
! 	recptr = GetReplayRecPtr();
  	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  		PG_RETURN_NULL();
  
***************
*** 9467,9473 **** retry:
  						{
  							RequestXLogStreaming(
  									  fetching_ckpt ? RedoStartLSN : *RecPtr,
! 												 PrimaryConnInfo);
  							continue;
  						}
  					}
--- 9497,9503 ----
  						{
  							RequestXLogStreaming(
  									  fetching_ckpt ? RedoStartLSN : *RecPtr,
! 												 PrimaryConnInfo, StandbyName);
  							continue;
  						}
  					}
***************
*** 9681,9683 **** CheckForStandbyTrigger(void)
--- 9711,9727 ----
  	}
  	return false;
  }
+ 
+ /*
+  * Look up replication mode value by name.
+  */
+ ReplicationMode
+ ReplicationModeNameGetValue(char *name)
+ {
+ 	ReplicationMode	mode;
+ 
+ 	for (mode = 0; mode <= MAXREPLICATIONMODE; mode++)
+ 		if (strcmp(ReplicationModeNames[mode], name) == 0)
+ 			return mode;
+ 	return InvalidReplicationMode;
+ }
*** a/src/backend/libpq/hba.c
--- b/src/backend/libpq/hba.c
***************
*** 38,46 ****
  #define atooid(x)  ((Oid) strtoul((x), NULL, 10))
  #define atoxid(x)  ((TransactionId) strtoul((x), NULL, 10))
  
- /* This is used to separate values in multi-valued column strings */
- #define MULTI_VALUE_SEP "\001"
- 
  #define MAX_TOKEN	256
  
  /* callback data for check_network_callback */
--- 38,43 ----
***************
*** 54,59 **** typedef struct check_network_data
--- 51,59 ----
  /* pre-parsed content of HBA config file: list of HbaLine structs */
  static List *parsed_hba_lines = NIL;
  
+ static const char *hba_keywords[] = {"all", "sameuser", "samegroup", "samerole",
+ 									 "replication", NULL};
+ 
  /*
   * These variables hold the pre-parsed contents of the ident usermap
   * configuration file.	ident_lines is a list of sublists, one sublist for
***************
*** 67,76 **** static List *ident_lines = NIL;
  static List *ident_line_nums = NIL;
  
  
- static void tokenize_file(const char *filename, FILE *file,
- 			  List **lines, List **line_nums);
  static char *tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename);
  
  /*
   * isblank() exists in the ISO C99 spec, but it's not very portable yet,
--- 67,74 ----
  static List *ident_line_nums = NIL;
  
  
  static char *tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename, const char **keywords);
  
  /*
   * isblank() exists in the ISO C99 spec, but it's not very portable yet,
***************
*** 108,114 **** pg_isblank(const char c)
   * token.
   */
  static bool
! next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  {
  	int			c;
  	char	   *start_buf = buf;
--- 106,113 ----
   * token.
   */
  static bool
! next_token(const char *filename, FILE *fp, char *buf, int bufsz,
! 		   bool *initial_quote, const char **keywords)
  {
  	int			c;
  	char	   *start_buf = buf;
***************
*** 155,162 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  			*buf = '\0';
  			ereport(LOG,
  					(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 			   errmsg("authentication file token too long, skipping: \"%s\"",
! 					  start_buf)));
  			/* Discard remainder of line */
  			while ((c = getc(fp)) != EOF && c != '\n')
  				;
--- 154,161 ----
  			*buf = '\0';
  			ereport(LOG,
  					(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 			   errmsg("configuration file \"%s\" token too long, skipping: \"%s\"",
! 					  filename, start_buf)));
  			/* Discard remainder of line */
  			while ((c = getc(fp)) != EOF && c != '\n')
  				;
***************
*** 196,211 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  
  	*buf = '\0';
  
! 	if (!saw_quote &&
! 		(strcmp(start_buf, "all") == 0 ||
! 		 strcmp(start_buf, "sameuser") == 0 ||
! 		 strcmp(start_buf, "samegroup") == 0 ||
! 		 strcmp(start_buf, "samerole") == 0 ||
! 		 strcmp(start_buf, "replication") == 0))
  	{
! 		/* append newline to a magical keyword */
! 		*buf++ = '\n';
! 		*buf = '\0';
  	}
  
  	return (saw_quote || buf > start_buf);
--- 195,214 ----
  
  	*buf = '\0';
  
! 	if (!saw_quote)
  	{
! 		const char	**entry;
! 
! 		for (entry = keywords; *entry != NULL; entry++)
! 		{
! 			if (strcmp(start_buf, *entry) == 0)
! 			{
! 				/* append newline to a magical keyword */
! 				*buf++ = '\n';
! 				*buf = '\0';
! 				break;
! 			}
! 		}
  	}
  
  	return (saw_quote || buf > start_buf);
***************
*** 219,225 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
   * The result is a palloc'd string, or NULL if we have reached EOL.
   */
  static char *
! next_token_expand(const char *filename, FILE *file)
  {
  	char		buf[MAX_TOKEN];
  	char	   *comma_str = pstrdup("");
--- 222,228 ----
   * The result is a palloc'd string, or NULL if we have reached EOL.
   */
  static char *
! next_token_expand(const char *filename, FILE *file, const char **keywords)
  {
  	char		buf[MAX_TOKEN];
  	char	   *comma_str = pstrdup("");
***************
*** 231,237 **** next_token_expand(const char *filename, FILE *file)
  
  	do
  	{
! 		if (!next_token(file, buf, sizeof(buf), &initial_quote))
  			break;
  
  		got_something = true;
--- 234,241 ----
  
  	do
  	{
! 		if (!next_token(filename, file, buf, sizeof(buf), &initial_quote,
! 						keywords))
  			break;
  
  		got_something = true;
***************
*** 246,252 **** next_token_expand(const char *filename, FILE *file)
  
  		/* Is this referencing a file? */
  		if (!initial_quote && buf[0] == '@' && buf[1] != '\0')
! 			incbuf = tokenize_inc_file(filename, buf + 1);
  		else
  			incbuf = pstrdup(buf);
  
--- 250,256 ----
  
  		/* Is this referencing a file? */
  		if (!initial_quote && buf[0] == '@' && buf[1] != '\0')
! 			incbuf = tokenize_inc_file(filename, buf + 1, keywords);
  		else
  			incbuf = pstrdup(buf);
  
***************
*** 273,279 **** next_token_expand(const char *filename, FILE *file)
  /*
   * Free memory used by lines/tokens (i.e., structure built by tokenize_file)
   */
! static void
  free_lines(List **lines, List **line_nums)
  {
  	/*
--- 277,283 ----
  /*
   * Free memory used by lines/tokens (i.e., structure built by tokenize_file)
   */
! void
  free_lines(List **lines, List **line_nums)
  {
  	/*
***************
*** 318,324 **** free_lines(List **lines, List **line_nums)
  
  static char *
  tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename)
  {
  	char	   *inc_fullname;
  	FILE	   *inc_file;
--- 322,328 ----
  
  static char *
  tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename, const char **keywords)
  {
  	char	   *inc_fullname;
  	FILE	   *inc_file;
***************
*** 348,354 **** tokenize_inc_file(const char *outer_filename,
  	{
  		ereport(LOG,
  				(errcode_for_file_access(),
! 				 errmsg("could not open secondary authentication file \"@%s\" as \"%s\": %m",
  						inc_filename, inc_fullname)));
  		pfree(inc_fullname);
  
--- 352,358 ----
  	{
  		ereport(LOG,
  				(errcode_for_file_access(),
! 				 errmsg("could not open secondary configuration file \"@%s\" as \"%s\": %m",
  						inc_filename, inc_fullname)));
  		pfree(inc_fullname);
  
***************
*** 357,363 **** tokenize_inc_file(const char *outer_filename,
  	}
  
  	/* There is possible recursion here if the file contains @ */
! 	tokenize_file(inc_fullname, inc_file, &inc_lines, &inc_line_nums);
  
  	FreeFile(inc_file);
  	pfree(inc_fullname);
--- 361,368 ----
  	}
  
  	/* There is possible recursion here if the file contains @ */
! 	tokenize_file(inc_fullname, inc_file, &inc_lines, &inc_line_nums,
! 				  keywords);
  
  	FreeFile(inc_file);
  	pfree(inc_fullname);
***************
*** 404,412 **** tokenize_inc_file(const char *outer_filename,
   *
   * filename must be the absolute path to the target file.
   */
! static void
  tokenize_file(const char *filename, FILE *file,
! 			  List **lines, List **line_nums)
  {
  	List	   *current_line = NIL;
  	int			line_number = 1;
--- 409,417 ----
   *
   * filename must be the absolute path to the target file.
   */
! void
  tokenize_file(const char *filename, FILE *file,
! 			  List **lines, List **line_nums, const char **keywords)
  {
  	List	   *current_line = NIL;
  	int			line_number = 1;
***************
*** 416,422 **** tokenize_file(const char *filename, FILE *file,
  
  	while (!feof(file) && !ferror(file))
  	{
! 		buf = next_token_expand(filename, file);
  
  		/* add token to list, unless we are at EOL or comment start */
  		if (buf)
--- 421,427 ----
  
  	while (!feof(file) && !ferror(file))
  	{
! 		buf = next_token_expand(filename, file, keywords);
  
  		/* add token to list, unless we are at EOL or comment start */
  		if (buf)
***************
*** 1490,1496 **** load_hba(void)
  		return false;
  	}
  
! 	tokenize_file(HbaFileName, file, &hba_lines, &hba_line_nums);
  	FreeFile(file);
  
  	/* Now parse all the lines */
--- 1495,1501 ----
  		return false;
  	}
  
! 	tokenize_file(HbaFileName, file, &hba_lines, &hba_line_nums, hba_keywords);
  	FreeFile(file);
  
  	/* Now parse all the lines */
***************
*** 1809,1815 **** load_ident(void)
  	}
  	else
  	{
! 		tokenize_file(IdentFileName, file, &ident_lines, &ident_line_nums);
  		FreeFile(file);
  	}
  }
--- 1814,1821 ----
  	}
  	else
  	{
! 		tokenize_file(IdentFileName, file, &ident_lines, &ident_line_nums,
! 					  hba_keywords);
  		FreeFile(file);
  	}
  }
*** a/src/backend/port/unix_latch.c
--- b/src/backend/port/unix_latch.c
***************
*** 310,315 **** SetLatch(volatile Latch *latch)
--- 310,325 ----
  }
  
  /*
+  * Signal the given reason, in addition to SetLatch.
+  */
+ void
+ SetProcLatch(volatile Latch *latch, ProcSignalReason reason, BackendId backendId)
+ {
+ 	SetProcSignalReason(latch->owner_pid, reason, backendId);
+ 	SetLatch(latch);
+ }
+ 
+ /*
   * Clear the latch. Calling WaitLatch after this will sleep, unless
   * the latch is set again before the WaitLatch call.
   */
*** a/src/backend/port/win32_latch.c
--- b/src/backend/port/win32_latch.c
***************
*** 49,54 **** InitLatch(volatile Latch *latch)
--- 49,55 ----
  	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
  	latch->is_shared = false;
  	latch->is_set = false;
+ 	latch->owner_pid = MyProcPid;
  }
  
  void
***************
*** 57,62 **** InitSharedLatch(volatile Latch *latch)
--- 58,64 ----
  	latch->is_shared = true;
  	latch->is_set = false;
  	latch->event = NULL;
+ 	latch->owner_pid = 0;
  }
  
  void
***************
*** 81,86 **** OwnLatch(volatile Latch *latch)
--- 83,89 ----
  	SpinLockRelease(&sharedHandles->mutex);
  
  	latch->event = event;
+ 	latch->owner_pid = MyProcPid;
  }
  
  void
***************
*** 88,93 **** DisownLatch(volatile Latch *latch)
--- 91,97 ----
  {
  	Assert(latch->is_shared);
  	Assert(latch->event != NULL);
+ 	Assert(latch->owner_pid == MyProcPid);
  
  	/* Put the event handle back to the pool */
  	SpinLockAcquire(&sharedHandles->mutex);
***************
*** 101,106 **** DisownLatch(volatile Latch *latch)
--- 105,111 ----
  	SpinLockRelease(&sharedHandles->mutex);
  
  	latch->event = NULL;
+ 	latch->owner_pid = 0;
  }
  
  bool
***************
*** 119,124 **** WaitLatchOrSocket(volatile Latch *latch, SOCKET sock, long timeout)
--- 124,132 ----
  	int			numevents;
  	int			result = 0;
  
+ 	if (latch->owner_pid != MyProcPid)
+ 		elog(ERROR, "cannot wait on a latch owned by another process");
+ 
  	latchevent = latch->event;
  
  	events[0] = latchevent;
***************
*** 212,220 **** SetLatch(volatile Latch *latch)
--- 220,241 ----
  	}
  }
  
+ /*
+  * Signal the given reason, in addition to SetLatch.
+  */
+ void
+ SetProcLatch(volatile Latch *latch, ProcSignalReason reason, BackendId backendId)
+ {
+ 	SetProcSignalReason(latch->owner_pid, reason, backendId);
+ 	SetLatch(latch);
+ }
+ 
  void
  ResetLatch(volatile Latch *latch)
  {
+ 	/* Only the owner should reset the latch */
+ 	Assert(latch->owner_pid == MyProcPid);
+ 
  	latch->is_set = false;
  }
  
***************
*** 231,236 **** NumSharedLatches(void)
--- 252,260 ----
  	/* Each walsender needs one latch */
  	numLatches += max_wal_senders;
  
+ 	/* Each backend needs one latch */
+ 	numLatches += MaxBackends;
+ 
  	return numLatches;
  }
  
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 1063,1069 **** PostmasterMain(int argc, char *argv[])
  	autovac_init();
  
  	/*
! 	 * Load configuration files for client authentication.
  	 */
  	if (!load_hba())
  	{
--- 1063,1069 ----
  	autovac_init();
  
  	/*
! 	 * Load configuration files for client authentication and replication.
  	 */
  	if (!load_hba())
  	{
***************
*** 1075,1080 **** PostmasterMain(int argc, char *argv[])
--- 1075,1085 ----
  				(errmsg("could not load pg_hba.conf")));
  	}
  	load_ident();
+ 	if (max_wal_senders > 0 && !load_standbys())
+ 	{
+ 		ereport(FATAL,
+ 				(errmsg("could not load standbys.conf")));
+ 	}
  
  	/*
  	 * Remember postmaster startup time
***************
*** 1713,1718 **** retry1:
--- 1718,1725 ----
  							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  							 errmsg("invalid value for boolean option \"replication\"")));
  			}
+ 			else if (strcmp(nameptr, "standby_name") == 0)
+ 				standby_name = pstrdup(valptr);
  			else
  			{
  				/* Assume it's a generic GUC option */
***************
*** 2129,2134 **** SIGHUP_handler(SIGNAL_ARGS)
--- 2136,2146 ----
  
  		load_ident();
  
+ 		/* Reload standbys configuration file too */
+ 		if (max_wal_senders > 0 && !load_standbys())
+ 			ereport(WARNING,
+ 					(errmsg("standbys.conf not reloaded")));
+ 
  #ifdef EXEC_BACKEND
  		/* Update the starting-point file for future children */
  		write_nondefault_variables(PGC_SIGHUP);
*** a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
--- b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
***************
*** 47,55 **** static bool justconnected = false;
  static char *recvBuf = NULL;
  
  /* Prototypes for interface functions */
! static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
  				 char **buffer, int *len);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
--- 47,57 ----
  static char *recvBuf = NULL;
  
  /* Prototypes for interface functions */
! static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint,
! 							 char *standbyName);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
  				 char **buffer, int *len);
+ static void libpqrcv_send(const char *buffer, int nbytes);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
***************
*** 64,73 **** _PG_init(void)
  {
  	/* Tell walreceiver how to reach us */
  	if (walrcv_connect != NULL || walrcv_receive != NULL ||
! 		walrcv_disconnect != NULL)
  		elog(ERROR, "libpqwalreceiver already loaded");
  	walrcv_connect = libpqrcv_connect;
  	walrcv_receive = libpqrcv_receive;
  	walrcv_disconnect = libpqrcv_disconnect;
  }
  
--- 66,76 ----
  {
  	/* Tell walreceiver how to reach us */
  	if (walrcv_connect != NULL || walrcv_receive != NULL ||
! 		walrcv_send != NULL || walrcv_disconnect != NULL)
  		elog(ERROR, "libpqwalreceiver already loaded");
  	walrcv_connect = libpqrcv_connect;
  	walrcv_receive = libpqrcv_receive;
+ 	walrcv_send = libpqrcv_send;
  	walrcv_disconnect = libpqrcv_disconnect;
  }
  
***************
*** 75,98 **** _PG_init(void)
   * Establish the connection to the primary server for XLOG streaming
   */
  static bool
! libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  {
! 	char		conninfo_repl[MAXCONNINFO + 37];
  	char	   *primary_sysid;
  	char		standby_sysid[32];
  	TimeLineID	primary_tli;
  	TimeLineID	standby_tli;
  	PGresult   *res;
  	char		cmd[64];
  
  	/*
! 	 * Connect using deliberately undocumented parameter: replication. The
! 	 * database name is ignored by the server in replication mode, but specify
! 	 * "replication" for .pgpass lookup.
  	 */
! 	snprintf(conninfo_repl, sizeof(conninfo_repl),
! 			 "%s dbname=replication replication=true",
! 			 conninfo);
  
  	streamConn = PQconnectdb(conninfo_repl);
  	if (PQstatus(streamConn) != CONNECTION_OK)
--- 78,107 ----
   * Establish the connection to the primary server for XLOG streaming
   */
  static bool
! libpqrcv_connect(char *conninfo, XLogRecPtr startpoint, char *standbyName)
  {
! 	char		conninfo_repl[MAXCONNINFO + MAXSTANDBYNAME + 37];
  	char	   *primary_sysid;
  	char		standby_sysid[32];
  	TimeLineID	primary_tli;
  	TimeLineID	standby_tli;
+ 	char	   *primary_rplMode;
  	PGresult   *res;
  	char		cmd[64];
  
  	/*
! 	 * Connect using deliberately undocumented parameters: replication
! 	 * and standby_name. The database name is ignored by the server in
! 	 * replication mode, but specify "replication" for .pgpass lookup.
  	 */
! 	if (standbyName && standbyName[0] != '\0')
! 		snprintf(conninfo_repl, sizeof(conninfo_repl),
! 				 "%s dbname=replication replication=true standby_name='%s'",
! 				 conninfo, standbyName);
! 	else
! 		snprintf(conninfo_repl, sizeof(conninfo_repl),
! 				 "%s dbname=replication replication=true",
! 				 conninfo);
  
  	streamConn = PQconnectdb(conninfo_repl);
  	if (PQstatus(streamConn) != CONNECTION_OK)
***************
*** 109,119 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  	{
  		PQclear(res);
  		ereport(ERROR,
! 				(errmsg("could not receive database system identifier and timeline ID from "
! 						"the primary server: %s",
  						PQerrorMessage(streamConn))));
  	}
! 	if (PQnfields(res) != 2 || PQntuples(res) != 1)
  	{
  		int			ntuples = PQntuples(res);
  		int			nfields = PQnfields(res);
--- 118,128 ----
  	{
  		PQclear(res);
  		ereport(ERROR,
! 				(errmsg("could not receive database system identifier, timeline ID and "
! 						"replication mode from the primary server: %s",
  						PQerrorMessage(streamConn))));
  	}
! 	if (PQnfields(res) != 3 || PQntuples(res) != 1)
  	{
  		int			ntuples = PQntuples(res);
  		int			nfields = PQnfields(res);
***************
*** 121,131 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  		PQclear(res);
  		ereport(ERROR,
  				(errmsg("invalid response from primary server"),
! 				 errdetail("Expected 1 tuple with 2 fields, got %d tuples with %d fields.",
  						   ntuples, nfields)));
  	}
  	primary_sysid = PQgetvalue(res, 0, 0);
  	primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);
  
  	/*
  	 * Confirm that the system identifier of the primary is the same as ours.
--- 130,141 ----
  		PQclear(res);
  		ereport(ERROR,
  				(errmsg("invalid response from primary server"),
! 				 errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
  						   ntuples, nfields)));
  	}
  	primary_sysid = PQgetvalue(res, 0, 0);
  	primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);
+ 	primary_rplMode = PQgetvalue(res, 0, 2);
  
  	/*
  	 * Confirm that the system identifier of the primary is the same as ours.
***************
*** 146,158 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  	 * recovery target timeline.
  	 */
  	standby_tli = GetRecoveryTargetTLI();
- 	PQclear(res);
  	if (primary_tli != standby_tli)
  		ereport(ERROR,
  				(errmsg("timeline %u of the primary does not match recovery target timeline %u",
  						primary_tli, standby_tli)));
  	ThisTimeLineID = primary_tli;
  
  	/* Start streaming from the point requested by startup process */
  	snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
--- 156,180 ----
  	 * recovery target timeline.
  	 */
  	standby_tli = GetRecoveryTargetTLI();
  	if (primary_tli != standby_tli)
+ 	{
+ 		PQclear(res);
  		ereport(ERROR,
  				(errmsg("timeline %u of the primary does not match recovery target timeline %u",
  						primary_tli, standby_tli)));
+ 	}
  	ThisTimeLineID = primary_tli;
  
+ 	/*
+ 	 * Confirm that the passed replication mode is valid.
+ 	 */
+ 	rplMode = ReplicationModeNameGetValue(primary_rplMode);
+ 	PQclear(res);
+ 	if (rplMode == InvalidReplicationMode)
+ 		ereport(ERROR,
+ 				(errmsg("invalid replication mode \"%s\"",
+ 						primary_rplMode)));
+ 
  	/* Start streaming from the point requested by startup process */
  	snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
***************
*** 398,400 **** libpqrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
--- 420,437 ----
  
  	return true;
  }
+ 
+ /*
+  * Send a message to the XLOG stream.
+  *
+  * ereports on error.
+  */
+ static void
+ libpqrcv_send(const char *buffer, int nbytes)
+ {
+ 	if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
+ 		PQflush(streamConn))
+ 		ereport(ERROR,
+ 				(errmsg("could not send data to WAL stream: %s",
+ 						PQerrorMessage(streamConn))));
+ }
*** /dev/null
--- b/src/backend/replication/standbys.conf.sample
***************
*** 0 ****
--- 1,35 ----
+ # PostgreSQL Standbys Configuration File
+ # ===================================================
+ #
+ # Refer to the "Streaming Replication" section in the PostgreSQL
+ # documentation for a complete description of this file.  A short
+ # synopsis follows.
+ #
+ # This file controls which replication mode each standby uses.
+ # Records are of the form:
+ #
+ # STANDBY-NAME  REPLICATION-MODE
+ #
+ # (The uppercase items must be replaced by actual values.)
+ #
+ # STANDBY-NAME can be "all", a standby name, or a comma-separated list
+ # thereof.
+ #
+ # REPLICATION-MODE specifies how long transaction commit waits for
+ # replication before the commit command returns a "success" to a
+ # client. The valid modes are "async", "recv", "fsync" and "replay".
+ #
+ # Standby names containing spaces, commas, quotes and other special
+ # characters must be quoted.  Quoting the keyword "all" makes it lose
+ # its special meaning, so that it just matches a standby with that
+ # name.
+ #
+ # This file is read on server startup and when the postmaster receives
+ # a SIGHUP signal.  If you edit the file on a running system, you have
+ # to SIGHUP the postmaster for the changes to take effect.  You can
+ # use "pg_ctl reload" to do that.
+ 
+ # Put your actual configuration here
+ # ----------------------------------
+ 
+ # STANDBY-NAME       REPLICATION-MODE
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 57,62 **** bool		am_walreceiver;
--- 57,63 ----
  /* libpqreceiver hooks to these when loaded */
  walrcv_connect_type walrcv_connect = NULL;
  walrcv_receive_type walrcv_receive = NULL;
+ walrcv_send_type walrcv_send = NULL;
  walrcv_disconnect_type walrcv_disconnect = NULL;
  
  #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
***************
*** 113,118 **** static void WalRcvDie(int code, Datum arg);
--- 114,120 ----
  static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
  static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
  static void XLogWalRcvFlush(void);
+ static void XLogWalRcvSendRecPtr(XLogRecPtr recptr);
  
  /* Signal handlers */
  static void WalRcvSigHupHandler(SIGNAL_ARGS);
***************
*** 158,164 **** void
--- 160,168 ----
  WalReceiverMain(void)
  {
  	char		conninfo[MAXCONNINFO];
+ 	char		standbyName[MAXSTANDBYNAME];
  	XLogRecPtr	startpoint;
+ 	XLogRecPtr	ackedpoint = {0, 0};
  
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
***************
*** 206,211 **** WalReceiverMain(void)
--- 210,216 ----
  
  	/* Fetch information required to start streaming */
  	strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
+ 	strlcpy(standbyName, (char *) walrcv->standbyName, MAXSTANDBYNAME);
  	startpoint = walrcv->receivedUpto;
  	SpinLockRelease(&walrcv->mutex);
  
***************
*** 247,253 **** WalReceiverMain(void)
  	/* Load the libpq-specific functions */
  	load_file("libpqwalreceiver", false);
  	if (walrcv_connect == NULL || walrcv_receive == NULL ||
! 		walrcv_disconnect == NULL)
  		elog(ERROR, "libpqwalreceiver didn't initialize correctly");
  
  	/*
--- 252,258 ----
  	/* Load the libpq-specific functions */
  	load_file("libpqwalreceiver", false);
  	if (walrcv_connect == NULL || walrcv_receive == NULL ||
! 		walrcv_send == NULL || walrcv_disconnect == NULL)
  		elog(ERROR, "libpqwalreceiver didn't initialize correctly");
  
  	/*
***************
*** 261,267 **** WalReceiverMain(void)
  
  	/* Establish the connection to the primary for XLOG streaming */
  	EnableWalRcvImmediateExit();
! 	walrcv_connect(conninfo, startpoint);
  	DisableWalRcvImmediateExit();
  
  	/* Loop until end-of-streaming or error */
--- 266,272 ----
  
  	/* Establish the connection to the primary for XLOG streaming */
  	EnableWalRcvImmediateExit();
! 	walrcv_connect(conninfo, startpoint, standbyName);
  	DisableWalRcvImmediateExit();
  
  	/* Loop until end-of-streaming or error */
***************
*** 311,316 **** WalReceiverMain(void)
--- 316,340 ----
  			 */
  			XLogWalRcvFlush();
  		}
+ 
+ 		/*
+ 		 * If replication_mode is "replay", send the last WAL replay location
+ 		 * to the primary, to acknowledge that replication has been completed
+ 		 * up to that point. This occurs only when WAL records have been
+ 		 * replayed since the last acknowledgement.
+ 		 */
+ 		if (rplMode == REPLICATION_MODE_REPLAY &&
+ 			XLByteLT(ackedpoint, LogstreamResult.Flush))
+ 		{
+ 			XLogRecPtr	recptr;
+ 
+ 			recptr = GetReplayRecPtr();
+ 			if (XLByteLT(ackedpoint, recptr))
+ 			{
+ 				XLogWalRcvSendRecPtr(recptr);
+ 				ackedpoint = recptr;
+ 			}
+ 		}
  	}
  }
  
***************
*** 406,411 **** XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
--- 430,448 ----
  				buf += sizeof(WalDataMessageHeader);
  				len -= sizeof(WalDataMessageHeader);
  
+ 				/*
+ 				 * If replication_mode is "recv", send the last WAL receive
+ 				 * location to the primary, to acknowledge that replication
+ 				 * has been completed up to that point.
+ 				 */
+ 				if (rplMode == REPLICATION_MODE_RECV)
+ 				{
+ 					XLogRecPtr	endptr = msghdr.dataStart;
+ 
+ 					XLByteAdvance(endptr, len);
+ 					XLogWalRcvSendRecPtr(endptr);
+ 				}
+ 
  				XLogWalRcvWrite(buf, len, msghdr.dataStart);
  				break;
  			}
***************
*** 523,528 **** XLogWalRcvFlush(void)
--- 560,573 ----
  
  		LogstreamResult.Flush = LogstreamResult.Write;
  
+ 		/*
+ 		 * If replication_mode is "fsync", send the last WAL flush
+ 		 * location to the primary, to acknowledge that replication
+ 		 * has been completed up to that point.
+ 		 */
+ 		if (rplMode == REPLICATION_MODE_FSYNC)
+ 			XLogWalRcvSendRecPtr(LogstreamResult.Flush);
+ 
  		/* Update shared-memory status */
  		SpinLockAcquire(&walrcv->mutex);
  		walrcv->latestChunkStart = walrcv->receivedUpto;
***************
*** 541,543 **** XLogWalRcvFlush(void)
--- 586,609 ----
  		}
  	}
  }
+ 
+ /* Send the lsn to the primary server */
+ static void
+ XLogWalRcvSendRecPtr(XLogRecPtr recptr)
+ {
+ 	static char	   *msgbuf = NULL;
+ 	WalAckMessageData	msgdata;
+ 
+ 	/*
+ 	 * Allocate buffer that will be used for each output message if first
+ 	 * time through.  We do this just once to reduce palloc overhead.
+ 	 * The buffer must be made large enough for maximum-sized messages.
+ 	 */
+ 	if (msgbuf == NULL)
+ 		msgbuf = palloc(1 + sizeof(WalAckMessageData));
+ 
+ 	msgbuf[0] = 'l';
+ 	msgdata.ackEnd = recptr;
+ 	memcpy(msgbuf + 1, &msgdata, sizeof(WalAckMessageData));
+ 	walrcv_send(msgbuf, 1 + sizeof(WalAckMessageData));
+ }
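
Note on the acknowledgement message used above: the standby sends a one-byte
type code 'l' followed by the raw WalAckMessageData struct, and the walsender
reads the position back out of the CopyData payload. The following is a
minimal standalone sketch of that framing, not code from the patch. RecPtr,
AckMessage, build_ack and parse_ack are hypothetical stand-ins for
XLogRecPtr, WalAckMessageData and the send/receive paths. Like the patch, it
copies the struct bytes as-is, so it assumes both ends share byte order and
struct layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for XLogRecPtr (two 32-bit halves, as in the patch). */
typedef struct
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} RecPtr;

/* Hypothetical stand-in for WalAckMessageData. */
typedef struct
{
	RecPtr		ackEnd;
} AckMessage;

/* One type byte plus the raw struct. */
#define ACK_MSG_LEN (1 + sizeof(AckMessage))

/* Build an ack message in buf and return its length. */
static size_t
build_ack(char *buf, size_t buflen, RecPtr ptr)
{
	AckMessage	msg;

	assert(buflen >= ACK_MSG_LEN);
	msg.ackEnd = ptr;
	buf[0] = 'l';				/* replication message type */
	memcpy(buf + 1, &msg, sizeof(msg));
	return ACK_MSG_LEN;
}

/* Parse an ack message; returns 0 on success, -1 if it is malformed. */
static int
parse_ack(const char *buf, size_t len, RecPtr *out)
{
	AckMessage	msg;

	if (len != ACK_MSG_LEN || buf[0] != 'l')
		return -1;
	/* memcpy rather than a pointer cast, to avoid alignment assumptions */
	memcpy(&msg, buf + 1, sizeof(msg));
	*out = msg.ackEnd;
	return 0;
}
```

The sketch rejects unknown type bytes with an error return, whereas the
walsender in the patch raises FATAL for them.
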
*** a/src/backend/replication/walreceiverfuncs.c
--- b/src/backend/replication/walreceiverfuncs.c
***************
*** 168,178 **** ShutdownWalRcv(void)
  /*
   * Request postmaster to start walreceiver.
   *
!  * recptr indicates the position where streaming should begin, and conninfo
!  * is a libpq connection string to use.
   */
  void
! RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
--- 168,180 ----
  /*
   * Request postmaster to start walreceiver.
   *
!  * recptr indicates the position where streaming should begin, conninfo
!  * is a libpq connection string to use, and standbyName is the name of
!  * this standby.
   */
  void
! RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo,
! 					 const char *standbyName)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
***************
*** 196,201 **** RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
--- 198,207 ----
  		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
  	else
  		walrcv->conninfo[0] = '\0';
+ 	if (standbyName != NULL)
+ 		strlcpy((char *) walrcv->standbyName, standbyName, MAXSTANDBYNAME);
+ 	else
+ 		walrcv->standbyName[0] = '\0';
  	walrcv->walRcvState = WALRCV_STARTING;
  	walrcv->startTime = now;
  
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 39,44 ****
--- 39,45 ----
  
  #include "access/xlog_internal.h"
  #include "catalog/pg_type.h"
+ #include "libpq/hba.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "libpq/pqsignal.h"
***************
*** 48,53 ****
--- 49,55 ----
  #include "storage/fd.h"
  #include "storage/ipc.h"
  #include "storage/pmsignal.h"
+ #include "storage/proc.h"
  #include "tcop/tcopprot.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 60,67 **** WalSndCtlData *WalSndCtl = NULL;
--- 62,73 ----
  /* My slot in the shared memory array */
  static WalSnd *MyWalSnd = NULL;
  
+ /* Array of WalSndWaiter in shared memory */
+ static WalSndWaiter  *WalSndWaiters;
+ 
  /* Global state */
  bool		am_walsender = false;		/* Am I a walsender process ? */
+ char	   *standby_name = NULL;		/* Name of connected standby */
  
  /* User-settable parameters for walsender */
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
***************
*** 82,92 **** static uint32 sendOff = 0;
--- 88,125 ----
   */
  static XLogRecPtr sentPtr = {0, 0};
  
+ /*
+  * How far have we completed replication already? This is also
+  * advertised in MyWalSnd->ackdPtr. This is not used in the
+  * asynchronous replication case.
+  */
+ static XLogRecPtr ackdPtr = {0, 0};
+ 
  /* Flags set by signal handlers for later service in main loop */
  static volatile sig_atomic_t got_SIGHUP = false;
  static volatile sig_atomic_t shutdown_requested = false;
  static volatile sig_atomic_t ready_to_stop = false;
  
+ /* Flag set by signal handler of backends for replication */
+ static volatile sig_atomic_t replication_done = false;
+ 
+ /*
+  * pre-parsed content of standbys configuration file: list of
+  * StandbysLine structs
+  */
+ static List *parsed_standbys_lines = NIL;
+ 
+ static const char *standbys_keywords[] = {"all", NULL};
+ 
+ /*
+  * Path of standbys configuration file (relative to $PGDATA).
+  *
+  * XXX: Should we support a GUC parameter specifying the path of the
+  * standbys configuration file?
+  */
+ #define STANDBYS_FILENAME	"standbys.conf"
+ static char	*StandbysFileName = NULL;
+ 
  /* Signal handlers */
  static void WalSndSigHupHandler(SIGNAL_ARGS);
  static void WalSndShutdownHandler(SIGNAL_ARGS);
***************
*** 101,107 **** static void WalSndHandshake(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
  static bool XLogSend(char *msgbuf, bool *caughtup);
! static void CheckClosedConnection(void);
  
  
  /* Main entry point for walsender process */
--- 134,149 ----
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
  static bool XLogSend(char *msgbuf, bool *caughtup);
! static void ProcessStreamMsgs(StringInfo inMsg);
! 
! static void RegisterWalSndWaiter(BackendId backendId, XLogRecPtr record,
! 								 Latch *latch);
! static void WakeupWalSndWaiters(XLogRecPtr record);
! static XLogRecPtr GetOldestAckdPtr(void);
! 
! static bool parse_standbys_line(List *line, int line_num, StandbysLine *parsedline);
! static void free_standbys_record(StandbysLine *record);
! static void clean_standbys_list(List *lines);
  
  
  /* Main entry point for walsender process */
***************
*** 218,236 **** WalSndHandshake(void)
  						StringInfoData buf;
  						char		sysid[32];
  						char		tli[11];
  
  						/*
! 						 * Reply with a result set with one row, two columns.
! 						 * First col is system ID, and second is timeline ID
  						 */
  
  						snprintf(sysid, sizeof(sysid), UINT64_FORMAT,
  								 GetSystemIdentifier());
  						snprintf(tli, sizeof(tli), "%u", ThisTimeLineID);
  
  						/* Send a RowDescription message */
  						pq_beginmessage(&buf, 'T');
! 						pq_sendint(&buf, 2, 2); /* 2 fields */
  
  						/* first field */
  						pq_sendstring(&buf, "systemid");		/* col name */
--- 260,281 ----
  						StringInfoData buf;
  						char		sysid[32];
  						char		tli[11];
+ 						char		mode[8];
  
  						/*
! 						 * Reply with a result set with one row, three columns.
! 						 * First col is system ID, second is timeline ID, and
! 						 * third is replication mode.
  						 */
  
  						snprintf(sysid, sizeof(sysid), UINT64_FORMAT,
  								 GetSystemIdentifier());
  						snprintf(tli, sizeof(tli), "%u", ThisTimeLineID);
+ 						snprintf(mode, sizeof(mode), "%s", ReplicationModeNames[rplMode]);
  
  						/* Send a RowDescription message */
  						pq_beginmessage(&buf, 'T');
! 						pq_sendint(&buf, 3, 2); /* 3 fields */
  
  						/* first field */
  						pq_sendstring(&buf, "systemid");		/* col name */
***************
*** 249,263 **** WalSndHandshake(void)
  						pq_sendint(&buf, 4, 2); /* typlen */
  						pq_sendint(&buf, 0, 4); /* typmod */
  						pq_sendint(&buf, 0, 2); /* format code */
  						pq_endmessage(&buf);
  
  						/* Send a DataRow message */
  						pq_beginmessage(&buf, 'D');
! 						pq_sendint(&buf, 2, 2); /* # of columns */
  						pq_sendint(&buf, strlen(sysid), 4);		/* col1 len */
  						pq_sendbytes(&buf, (char *) &sysid, strlen(sysid));
  						pq_sendint(&buf, strlen(tli), 4);		/* col2 len */
  						pq_sendbytes(&buf, (char *) tli, strlen(tli));
  						pq_endmessage(&buf);
  
  						/* Send CommandComplete and ReadyForQuery messages */
--- 294,319 ----
  						pq_sendint(&buf, 4, 2); /* typlen */
  						pq_sendint(&buf, 0, 4); /* typmod */
  						pq_sendint(&buf, 0, 2); /* format code */
+ 
+ 						/* third field */
+ 						pq_sendstring(&buf, "replication_mode");	/* col name */
+ 						pq_sendint(&buf, 0, 4); /* table oid */
+ 						pq_sendint(&buf, 0, 2); /* attnum */
+ 						pq_sendint(&buf, TEXTOID, 4);	/* type oid */
+ 						pq_sendint(&buf, -1, 2);		/* typlen */
+ 						pq_sendint(&buf, 0, 4); /* typmod */
+ 						pq_sendint(&buf, 0, 2); /* format code */
  						pq_endmessage(&buf);
  
  						/* Send a DataRow message */
  						pq_beginmessage(&buf, 'D');
! 						pq_sendint(&buf, 3, 2); /* # of columns */
  						pq_sendint(&buf, strlen(sysid), 4);		/* col1 len */
  						pq_sendbytes(&buf, (char *) &sysid, strlen(sysid));
  						pq_sendint(&buf, strlen(tli), 4);		/* col2 len */
  						pq_sendbytes(&buf, (char *) tli, strlen(tli));
+ 						pq_sendint(&buf, strlen(mode), 4);	/* col3 len */
+ 						pq_sendbytes(&buf, (char *) &mode, strlen(mode));
  						pq_endmessage(&buf);
  
  						/* Send CommandComplete and ReadyForQuery messages */
***************
*** 295,304 **** WalSndHandshake(void)
  						pq_flush();
  
  						/*
! 						 * Initialize position to the received one, then the
  						 * xlog records begin to be shipped from that position
  						 */
! 						sentPtr = recptr;
  
  						/* break out of the loop */
  						replication_started = true;
--- 351,360 ----
  						pq_flush();
  
  						/*
! 						 * Initialize positions to the received one, then the
  						 * xlog records begin to be shipped from that position
  						 */
! 						sentPtr = ackdPtr = recptr;
  
  						/* break out of the loop */
  						replication_started = true;
***************
*** 332,384 **** WalSndHandshake(void)
  }
  
  /*
!  * Check if the remote end has closed the connection.
   */
  static void
! CheckClosedConnection(void)
  {
! 	unsigned char firstchar;
! 	int			r;
  
! 	r = pq_getbyte_if_available(&firstchar);
! 	if (r < 0)
! 	{
! 		/* unexpected error or EOF */
! 		ereport(COMMERROR,
! 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 				 errmsg("unexpected EOF on standby connection")));
! 		proc_exit(0);
! 	}
! 	if (r == 0)
  	{
! 		/* no data available without blocking */
! 		return;
! 	}
  
- 	/* Handle the very limited subset of commands expected in this phase */
- 	switch (firstchar)
- 	{
  			/*
  			 * 'X' means that the standby is closing down the socket.
  			 */
! 		case 'X':
! 			proc_exit(0);
  
! 		default:
! 			ereport(FATAL,
! 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 					 errmsg("invalid standby closing message type %d",
! 							firstchar)));
  	}
  }
  
  /* Main loop of walsender process */
  static int
  WalSndLoop(void)
  {
  	char	   *output_message;
  	bool		caughtup = false;
  
  	/*
  	 * Allocate buffer that will be used for each output message.  We do this
  	 * just once to reduce palloc overhead.  The buffer must be made large
--- 388,512 ----
  }
  
  /*
!  * Process messages received from the standby.
!  *
!  * ereports on error.
   */
  static void
! ProcessStreamMsgs(StringInfo inMsg)
  {
! 	bool	acked = false;
  
! 	/* Loop to process successive complete messages available */
! 	for (;;)
  	{
! 		unsigned char firstchar;
! 		int			r;
! 
! 		r = pq_getbyte_if_available(&firstchar);
! 		if (r < 0)
! 		{
! 			/* unexpected error or EOF */
! 			ereport(COMMERROR,
! 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 					 errmsg("unexpected EOF on standby connection")));
! 			proc_exit(0);
! 		}
! 		if (r == 0)
! 		{
! 			/* no data available without blocking */
! 			break;
! 		}
! 
! 		/* Handle the very limited subset of commands expected in this phase */
! 		switch (firstchar)
! 		{
! 			case 'd':       /* CopyData message */
! 			{
! 				unsigned char	rpltype;
! 
! 				/*
! 				 * Read the message contents. This is expected to be done without
! 				 * blocking because we've been able to get message type code.
! 				 */
! 				if (pq_getmessage(inMsg, 0))
! 					proc_exit(0);		/* suitable message already logged */
! 
! 				/* Read the replication message type from CopyData message */
! 				rpltype = pq_getmsgbyte(inMsg);
! 				switch (rpltype)
! 				{
! 					case 'l':
! 					{
! 						WalAckMessageData  *msgdata;
! 
! 						msgdata = (WalAckMessageData *) pq_getmsgbytes(inMsg, sizeof(WalAckMessageData));
! 
! 						/*
! 						 * Update local status.
! 						 *
! 						 * The ackd ptr received from standby should not
! 						 * go backwards.
! 						 */
! 						if (XLByteLE(ackdPtr, msgdata->ackEnd))
! 							ackdPtr = msgdata->ackEnd;
! 						else
! 							ereport(FATAL,
! 									(errmsg("replication completion location went back from "
! 											"%X/%X to %X/%X",
! 											ackdPtr.xlogid, ackdPtr.xrecoff,
! 											msgdata->ackEnd.xlogid, msgdata->ackEnd.xrecoff)));
! 
! 						acked = true;	/* also need to update shared position */
! 						break;
! 					}
! 					default:
! 						ereport(FATAL,
! 								(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 								 errmsg("invalid replication message type %d",
! 										rpltype)));
! 				}
! 				break;
! 			}
  
  			/*
  			 * 'X' means that the standby is closing down the socket.
  			 */
! 			case 'X':
! 				proc_exit(0);
  
! 			default:
! 				ereport(FATAL,
! 						(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 						 errmsg("invalid standby closing message type %d",
! 								firstchar)));
! 		}
  	}
+ 
+ 	if (acked)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile WalSnd *walsnd = MyWalSnd;
+ 
+ 		SpinLockAcquire(&walsnd->mutex);
+ 		walsnd->ackdPtr = ackdPtr;
+ 		SpinLockRelease(&walsnd->mutex);
+ 	}
+ 
+ 	/* Wake up the backends that this walsender had been blocking */
+ 	WakeupWalSndWaiters(GetOldestAckdPtr());
  }
  
  /* Main loop of walsender process */
  static int
  WalSndLoop(void)
  {
+ 	StringInfoData	input_message;
  	char	   *output_message;
  	bool		caughtup = false;
  
+ 	initStringInfo(&input_message);
+ 
  	/*
  	 * Allocate buffer that will be used for each output message.  We do this
  	 * just once to reduce palloc overhead.  The buffer must be made large
***************
*** 455,462 **** WalSndLoop(void)
  								  WalSndDelay * 1000L);
  			}
  
! 			/* Check if the connection was closed */
! 			CheckClosedConnection();
  		}
  		else
  		{
--- 583,590 ----
  								  WalSndDelay * 1000L);
  			}
  
! 			/* Process messages received from the standby */
! 			ProcessStreamMsgs(&input_message);
  		}
  		else
  		{
***************
*** 515,520 **** InitWalSnd(void)
--- 643,650 ----
  			 */
  			walsnd->pid = MyProcPid;
  			MemSet(&walsnd->sentPtr, 0, sizeof(XLogRecPtr));
+ 			MemSet(&walsnd->ackdPtr, 0, sizeof(XLogRecPtr));
+ 			walsnd->rplMode = rplMode;
  			SpinLockRelease(&walsnd->mutex);
  			/* don't need the lock anymore */
  			OwnLatch((Latch *) &walsnd->latch);
***************
*** 540,545 **** WalSndKill(int code, Datum arg)
--- 670,679 ----
  {
  	Assert(MyWalSnd != NULL);
  
+ 	/* Wake up the backends that this walsender had been blocking */
+ 	MyWalSnd->rplMode = InvalidReplicationMode;
+ 	WakeupWalSndWaiters(GetOldestAckdPtr());
+ 
  	/*
  	 * Mark WalSnd struct no longer in use. Assume that no lock is required
  	 * for this.
***************
*** 904,909 **** WalSndShmemSize(void)
--- 1038,1050 ----
  	size = offsetof(WalSndCtlData, walsnds);
  	size = add_size(size, mul_size(max_wal_senders, sizeof(WalSnd)));
  
+ 	/*
+ 	 * If replication is enabled, we have a data structure called
+ 	 * WalSndWaiters, created in shared memory.
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		size = add_size(size, mul_size(MaxBackends, sizeof(WalSndWaiter)));
+ 
  	return size;
  }
  
***************
*** 913,926 **** WalSndShmemInit(void)
  {
  	bool		found;
  	int			i;
  
  	WalSndCtl = (WalSndCtlData *)
! 		ShmemInitStruct("Wal Sender Ctl", WalSndShmemSize(), &found);
  
  	if (!found)
  	{
  		/* First time through, so initialize */
! 		MemSet(WalSndCtl, 0, WalSndShmemSize());
  
  		for (i = 0; i < max_wal_senders; i++)
  		{
--- 1054,1069 ----
  {
  	bool		found;
  	int			i;
+ 	Size		size = add_size(offsetof(WalSndCtlData, walsnds),
+ 								mul_size(max_wal_senders, sizeof(WalSnd)));
  
  	WalSndCtl = (WalSndCtlData *)
! 		ShmemInitStruct("Wal Sender Ctl", size, &found);
  
  	if (!found)
  	{
  		/* First time through, so initialize */
! 		MemSet(WalSndCtl, 0, size);
  
  		for (i = 0; i < max_wal_senders; i++)
  		{
***************
*** 930,935 **** WalSndShmemInit(void)
--- 1073,1088 ----
  			InitSharedLatch(&walsnd->latch);
  		}
  	}
+ 
+ 	/* Create or attach to the WalSndWaiters array too, if needed */
+ 	if (max_wal_senders > 0)
+ 	{
+ 		WalSndWaiters = (WalSndWaiter *)
+ 			ShmemInitStruct("WalSndWaiters",
+ 							mul_size(MaxBackends, sizeof(WalSndWaiter)),
+ 							&found);
+ 		WalSndCtl->maxWaiters = MaxBackends;
+ 	}
  }
  
  /* Wake up all walsenders */
***************
*** 943,977 **** WalSndWakeup(void)
  }
  
  /*
!  * This isn't currently used for anything. Monitoring tools might be
!  * interested in the future, and we'll need something like this in the
!  * future for synchronous replication.
   */
! #ifdef NOT_USED
  /*
!  * Returns the oldest Send position among walsenders. Or InvalidXLogRecPtr
!  * if none.
   */
! XLogRecPtr
! GetOldestWALSendPointer(void)
  {
  	XLogRecPtr	oldest = {0, 0};
! 	int			i;
! 	bool		found = false;
  
  	for (i = 0; i < max_wal_senders; i++)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! 		XLogRecPtr	recptr;
  
! 		if (walsnd->pid == 0)
  			continue;
  
  		SpinLockAcquire(&walsnd->mutex);
! 		recptr = walsnd->sentPtr;
  		SpinLockRelease(&walsnd->mutex);
  
  		if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  			continue;
  
--- 1096,1259 ----
  }
  
  /*
!  * Ensure that replication has been completed up to the given position.
   */
! void
! WaitXLogSend(XLogRecPtr record)
! {
! 	int		i;
! 
! 	Assert(max_wal_senders > 0);
! 
! 	/*
! 	 * Register myself into the wait list and sleep until replication has
! 	 * been completed up to the given position and the walsender signals me.
! 	 *
! 	 * If replication has been completed up to the latest position before
! 	 * the registration, walsender might be unable to send the signal
! 	 * immediately. We must wake up the walsender after the registration.
! 	 */
! 	ResetLatch(&MyProc->latch);
! 	RegisterWalSndWaiter(MyBackendId, record, &MyProc->latch);
! 	WalSndWakeup();
! 
! 	for (;;)
! 	{
! 		WaitLatch(&MyProc->latch, 1000000L);
! 		if (replication_done)
! 		{
! 			replication_done = false;
! 			return;
! 		}
! 	}
! }
! 
! /*
!  * Register the given backend into the wait list.
!  */
! static void
! RegisterWalSndWaiter(BackendId backendId, XLogRecPtr record, Latch *latch)
! {
! 	/* use volatile pointer to prevent code rearrangement */
! 	volatile WalSndCtlData	*walsndctl = WalSndCtl;
! 	int		i;
! 	int		count = 0;
! 
! 	LWLockAcquire(WalSndWaiterLock, LW_EXCLUSIVE);
! 
! 	/* Out of slots. This should not happen. */
! 	if (walsndctl->numWaiters + 1 > walsndctl->maxWaiters)
! 		elog(PANIC, "out of replication waiters slots");
! 
! 	/*
! 	 * The given position is expected to be relatively new in the
! 	 * wait list. Since the entries in the list are sorted in
! 	 * increasing order of XLogRecPtr, we can shorten the time it
! 	 * takes to find an insert slot by scanning the list backwards.
! 	 */
! 	for (i = walsndctl->numWaiters; i > 0; i--)
! 	{
! 		if (XLByteLE(WalSndWaiters[i - 1].record, record))
! 			break;
! 		count++;
! 	}
! 
! 	/* Shuffle the list if needed */
! 	if (count > 0)
! 		memmove(&WalSndWaiters[i + 1], &WalSndWaiters[i],
! 				count * sizeof(WalSndWaiter));
! 
! 	WalSndWaiters[i].backendId = backendId;
! 	WalSndWaiters[i].record = record;
! 	WalSndWaiters[i].latch = latch;
! 	walsndctl->numWaiters++;
! 
! 	LWLockRelease(WalSndWaiterLock);
! }
! 
  /*
!  * Wake up the backends waiting for replication to complete up to
!  * a position older than or equal to the given one.
!  *
!  * Wake up all waiters if InvalidXLogRecPtr is given.
   */
! static void
! WakeupWalSndWaiters(XLogRecPtr record)
! {
! 	/* use volatile pointer to prevent code rearrangement */
! 	volatile WalSndCtlData	*walsndctl = WalSndCtl;
! 	int		i;
! 	int		count = 0;
! 	bool	all_wakeup = (record.xlogid == 0 && record.xrecoff == 0);
! 
! 	LWLockAcquire(WalSndWaiterLock, LW_EXCLUSIVE);
! 
! 	for (i = 0; i < walsndctl->numWaiters; i++)
! 	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile WalSndWaiter  *waiter = &WalSndWaiters[i];
! 
! 		if (all_wakeup || XLByteLE(waiter->record, record))
! 		{
! 			SetProcLatch(waiter->latch, PROCSIG_REPLICATION_INTERRUPT,
! 						 waiter->backendId);
! 			count++;
! 		}
! 		else
! 		{
! 			/*
! 			 * If a backend waiting for an Ack position newer than the
! 			 * given one is found, we don't need to search the wait
! 			 * list any more. This is because the waiters in the list
! 			 * are guaranteed to be sorted in increasing order of
! 			 * XLogRecPtr.
! 			 */
! 			break;
! 		}
! 	}
! 
! 	/* If there are still some waiters, left-justify them in the list */
! 	walsndctl->numWaiters -= count;
! 	if (walsndctl->numWaiters > 0 && count > 0)
! 		memmove(&WalSndWaiters, &WalSndWaiters[i],
! 				walsndctl->numWaiters * sizeof(WalSndWaiter));
! 
! 	LWLockRelease(WalSndWaiterLock);
! }
! 
! /*
!  * Returns the oldest Ack position among synchronous walsenders, or
!  * InvalidXLogRecPtr if none.
!  */
! static XLogRecPtr
! GetOldestAckdPtr(void)
  {
  	XLogRecPtr	oldest = {0, 0};
! 	int		i;
! 	bool	found = false;
  
  	for (i = 0; i < max_wal_senders; i++)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! 		XLogRecPtr		recptr;
  
! 		/*
! 		 * Ignore the Ack position of an asynchronous walsender,
! 		 * since it never receives any Ack.
! 		 */
! 		if (walsnd->pid == 0 ||
! 			walsnd->rplMode <= REPLICATION_MODE_ASYNC)
  			continue;
  
  		SpinLockAcquire(&walsnd->mutex);
! 		recptr = walsnd->ackdPtr;
  		SpinLockRelease(&walsnd->mutex);
  
+ 		/*
+ 		 * Ignore the Ack position of a walsender that has not
+ 		 * received any Ack yet.
+ 		 */
  		if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  			continue;
  
***************
*** 982,985 **** GetOldestWALSendPointer(void)
  	return oldest;
  }
  
! #endif
--- 1264,1476 ----
  	return oldest;
  }
  
! /*
!  * This is called when PROCSIG_REPLICATION_INTERRUPT is received.
!  */
! void
! HandleReplicationInterrupt(void)
! {
! 	replication_done = true;
! }
! 
! 
! /* ----------
!  * Routines to handle standbys configuration file
!  * ----------
!  */
! 
! /*
!  * Scan the (pre-parsed) standbys configuration file line by line,
!  * looking for a match to the standby name passed from the standby.
!  */
! bool
! check_standbys(void)
! {
! 	ListCell   *line;
! 	StandbysLine *standbys;
! 
! 	foreach(line, parsed_standbys_lines)
! 	{
! 		char	   *tok;
! 
! 		standbys = (StandbysLine *) lfirst(line);
! 
! 		/* Check standby name */
! 		for (tok = strtok(standbys->standbyName, MULTI_VALUE_SEP);
! 			 tok != NULL;
! 			 tok = strtok(NULL, MULTI_VALUE_SEP))
! 		{
! 			if (strcmp(tok, "all\n") == 0 ||
! 				(standby_name != NULL &&
! 				 strcmp(tok, standby_name) == 0))
! 			{
! 				rplMode = standbys->rplMode;
! 				return true;
! 			}
! 		}
! 	}
! 	return false;
! }
! 
! /*
!  * Parse one line in the standbys configuration file and store
!  * the result in a StandbysLine structure.
!  */
! static bool
! parse_standbys_line(List *line, int line_num, StandbysLine *parsedline)
! {
! 	char	   *token;
! 	ListCell   *line_item;
! 
! 	line_item = list_head(line);
! 
! 	parsedline->linenumber = line_num;
! 
! 	/* Get the standby name. */
! 	parsedline->standbyName = pstrdup(lfirst(line_item));
! 
! 	/* Get the mode. */
! 	line_item = lnext(line_item);
! 	if (!line_item)
! 	{
! 		ereport(LOG,
! 				(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 				 errmsg("end-of-line before mode specification"),
! 				 errcontext("line %d of configuration file \"%s\"",
! 							line_num, StandbysFileName)));
! 		return false;
! 	}
! 	token = lfirst(line_item);
! 
! 	parsedline->rplMode = ReplicationModeNameGetValue(token);
! 	if (parsedline->rplMode == InvalidReplicationMode)
! 	{
! 		ereport(LOG,
! 				(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 				 errmsg("invalid replication mode \"%s\"",
! 						token),
! 				 errcontext("line %d of configuration file \"%s\"",
! 							line_num, StandbysFileName)));
! 		return false;
! 	}
! 
! 	/* Ignore remaining tokens */
! 
! 	return true;
! }
! 
! /*
!  * Free a StandbysLine structure
!  */
! static void
! free_standbys_record(StandbysLine *record)
! {
! 	if (record->standbyName)
! 		pfree(record->standbyName);
! 	pfree(record);
! }
! 
! /*
!  * Free all records on the parsed Standbys list
!  */
! static void
! clean_standbys_list(List *lines)
! {
! 	ListCell   *line;
! 
! 	foreach(line, lines)
! 	{
! 		StandbysLine    *parsed = (StandbysLine *) lfirst(line);
! 
! 		if (parsed)
! 			free_standbys_record(parsed);
! 	}
! 	list_free(lines);
! }
! 
! /*
!  * Read the config file and create a List of StandbysLine records for the contents.
!  *
!  * The configuration is read into a temporary list, and if any parse error occurs
!  * the old list is kept in place and false is returned. Only if the whole file
!  * parses OK is the list replaced, and the function returns true.
!  */
! bool
! load_standbys(void)
! {
! 	FILE	   *file;
! 	List	   *standbys_lines = NIL;
! 	List	   *standbys_line_nums = NIL;
! 	ListCell   *line,
! 			   *line_num;
! 	List	   *new_parsed_lines = NIL;
! 	bool		ok = true;
! 
! 	/* Ignore standbys.conf if replication is not enabled */
! 	if (max_wal_senders <= 0)
! 		return true;
! 
! 	/* If first time through, convert relative path to absolute */
! 	if (StandbysFileName == NULL)
! 		StandbysFileName = make_absolute_path(STANDBYS_FILENAME);
! 
! 	file = AllocateFile(StandbysFileName, "r");
! 	if (file == NULL)
! 	{
! 		ereport(LOG,
! 				(errcode_for_file_access(),
! 				 errmsg("could not open configuration file \"%s\": %m",
! 						StandbysFileName)));
! 
! 		/*
! 		 * Caller will take care of making this a FATAL error in case this is
! 		 * the initial startup. If it happens on reload, we just keep the old
! 		 * version around.
! 		 */
! 		return false;
! 	}
! 
! 	tokenize_file(StandbysFileName, file, &standbys_lines, &standbys_line_nums,
! 				  standbys_keywords);
! 	FreeFile(file);
! 
! 	/* Now parse all the lines */
! 	forboth(line, standbys_lines, line_num, standbys_line_nums)
! 	{
! 		StandbysLine    *newline;
! 
! 		newline = palloc0(sizeof(StandbysLine));
! 
! 		if (!parse_standbys_line(lfirst(line), lfirst_int(line_num), newline))
! 		{
! 			/* Parse error in the file, so indicate there's a problem */
! 			free_standbys_record(newline);
! 			ok = false;
! 
! 			/*
! 			 * Keep parsing the rest of the file so we can report errors on
! 			 * more than the first row. Error has already been reported in the
! 			 * parsing function, so no need to log it here.
! 			 */
! 			continue;
! 		}
! 
! 		new_parsed_lines = lappend(new_parsed_lines, newline);
! 	}
! 
! 	/* Free the temporary lists */
! 	free_lines(&standbys_lines, &standbys_line_nums);
! 
! 	if (!ok)
! 	{
! 		/* Parsing failed at one or more rows, so bail out */
! 		clean_standbys_list(new_parsed_lines);
! 		return false;
! 	}
! 
! 	/* Loaded new file successfully, replace the one we use */
! 	clean_standbys_list(parsed_standbys_lines);
! 	parsed_standbys_lines = new_parsed_lines;
! 
! 	return true;
! }
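
Note on the wait list in walsender.c above: RegisterWalSndWaiter keeps the
array sorted by XLogRecPtr and scans from the tail, and WakeupWalSndWaiters
removes the leading entries at or below the acknowledged position. A minimal
standalone sketch of just that array handling follows. It is an illustration
under simplifying assumptions, not the patch's code: RecPtr, Waiter,
MAX_WAITERS, register_waiter and wakeup_waiters are hypothetical names, and
locking, latches and the "InvalidXLogRecPtr wakes everyone" case are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for XLogRecPtr. */
typedef struct
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} RecPtr;

/* Hypothetical, reduced version of WalSndWaiter (no latch). */
typedef struct
{
	int			backend_id;
	RecPtr		record;
} Waiter;

#define MAX_WAITERS 8

static Waiter waiters[MAX_WAITERS];
static int	num_waiters = 0;

/* Returns nonzero if a <= b. */
static int
recptr_le(RecPtr a, RecPtr b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * Insert a waiter, keeping the array sorted in increasing order.
 * New positions are usually the newest, so scan from the tail.
 * Returns the slot used, or -1 if the list is full.
 */
static int
register_waiter(int backend_id, RecPtr record)
{
	int			i;

	if (num_waiters >= MAX_WAITERS)
		return -1;

	for (i = num_waiters; i > 0; i--)
	{
		if (recptr_le(waiters[i - 1].record, record))
			break;
	}

	/* Shift the entries that sort after the new one up by one slot */
	memmove(&waiters[i + 1], &waiters[i],
			(size_t) (num_waiters - i) * sizeof(Waiter));

	waiters[i].backend_id = backend_id;
	waiters[i].record = record;
	num_waiters++;
	return i;
}

/*
 * Remove every waiter whose position is <= acked and left-justify the
 * rest. Returns the number of waiters removed ("woken").
 */
static int
wakeup_waiters(RecPtr acked)
{
	int			woken = 0;

	/* The list is sorted, so stop at the first entry newer than acked */
	while (woken < num_waiters && recptr_le(waiters[woken].record, acked))
		woken++;

	num_waiters -= woken;
	memmove(&waiters[0], &waiters[woken],
			(size_t) num_waiters * sizeof(Waiter));
	return woken;
}
```

Because the array stays sorted, the wakeup pass only touches the entries it
actually removes plus one, which is the property the comments in the patch
rely on.
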
*** a/src/backend/storage/ipc/procsignal.c
--- b/src/backend/storage/ipc/procsignal.c
***************
*** 20,25 ****
--- 20,26 ----
  #include "bootstrap/bootstrap.h"
  #include "commands/async.h"
  #include "miscadmin.h"
+ #include "replication/walsender.h"
  #include "storage/ipc.h"
  #include "storage/latch.h"
  #include "storage/procsignal.h"
***************
*** 172,177 **** CleanupProcSignalState(int status, Datum arg)
--- 173,192 ----
  int
  SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
  {
+ 	if (SetProcSignalReason(pid, reason, backendId))
+ 		return kill(pid, SIGUSR1);		/* Send signal */
+ 
+ 	errno = ESRCH;
+ 	return -1;
+ }
+ 
+ /*
+  * SetProcSignalReason
+  *		Set the reason flag
+  */
+ bool
+ SetProcSignalReason(pid_t pid, ProcSignalReason reason, BackendId backendId)
+ {
  	volatile ProcSignalSlot *slot;
  
  	if (backendId != InvalidBackendId)
***************
*** 190,197 **** SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
  		{
  			/* Atomically set the proper flag */
  			slot->pss_signalFlags[reason] = true;
! 			/* Send signal */
! 			return kill(pid, SIGUSR1);
  		}
  	}
  	else
--- 205,211 ----
  		{
  			/* Atomically set the proper flag */
  			slot->pss_signalFlags[reason] = true;
! 			return true;
  		}
  	}
  	else
***************
*** 214,227 **** SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
  
  				/* Atomically set the proper flag */
  				slot->pss_signalFlags[reason] = true;
! 				/* Send signal */
! 				return kill(pid, SIGUSR1);
  			}
  		}
  	}
! 
! 	errno = ESRCH;
! 	return -1;
  }
  
  /*
--- 228,238 ----
  
  				/* Atomically set the proper flag */
  				slot->pss_signalFlags[reason] = true;
! 				return true;
  			}
  		}
  	}
! 	return false;
  }
  
  /*
***************
*** 279,284 **** procsignal_sigusr1_handler(SIGNAL_ARGS)
--- 290,298 ----
  	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
  		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
  
+ 	if (CheckProcSignal(PROCSIG_REPLICATION_INTERRUPT))
+ 		HandleReplicationInterrupt();
+ 
  	latch_sigusr1_handler();
  
  	errno = save_errno;
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 196,201 **** InitProcGlobal(void)
--- 196,202 ----
  		PGSemaphoreCreate(&(procs[i].sem));
  		procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
  		ProcGlobal->freeProcs = &procs[i];
+ 		InitSharedLatch(&procs[i].latch);
  	}
  
  	/*
***************
*** 214,219 **** InitProcGlobal(void)
--- 215,221 ----
  		PGSemaphoreCreate(&(procs[i].sem));
  		procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
  		ProcGlobal->autovacFreeProcs = &procs[i];
+ 		InitSharedLatch(&procs[i].latch);
  	}
  
  	/*
***************
*** 325,330 **** InitProcess(void)
--- 327,333 ----
  	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
  		SHMQueueInit(&(MyProc->myProcLocks[i]));
  	MyProc->recoveryConflictPending = false;
+ 	OwnLatch(&MyProc->latch);
  
  	/*
  	 * We might be reusing a semaphore that belonged to a failed process. So
***************
*** 688,693 **** ProcKill(int code, Datum arg)
--- 691,697 ----
  	}
  
  	/* PGPROC struct isn't mine anymore */
+ 	DisownLatch(&MyProc->latch);
  	MyProc = NULL;
  
  	/* Update shared estimate of spins_per_delay */
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***************
*** 664,669 **** InitPostgres(const char *in_dbname, Oid dboid, const char *username,
--- 664,690 ----
  					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  					 errmsg("must be superuser to start walsender")));
  
+ 		/*
+ 		 * In EXEC_BACKEND case, we didn't inherit the contents of standbys.conf
+ 		 * etcetera from the postmaster, and have to load them ourselves.  Note we
+ 		 * are loading them into the startup transaction's memory context, not
+ 		 * PostmasterContext, but that shouldn't matter.
+ 		 *
+ 		 * FIXME: [fork/exec] Ugh.  Is there a way around this overhead?
+ 		 */
+ #ifdef EXEC_BACKEND
+ 		if (!load_standbys())
+ 		{
+ 			ereport(FATAL,
+ 					(errmsg("could not load standbys.conf")));
+ 		}
+ #endif
+ 
+ 		if (!check_standbys())
+ 			ereport(FATAL,
+ 					(errmsg("no standbys.conf entry for standby name \"%s\"",
+ 							standby_name)));
+ 
  		/* process any options passed in the startup packet */
  		if (MyProcPort != NULL)
  			process_startup_options(MyProcPort, am_superuser);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 189,194 **** typedef enum
--- 189,229 ----
  
  extern XLogRecPtr XactLastRecEnd;
  
+ /*
+  * Replication mode. This is used to identify how long transaction
+  * commit should wait for replication.
+  *
+  * REPLICATION_MODE_ASYNC doesn't make transaction commit wait for
+  * replication, i.e., asynchronous replication.
+  *
+  * REPLICATION_MODE_RECV makes transaction commit wait for XLOG
+  * records to be received on the standby.
+  *
+  * REPLICATION_MODE_FSYNC makes transaction commit wait for XLOG
+  * records to be received and fsync'd on the standby.
+  *
+  * REPLICATION_MODE_REPLAY makes transaction commit wait for XLOG
+  * records to be received, fsync'd and replayed on the standby.
+  */
+ typedef enum ReplicationMode
+ {
+ 	InvalidReplicationMode = -1,
+ 	REPLICATION_MODE_ASYNC = 0,
+ 	REPLICATION_MODE_RECV,
+ 	REPLICATION_MODE_FSYNC,
+ 	REPLICATION_MODE_REPLAY
+ 
+ 	/*
+ 	 * NOTE: if you add a new mode, change MAXREPLICATIONMODE below
+ 	 * and update the ReplicationModeNames array in xlog.c
+ 	 */
+ } ReplicationMode;
+ 
+ #define MAXREPLICATIONMODE		REPLICATION_MODE_REPLAY
+ 
+ extern const char *ReplicationModeNames[];
+ extern ReplicationMode	rplMode;
+ 
  /* these variables are GUC parameters related to XLOG */
  extern int	CheckPointSegments;
  extern int	wal_keep_segments;
***************
*** 298,307 **** extern void XLogPutNextOid(Oid nextOid);
--- 333,345 ----
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);
+ extern XLogRecPtr GetReplayRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  extern TimeLineID GetRecoveryTargetTLI(void);
  
  extern void HandleStartupProcInterrupts(void);
  extern void StartupProcessMain(void);
  
+ extern ReplicationMode ReplicationModeNameGetValue(char *name);
+ 
  #endif   /* XLOG_H */
*** a/src/include/libpq/hba.h
--- b/src/include/libpq/hba.h
***************
*** 15,20 ****
--- 15,24 ----
  #include "libpq/pqcomm.h"
  
  
+ /* This is used to separate values in multi-valued column strings */
+ #define MULTI_VALUE_SEP "\001"
+ 
+ 
  typedef enum UserAuth
  {
  	uaReject,
***************
*** 89,93 **** extern int check_usermap(const char *usermap_name,
--- 93,100 ----
  			  const char *pg_role, const char *auth_user,
  			  bool case_sensitive);
  extern bool pg_isblank(const char c);
+ extern void tokenize_file(const char *filename, FILE *file,
+ 			  List **lines, List **line_nums, const char **keywords);
+ extern void free_lines(List **lines, List **line_nums);
  
  #endif   /* HBA_H */
*** a/src/include/replication/walprotocol.h
--- b/src/include/replication/walprotocol.h
***************
*** 50,53 **** typedef struct
--- 50,63 ----
   */
  #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
  
+ /*
+  * Body for a WAL acknowledgment message (message type 'l'). This is wrapped
+  * within a CopyData message at the FE/BE protocol level.
+  */
+ typedef struct
+ {
+ 	/* End of WAL replicated to the standby */
+ 	XLogRecPtr	ackEnd;
+ } WalAckMessageData;
+ 
  #endif   /* _WALPROTOCOL_H */
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 26,31 **** extern bool am_walreceiver;
--- 26,38 ----
  #define MAXCONNINFO		1024
  
  /*
+  * MAXSTANDBYNAME: maximum size of standby name.
+  *
+  * XXX: Should this move to pg_config_manual.h?
+  */
+ #define MAXSTANDBYNAME	64
+ 
+ /*
   * Values for WalRcv->walRcvState.
   */
  typedef enum
***************
*** 71,89 **** typedef struct
  	 */
  	char		conninfo[MAXCONNINFO];
  
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
  
  extern WalRcvData *WalRcv;
  
  /* libpqwalreceiver hooks */
! typedef bool (*walrcv_connect_type) (char *conninfo, XLogRecPtr startpoint);
  extern PGDLLIMPORT walrcv_connect_type walrcv_connect;
  
  typedef bool (*walrcv_receive_type) (int timeout, unsigned char *type,
  												 char **buffer, int *len);
  extern PGDLLIMPORT walrcv_receive_type walrcv_receive;
  
  typedef void (*walrcv_disconnect_type) (void);
  extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
  
--- 78,106 ----
  	 */
  	char		conninfo[MAXCONNINFO];
  
+ 	/*
+ 	 * standby name; is used for the master to determine replication mode
+ 	 * from standbys configuration file.
+ 	 */
+ 	char		standbyName[MAXSTANDBYNAME];
+ 
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
  
  extern WalRcvData *WalRcv;
  
  /* libpqwalreceiver hooks */
! typedef bool (*walrcv_connect_type) (char *conninfo, XLogRecPtr startpoint,
! 									 char *standbyName);
  extern PGDLLIMPORT walrcv_connect_type walrcv_connect;
  
  typedef bool (*walrcv_receive_type) (int timeout, unsigned char *type,
  												 char **buffer, int *len);
  extern PGDLLIMPORT walrcv_receive_type walrcv_receive;
  
+ typedef void (*walrcv_send_type) (const char *buffer, int nbytes);
+ extern PGDLLIMPORT walrcv_send_type walrcv_send;
+ 
  typedef void (*walrcv_disconnect_type) (void);
  extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
  
***************
*** 95,101 **** extern Size WalRcvShmemSize(void);
  extern void WalRcvShmemInit(void);
  extern void ShutdownWalRcv(void);
  extern bool WalRcvInProgress(void);
! extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo);
  extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
  
  #endif   /* _WALRECEIVER_H */
--- 112,119 ----
  extern void WalRcvShmemInit(void);
  extern void ShutdownWalRcv(void);
  extern bool WalRcvInProgress(void);
! extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo,
! 								 const char *standbyName);
  extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
  
  #endif   /* _WALRECEIVER_H */
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 23,28 **** typedef struct WalSnd
--- 23,31 ----
  {
  	pid_t		pid;			/* this walsender's process id, or 0 */
  	XLogRecPtr	sentPtr;		/* WAL has been sent up to this point */
+ 	XLogRecPtr	ackdPtr;		/* WAL has been replicated up to this point */
+ 
+ 	ReplicationMode	rplMode;	/* replication mode */
  
  	slock_t		mutex;			/* locks shared variables shown above */
  
***************
*** 36,57 **** typedef struct WalSnd
--- 39,91 ----
  /* There is one WalSndCtl struct for the whole database cluster */
  typedef struct
  {
+ 	/* Protected by WalSndWaiterLock */
+ 	int			numWaiters;	/* current # of WalSndWaiters */
+ 	int			maxWaiters;	/* allocated size of WalSndWaiters */
+ 
  	WalSnd		walsnds[1];		/* VARIABLE LENGTH ARRAY */
  } WalSndCtlData;
  
  extern WalSndCtlData *WalSndCtl;
  
+ /*
+  * Each waiter has a WalSndWaiter struct in shared memory.
+  */
+ typedef struct WalSndWaiter
+ {
+ 	BackendId	backendId;	/* this waiter's backend ID */
+ 	XLogRecPtr	record;		/* this waiter wants for replication to be
+ 							 * acked up to this point */
+ 	Latch	   *latch;		/* pointer to the latch used to wake up this
+ 							 * waiter */
+ } WalSndWaiter;
+ 
  /* global state */
  extern bool am_walsender;
+ extern char *standby_name;
  
  /* user-settable parameters */
  extern int	WalSndDelay;
  extern int	max_wal_senders;
  
+ /* struct definition for standbys configuration file */
+ typedef struct
+ {
+ 	int			linenumber;
+ 	char	   *standbyName;
+ 	ReplicationMode	rplMode;
+ } StandbysLine;
+ 
  extern int	WalSenderMain(void);
  extern void WalSndSignals(void);
  extern Size WalSndShmemSize(void);
  extern void WalSndShmemInit(void);
  extern void WalSndWakeup(void);
+ extern void WaitXLogSend(XLogRecPtr record);
+ 
+ extern void HandleReplicationInterrupt(void);
+ 
+ extern bool check_standbys(void);
+ extern bool load_standbys(void);
  
  #endif   /* _WALSENDER_H */
*** a/src/include/storage/latch.h
--- b/src/include/storage/latch.h
***************
*** 16,21 ****
--- 16,23 ----
  
  #include <signal.h>
  
+ #include "storage/procsignal.h"
+ 
  /*
   * Latch structure should be treated as opaque and only accessed through
   * the public functions. It is defined here to allow embedding Latches as
***************
*** 25,33 **** typedef struct
  {
  	sig_atomic_t	is_set;
  	bool			is_shared;
- #ifndef WIN32
  	int				owner_pid;
! #else
  	HANDLE			event;
  #endif
  } Latch;
--- 27,34 ----
  {
  	sig_atomic_t	is_set;
  	bool			is_shared;
  	int				owner_pid;
! #ifdef WIN32
  	HANDLE			event;
  #endif
  } Latch;
***************
*** 43,48 **** extern bool WaitLatch(volatile Latch *latch, long timeout);
--- 44,51 ----
  extern int	WaitLatchOrSocket(volatile Latch *latch, pgsocket sock,
  				  long timeout);
  extern void SetLatch(volatile Latch *latch);
+ extern void SetProcLatch(volatile Latch *latch,
+ 				  ProcSignalReason reason, BackendId backendId);
  extern void ResetLatch(volatile Latch *latch);
  #define TestLatch(latch) (((volatile Latch *) latch)->is_set)
  
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 70,75 **** typedef enum LWLockId
--- 70,76 ----
  	RelationMappingLock,
  	AsyncCtlLock,
  	AsyncQueueLock,
+ 	WalSndWaiterLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 14,19 ****
--- 14,20 ----
  #ifndef _PROC_H_
  #define _PROC_H_
  
+ #include "storage/latch.h"
  #include "storage/lock.h"
  #include "storage/pg_sema.h"
  #include "utils/timestamp.h"
***************
*** 116,121 **** struct PGPROC
--- 117,128 ----
  								 * lock object by this backend */
  
  	/*
+ 	 * Latch used by walsenders to wake up this backend when replication
+ 	 * has been done.
+ 	 */
+ 	Latch		latch;
+ 
+ 	/*
  	 * All PROCLOCK objects for locks held or awaited by this backend are
  	 * linked into one of these lists, according to the partition number of
  	 * their lock.
*** a/src/include/storage/procsignal.h
--- b/src/include/storage/procsignal.h
***************
*** 40,45 **** typedef enum
--- 40,47 ----
  	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
  	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
  
+ 	PROCSIG_REPLICATION_INTERRUPT,	/* replication interrupt */
+ 
  	NUM_PROCSIGNALS				/* Must be last! */
  } ProcSignalReason;
  
***************
*** 52,57 **** extern void ProcSignalShmemInit(void);
--- 54,61 ----
  extern void ProcSignalInit(int pss_idx);
  extern int SendProcSignal(pid_t pid, ProcSignalReason reason,
  			   BackendId backendId);
+ extern bool SetProcSignalReason(pid_t pid, ProcSignalReason reason,
+ 			   BackendId backendId);
  
  extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
  
*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
***************
*** 254,259 **** static const PQconninfoOption PQconninfoOptions[] = {
--- 254,262 ----
  	{"replication", NULL, NULL, NULL,
  	"Replication", "D", 5},
  
+ 	{"standby_name", NULL, NULL, NULL,
+ 	"Standby-Name", "D", 64},
+ 
  	/* Terminating entry --- MUST BE LAST */
  	{NULL, NULL, NULL, NULL,
  	NULL, NULL, 0}
***************
*** 613,618 **** fillPGconn(PGconn *conn, PQconninfoOption *connOptions)
--- 616,623 ----
  #endif
  	tmp = conninfo_getval(connOptions, "replication");
  	conn->replication = tmp ? strdup(tmp) : NULL;
+ 	tmp = conninfo_getval(connOptions, "standby_name");
+ 	conn->standbyName = tmp ? strdup(tmp) : NULL;
  }
  
  /*
***************
*** 2622,2627 **** freePGconn(PGconn *conn)
--- 2627,2634 ----
  		free(conn->dbName);
  	if (conn->replication)
  		free(conn->replication);
+ 	if (conn->standbyName)
+ 		free(conn->standbyName);
  	if (conn->pguser)
  		free(conn->pguser);
  	if (conn->pgpass)
*** a/src/interfaces/libpq/fe-exec.c
--- b/src/interfaces/libpq/fe-exec.c
***************
*** 2002,2007 **** PQnotifies(PGconn *conn)
--- 2002,2010 ----
  /*
   * PQputCopyData - send some data to the backend during COPY IN
   *
+  * This function can be called by walreceiver even during COPY OUT
+  * to send a message to the master.
+  *
   * Returns 1 if successful, 0 if data could not be sent (only possible
   * in nonblock mode), or -1 if an error occurs.
   */
***************
*** 2010,2016 **** PQputCopyData(PGconn *conn, const char *buffer, int nbytes)
  {
  	if (!conn)
  		return -1;
! 	if (conn->asyncStatus != PGASYNC_COPY_IN)
  	{
  		printfPQExpBuffer(&conn->errorMessage,
  						  libpq_gettext("no COPY in progress\n"));
--- 2013,2020 ----
  {
  	if (!conn)
  		return -1;
! 	if (conn->asyncStatus != PGASYNC_COPY_IN &&
! 		conn->asyncStatus != PGASYNC_COPY_OUT)
  	{
  		printfPQExpBuffer(&conn->errorMessage,
  						  libpq_gettext("no COPY in progress\n"));
*** a/src/interfaces/libpq/fe-protocol3.c
--- b/src/interfaces/libpq/fe-protocol3.c
***************
*** 1911,1916 **** build_startup_packet(const PGconn *conn, char *packet,
--- 1911,1918 ----
  		ADD_STARTUP_OPTION("database", conn->dbName);
  	if (conn->replication && conn->replication[0])
  		ADD_STARTUP_OPTION("replication", conn->replication);
+ 	if (conn->standbyName && conn->standbyName[0])
+ 		ADD_STARTUP_OPTION("standby_name", conn->standbyName);
  	if (conn->pgoptions && conn->pgoptions[0])
  		ADD_STARTUP_OPTION("options", conn->pgoptions);
  	if (conn->send_appname)
*** a/src/interfaces/libpq/libpq-int.h
--- b/src/interfaces/libpq/libpq-int.h
***************
*** 297,302 **** struct pg_conn
--- 297,303 ----
  	char	   *fbappname;		/* fallback application name */
  	char	   *dbName;			/* database name */
  	char	   *replication;	/* connect as the replication standby? */
+ 	char	   *standbyName;	/* standby name */
  	char	   *pguser;			/* Postgres username and password, if any */
  	char	   *pgpass;
  	char	   *keepalives;		/* use TCP keepalives? */
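
Note on the replication-mode lookup declared in the xlog.h hunk above: the sketch below pulls the enum, the name table and the lookup loop out into a standalone C fragment so the name-to-mode mapping can be tried in isolation. It mirrors the patch's identifiers, but it is only an illustration under stated assumptions: the parameter is declared `const char *` here (the patch declares `char *`), and nothing in it touches the server.

```c
#include <string.h>

/* Same values as the enum added to xlog.h by the patch. */
typedef enum ReplicationMode
{
	InvalidReplicationMode = -1,
	REPLICATION_MODE_ASYNC = 0,
	REPLICATION_MODE_RECV,
	REPLICATION_MODE_FSYNC,
	REPLICATION_MODE_REPLAY
} ReplicationMode;

#define MAXREPLICATIONMODE		REPLICATION_MODE_REPLAY

/* Indexed by ReplicationMode; must stay in step with the enum. */
static const char *ReplicationModeNames[] = {
	"async",					/* REPLICATION_MODE_ASYNC */
	"recv",						/* REPLICATION_MODE_RECV */
	"fsync",					/* REPLICATION_MODE_FSYNC */
	"replay"					/* REPLICATION_MODE_REPLAY */
};

/*
 * Look up a replication mode by name.  Returns InvalidReplicationMode
 * when the name matches no entry.  (Assumption: const-qualified
 * argument, unlike the declaration in the patch.)
 */
static ReplicationMode
ReplicationModeNameGetValue(const char *name)
{
	int			mode;

	for (mode = 0; mode <= MAXREPLICATIONMODE; mode++)
		if (strcmp(ReplicationModeNames[mode], name) == 0)
			return (ReplicationMode) mode;
	return InvalidReplicationMode;
}
```

Looking up "recv" yields REPLICATION_MODE_RECV, while an unknown name such as "sync" falls through to InvalidReplicationMode, which a caller parsing standbys.conf would presumably report as a configuration error.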

#56

Fujii Masao

masao.fujii@gmail.com

over 15 years ago

In reply to: Fujii Masao (#55)

1 attachment(s)

Re: Synchronous replication - patch status inquiry

On Wed, Sep 15, 2010 at 6:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Sep 15, 2010 at 6:38 AM, David Fetter <david@fetter.org> wrote:

Now that the latch patch is in, when do you think you'll be able to use it
instead of the poll loop?

Here is the updated version, which uses a latch for communication from
walsender to backend. I've not changed the other paths, because walsender
already uses a latch in HEAD, and Heikki has already proposed a patch that
replaces the poll loop between walreceiver and the startup process with
a latch.
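
To picture the walsender-to-backend wake-up described above, here is a rough standalone sketch. It is not the code in the patch: `RecPtr`, `ToyLatch`, `Waiter`, `rec_le` and `wake_acked_waiters` are hypothetical names, the latch is reduced to a plain flag, and the locking that the patch does with WalSndWaiterLock is left out entirely. It only shows the idea that each waiting backend registers the WAL position it needs acked, and that setting its latch happens once the acked position reaches that point.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr: log file id plus byte offset. */
typedef struct
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} RecPtr;

/* Hypothetical stand-in for a latch: only the "is set" flag. */
typedef struct
{
	volatile int is_set;
} ToyLatch;

/* Loosely modelled on the WalSndWaiter struct in walsender.h. */
typedef struct
{
	int			backendId;		/* waiter's backend ID */
	RecPtr		record;			/* wants replication acked up to here */
	ToyLatch   *latch;			/* set to wake this waiter */
} Waiter;

/* True if position a is at or before position b. */
static bool
rec_le(RecPtr a, RecPtr b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * Set the latch of every waiter whose target position is covered by
 * ackd, drop those waiters from the array, and keep the rest in order.
 * Returns the number of waiters woken.
 */
static int
wake_acked_waiters(Waiter *waiters, int *numWaiters, RecPtr ackd)
{
	int			kept = 0;
	int			woken = 0;
	int			i;

	for (i = 0; i < *numWaiters; i++)
	{
		if (rec_le(waiters[i].record, ackd))
		{
			waiters[i].latch->is_set = 1;
			woken++;
		}
		else
			waiters[kept++] = waiters[i];
	}
	*numWaiters = kept;
	return woken;
}
```

In the real server the waiting backend would sleep on its latch instead of polling, and the flag would be set through the latch API, so treat this purely as a model of the bookkeeping.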

I also rebased the patch against current HEAD, because it conflicted with
the recent latch-related commits.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

synchrep_0915-2.patch (application/octet-stream)

*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
***************
*** 1291,1298 ****
  To initiate streaming replication, the frontend sends the
  <literal>replication</> parameter in the startup message. This tells the
  backend to go into walsender mode, wherein a small set of replication commands
! can be issued instead of SQL statements. Only the simple query protocol can be
! used in walsender mode.
  
  The commands accepted in walsender mode are:
  
--- 1291,1299 ----
  To initiate streaming replication, the frontend sends the
  <literal>replication</> parameter in the startup message. This tells the
  backend to go into walsender mode, wherein a small set of replication commands
! can be issued instead of SQL statements. Also the startup message includes
! <literal>standby_name</> parameter if it's supplied in <filename>recovery.conf</>.
! Only the simple query protocol can be used in walsender mode.
  
  The commands accepted in walsender mode are:
  
***************
*** 1360,1365 **** The commands accepted in walsender mode are:
--- 1361,1401 ----
        <variablelist>
        <varlistentry>
        <term>
+           XLogRecPtr (F)
+       </term>
+       <listitem>
+       <para>
+       <variablelist>
+       <varlistentry>
+       <term>
+           Byte1('l')
+       </term>
+       <listitem>
+       <para>
+           Identifies the message as an acknowledgment of replication.
+       </para>
+       </listitem>
+       </varlistentry>
+       <varlistentry>
+       <term>
+           Byte8
+       </term>
+       <listitem>
+       <para>
+           The end of the WAL data replicated to the standby, given in
+           XLogRecPtr format.
+       </para>
+       </listitem>
+       </varlistentry>
+       </variablelist>
+       </para>
+       </listitem>
+       </varlistentry>
+       </variablelist>
+ 
+       <variablelist>
+       <varlistentry>
+       <term>
            XLogData (B)
        </term>
        <listitem>
*** a/doc/src/sgml/recovery-config.sgml
--- b/doc/src/sgml/recovery-config.sgml
***************
*** 243,248 **** restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
--- 243,259 ----
           </para>
          </listitem>
         </varlistentry>
+        <varlistentry id="standby-name" xreflabel="standby_name">
+         <term><varname>standby_name</varname> (<type>string</type>)</term>
+         <indexterm>
+           <primary><varname>standby_name</> recovery parameter</primary>
+         </indexterm>
+         <listitem>
+          <para>
+           Specifies a name of the standby server.
+          </para>
+         </listitem>
+        </varlistentry>
         <varlistentry id="primary-conninfo" xreflabel="primary_conninfo">
          <term><varname>primary_conninfo</varname> (<type>string</type>)</term>
          <indexterm>
*** a/src/backend/Makefile
--- b/src/backend/Makefile
***************
*** 208,213 **** endif
--- 208,214 ----
  	$(INSTALL_DATA) $(srcdir)/libpq/pg_ident.conf.sample '$(DESTDIR)$(datadir)/pg_ident.conf.sample'
  	$(INSTALL_DATA) $(srcdir)/utils/misc/postgresql.conf.sample '$(DESTDIR)$(datadir)/postgresql.conf.sample'
  	$(INSTALL_DATA) $(srcdir)/access/transam/recovery.conf.sample '$(DESTDIR)$(datadir)/recovery.conf.sample'
+ 	$(INSTALL_DATA) $(srcdir)/replication/standbys.conf.sample '$(DESTDIR)$(datadir)/standbys.conf.sample'
  
  install-bin: postgres $(POSTGRES_IMP) installdirs
  	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postgres$(X)'
***************
*** 262,268 **** endif
  	rm -f '$(DESTDIR)$(datadir)/pg_hba.conf.sample' \
  	      '$(DESTDIR)$(datadir)/pg_ident.conf.sample' \
                '$(DESTDIR)$(datadir)/postgresql.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/recovery.conf.sample'
  
  
  ##########################################################################
--- 263,270 ----
  	rm -f '$(DESTDIR)$(datadir)/pg_hba.conf.sample' \
  	      '$(DESTDIR)$(datadir)/pg_ident.conf.sample' \
                '$(DESTDIR)$(datadir)/postgresql.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/recovery.conf.sample' \
! 	      '$(DESTDIR)$(datadir)/standbys.conf.sample'
  
  
  ##########################################################################
*** a/src/backend/access/transam/recovery.conf.sample
--- b/src/backend/access/transam/recovery.conf.sample
***************
*** 91,102 ****
  #---------------------------------------------------------------------------
  #
  # When standby_mode is enabled, the PostgreSQL server will work as
! # a standby. It tries to connect to the primary according to the
! # connection settings primary_conninfo, and receives XLOG records
! # continuously.
  #
  #standby_mode = 'off'
  #
  #primary_conninfo = ''		# e.g. 'host=localhost port=5432'
  #
  #
--- 91,104 ----
  #---------------------------------------------------------------------------
  #
  # When standby_mode is enabled, the PostgreSQL server will work as
! # a standby under the name of standby_name. It tries to connect to
! # the primary according to the connection settings primary_conninfo,
! # and receives XLOG records continuously.
  #
  #standby_mode = 'off'
  #
+ #standby_name = ''
+ #
  #primary_conninfo = ''		# e.g. 'host=localhost port=5432'
  #
  #
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1070,1075 **** EndPrepare(GlobalTransaction gxact)
--- 1070,1087 ----
  
  	END_CRIT_SECTION();
  
+ 	/*
+ 	 * Wait for WAL to be replicated up to the PREPARE record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the PREPARE record is generated and before other
+ 	 * transactions know that this one has already been prepared.
+ 	 *
+ 	 * XXX: Since the caller prevents cancel/die interrupt, we cannot
+ 	 * process that while waiting. Should we remove this restriction?
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(gxact->prepare_lsn);
+ 
  	records.tail = records.head = NULL;
  }
  
***************
*** 2027,2032 **** RecordTransactionCommitPrepared(TransactionId xid,
--- 2039,2053 ----
  	MyProc->inCommit = false;
  
  	END_CRIT_SECTION();
+ 
+ 	/*
+ 	 * Wait for WAL to be replicated up to the COMMIT PREPARED record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the COMMIT PREPARED record is generated and before other
+ 	 * transactions know that this one has already been committed.
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(recptr);
  }
  
  /*
***************
*** 2106,2109 **** RecordTransactionAbortPrepared(TransactionId xid,
--- 2127,2139 ----
  	TransactionIdAbortTree(xid, nchildren, children);
  
  	END_CRIT_SECTION();
+ 
+ 	/*
+ 	 * Wait for WAL to be replicated up to the ABORT PREPARED record
+ 	 * if replication is enabled. This operation has to be performed
+ 	 * after the ABORT PREPARED record is generated and before other
+ 	 * transactions know that this one has already been aborted.
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(recptr);
  }
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1118,1123 **** RecordTransactionCommit(void)
--- 1118,1135 ----
  	/* Compute latestXid while we have the child XIDs handy */
  	latestXid = TransactionIdLatest(xid, nchildren, children);
  
+ 	/*
+ 	 * Wait for WAL to be replicated up to the COMMIT record if replication
+ 	 * is enabled. This operation has to be performed after the COMMIT record
+ 	 * is generated and before other transactions know that this one has
+ 	 * already been committed.
+ 	 *
+ 	 * XXX: Since the caller prevents cancel/die interrupt, we cannot
+ 	 * process that while waiting. Should we remove this restriction?
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WaitXLogSend(XactLastRecEnd);
+ 
  	/* Reset XactLastRecEnd until the next transaction writes something */
  	XactLastRecEnd.xrecoff = 0;
  
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 190,195 **** static TimestampTz recoveryTargetTime;
--- 190,196 ----
  static bool StandbyMode = false;
  static char *PrimaryConnInfo = NULL;
  static char *TriggerFile = NULL;
+ static char *StandbyName = NULL;
  
  /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
  static TransactionId recoveryStopXid;
***************
*** 540,545 **** typedef struct xl_parameter_change
--- 541,556 ----
  	int			wal_level;
  } xl_parameter_change;
  
+ /* Replication mode names */
+ const char *ReplicationModeNames[] = {
+ 	"async",				/* REPLICATION_MODE_ASYNC */
+ 	"recv",				/* REPLICATION_MODE_RECV */
+ 	"fsync",				/* REPLICATION_MODE_FSYNC */
+ 	"replay"				/* REPLICATION_MODE_REPLAY */
+ };
+ 
+ ReplicationMode		rplMode = InvalidReplicationMode;
+ 
  /*
   * Flags set by interrupt handlers for later service in the redo loop.
   */
***************
*** 5267,5272 **** readRecoveryCommandFile(void)
--- 5278,5290 ----
  					(errmsg("trigger_file = '%s'",
  							TriggerFile)));
  		}
+ 		else if (strcmp(tok1, "standby_name") == 0)
+ 		{
+ 			StandbyName = pstrdup(tok2);
+ 			ereport(DEBUG2,
+ 					(errmsg("standby_name = '%s'",
+ 							StandbyName)));
+ 		}
  		else
  			ereport(FATAL,
  					(errmsg("unrecognized recovery parameter \"%s\"",
***************
*** 6890,6895 **** GetFlushRecPtr(void)
--- 6908,6930 ----
  }
  
  /*
+  * GetReplayRecPtr -- Returns the last replay position.
+  */
+ XLogRecPtr
+ GetReplayRecPtr(void)
+ {
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogRecPtr	recptr;
+ 
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	recptr = xlogctl->recoveryLastRecPtr;
+ 	SpinLockRelease(&xlogctl->info_lck);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Get the time of the last xlog segment switch
   */
  pg_time_t
***************
*** 8851,8865 **** pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
  Datum
  pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
  {
- 	/* use volatile pointer to prevent code rearrangement */
- 	volatile XLogCtlData *xlogctl = XLogCtl;
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
! 	SpinLockAcquire(&xlogctl->info_lck);
! 	recptr = xlogctl->recoveryLastRecPtr;
! 	SpinLockRelease(&xlogctl->info_lck);
! 
  	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  		PG_RETURN_NULL();
  
--- 8886,8895 ----
  Datum
  pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
  {
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
! 	recptr = GetReplayRecPtr();
  	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  		PG_RETURN_NULL();
  
***************
*** 9498,9504 **** retry:
  						{
  							RequestXLogStreaming(
  									  fetching_ckpt ? RedoStartLSN : *RecPtr,
! 												 PrimaryConnInfo);
  							continue;
  						}
  					}
--- 9528,9534 ----
  						{
  							RequestXLogStreaming(
  									  fetching_ckpt ? RedoStartLSN : *RecPtr,
! 												 PrimaryConnInfo, StandbyName);
  							continue;
  						}
  					}
***************
*** 9722,9724 **** WakeupRecovery(void)
--- 9752,9768 ----
  {
  	SetLatch(&XLogCtl->recoveryWakeupLatch);
  }
+ 
+ /*
+  * Look up replication mode value by name.
+  */
+ ReplicationMode
+ ReplicationModeNameGetValue(char *name)
+ {
+ 	ReplicationMode	mode;
+ 
+ 	for (mode = 0; mode <= MAXREPLICATIONMODE; mode++)
+ 		if (strcmp(ReplicationModeNames[mode], name) == 0)
+ 			return mode;
+ 	return InvalidReplicationMode;
+ }
*** a/src/backend/libpq/hba.c
--- b/src/backend/libpq/hba.c
***************
*** 38,46 ****
  #define atooid(x)  ((Oid) strtoul((x), NULL, 10))
  #define atoxid(x)  ((TransactionId) strtoul((x), NULL, 10))
  
- /* This is used to separate values in multi-valued column strings */
- #define MULTI_VALUE_SEP "\001"
- 
  #define MAX_TOKEN	256
  
  /* callback data for check_network_callback */
--- 38,43 ----
***************
*** 54,59 **** typedef struct check_network_data
--- 51,59 ----
  /* pre-parsed content of HBA config file: list of HbaLine structs */
  static List *parsed_hba_lines = NIL;
  
+ static const char *hba_keywords[] = {"all", "sameuser", "samegroup", "samerole",
+ 									 "replication", NULL};
+ 
  /*
   * These variables hold the pre-parsed contents of the ident usermap
   * configuration file.	ident_lines is a list of sublists, one sublist for
***************
*** 67,76 **** static List *ident_lines = NIL;
  static List *ident_line_nums = NIL;
  
  
- static void tokenize_file(const char *filename, FILE *file,
- 			  List **lines, List **line_nums);
  static char *tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename);
  
  /*
   * isblank() exists in the ISO C99 spec, but it's not very portable yet,
--- 67,74 ----
  static List *ident_line_nums = NIL;
  
  
  static char *tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename, const char **keywords);
  
  /*
   * isblank() exists in the ISO C99 spec, but it's not very portable yet,
***************
*** 108,114 **** pg_isblank(const char c)
   * token.
   */
  static bool
! next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  {
  	int			c;
  	char	   *start_buf = buf;
--- 106,113 ----
   * token.
   */
  static bool
! next_token(const char *filename, FILE *fp, char *buf, int bufsz,
! 		   bool *initial_quote, const char **keywords)
  {
  	int			c;
  	char	   *start_buf = buf;
***************
*** 155,162 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  			*buf = '\0';
  			ereport(LOG,
  					(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 			   errmsg("authentication file token too long, skipping: \"%s\"",
! 					  start_buf)));
  			/* Discard remainder of line */
  			while ((c = getc(fp)) != EOF && c != '\n')
  				;
--- 154,161 ----
  			*buf = '\0';
  			ereport(LOG,
  					(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 			   errmsg("configuration file \"%s\" token too long, skipping: \"%s\"",
! 					  filename, start_buf)));
  			/* Discard remainder of line */
  			while ((c = getc(fp)) != EOF && c != '\n')
  				;
***************
*** 196,211 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
  
  	*buf = '\0';
  
! 	if (!saw_quote &&
! 		(strcmp(start_buf, "all") == 0 ||
! 		 strcmp(start_buf, "sameuser") == 0 ||
! 		 strcmp(start_buf, "samegroup") == 0 ||
! 		 strcmp(start_buf, "samerole") == 0 ||
! 		 strcmp(start_buf, "replication") == 0))
  	{
! 		/* append newline to a magical keyword */
! 		*buf++ = '\n';
! 		*buf = '\0';
  	}
  
  	return (saw_quote || buf > start_buf);
--- 195,214 ----
  
  	*buf = '\0';
  
! 	if (!saw_quote)
  	{
! 		const char	**entry;
! 
! 		for (entry = keywords; *entry != NULL; entry++)
! 		{
! 			if (strcmp(start_buf, *entry) == 0)
! 			{
! 				/* append newline to a magical keyword */
! 				*buf++ = '\n';
! 				*buf = '\0';
! 				break;
! 			}
! 		}
  	}
  
  	return (saw_quote || buf > start_buf);
***************
*** 219,225 **** next_token(FILE *fp, char *buf, int bufsz, bool *initial_quote)
   * The result is a palloc'd string, or NULL if we have reached EOL.
   */
  static char *
! next_token_expand(const char *filename, FILE *file)
  {
  	char		buf[MAX_TOKEN];
  	char	   *comma_str = pstrdup("");
--- 222,228 ----
   * The result is a palloc'd string, or NULL if we have reached EOL.
   */
  static char *
! next_token_expand(const char *filename, FILE *file, const char **keywords)
  {
  	char		buf[MAX_TOKEN];
  	char	   *comma_str = pstrdup("");
***************
*** 231,237 **** next_token_expand(const char *filename, FILE *file)
  
  	do
  	{
! 		if (!next_token(file, buf, sizeof(buf), &initial_quote))
  			break;
  
  		got_something = true;
--- 234,241 ----
  
  	do
  	{
! 		if (!next_token(filename, file, buf, sizeof(buf), &initial_quote,
! 						keywords))
  			break;
  
  		got_something = true;
***************
*** 246,252 **** next_token_expand(const char *filename, FILE *file)
  
  		/* Is this referencing a file? */
  		if (!initial_quote && buf[0] == '@' && buf[1] != '\0')
! 			incbuf = tokenize_inc_file(filename, buf + 1);
  		else
  			incbuf = pstrdup(buf);
  
--- 250,256 ----
  
  		/* Is this referencing a file? */
  		if (!initial_quote && buf[0] == '@' && buf[1] != '\0')
! 			incbuf = tokenize_inc_file(filename, buf + 1, keywords);
  		else
  			incbuf = pstrdup(buf);
  
***************
*** 273,279 **** next_token_expand(const char *filename, FILE *file)
  /*
   * Free memory used by lines/tokens (i.e., structure built by tokenize_file)
   */
! static void
  free_lines(List **lines, List **line_nums)
  {
  	/*
--- 277,283 ----
  /*
   * Free memory used by lines/tokens (i.e., structure built by tokenize_file)
   */
! void
  free_lines(List **lines, List **line_nums)
  {
  	/*
***************
*** 318,324 **** free_lines(List **lines, List **line_nums)
  
  static char *
  tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename)
  {
  	char	   *inc_fullname;
  	FILE	   *inc_file;
--- 322,328 ----
  
  static char *
  tokenize_inc_file(const char *outer_filename,
! 				  const char *inc_filename, const char **keywords)
  {
  	char	   *inc_fullname;
  	FILE	   *inc_file;
***************
*** 348,354 **** tokenize_inc_file(const char *outer_filename,
  	{
  		ereport(LOG,
  				(errcode_for_file_access(),
! 				 errmsg("could not open secondary authentication file \"@%s\" as \"%s\": %m",
  						inc_filename, inc_fullname)));
  		pfree(inc_fullname);
  
--- 352,358 ----
  	{
  		ereport(LOG,
  				(errcode_for_file_access(),
! 				 errmsg("could not open secondary configuration file \"@%s\" as \"%s\": %m",
  						inc_filename, inc_fullname)));
  		pfree(inc_fullname);
  
***************
*** 357,363 **** tokenize_inc_file(const char *outer_filename,
  	}
  
  	/* There is possible recursion here if the file contains @ */
! 	tokenize_file(inc_fullname, inc_file, &inc_lines, &inc_line_nums);
  
  	FreeFile(inc_file);
  	pfree(inc_fullname);
--- 361,368 ----
  	}
  
  	/* There is possible recursion here if the file contains @ */
! 	tokenize_file(inc_fullname, inc_file, &inc_lines, &inc_line_nums,
! 				  keywords);
  
  	FreeFile(inc_file);
  	pfree(inc_fullname);
***************
*** 404,412 **** tokenize_inc_file(const char *outer_filename,
   *
   * filename must be the absolute path to the target file.
   */
! static void
  tokenize_file(const char *filename, FILE *file,
! 			  List **lines, List **line_nums)
  {
  	List	   *current_line = NIL;
  	int			line_number = 1;
--- 409,417 ----
   *
   * filename must be the absolute path to the target file.
   */
! void
  tokenize_file(const char *filename, FILE *file,
! 			  List **lines, List **line_nums, const char **keywords)
  {
  	List	   *current_line = NIL;
  	int			line_number = 1;
***************
*** 416,422 **** tokenize_file(const char *filename, FILE *file,
  
  	while (!feof(file) && !ferror(file))
  	{
! 		buf = next_token_expand(filename, file);
  
  		/* add token to list, unless we are at EOL or comment start */
  		if (buf)
--- 421,427 ----
  
  	while (!feof(file) && !ferror(file))
  	{
! 		buf = next_token_expand(filename, file, keywords);
  
  		/* add token to list, unless we are at EOL or comment start */
  		if (buf)
***************
*** 1490,1496 **** load_hba(void)
  		return false;
  	}
  
! 	tokenize_file(HbaFileName, file, &hba_lines, &hba_line_nums);
  	FreeFile(file);
  
  	/* Now parse all the lines */
--- 1495,1501 ----
  		return false;
  	}
  
! 	tokenize_file(HbaFileName, file, &hba_lines, &hba_line_nums, hba_keywords);
  	FreeFile(file);
  
  	/* Now parse all the lines */
***************
*** 1809,1815 **** load_ident(void)
  	}
  	else
  	{
! 		tokenize_file(IdentFileName, file, &ident_lines, &ident_line_nums);
  		FreeFile(file);
  	}
  }
--- 1814,1821 ----
  	}
  	else
  	{
! 		tokenize_file(IdentFileName, file, &ident_lines, &ident_line_nums,
! 					  hba_keywords);
  		FreeFile(file);
  	}
  }
*** a/src/backend/port/unix_latch.c
--- b/src/backend/port/unix_latch.c
***************
*** 312,317 **** SetLatch(volatile Latch *latch)
--- 312,327 ----
  }
  
  /*
+  * Signal the given reason, in addition to SetLatch.
+  */
+ void
+ SetProcLatch(volatile Latch *latch, ProcSignalReason reason, BackendId backendId)
+ {
+ 	SetProcSignalReason(latch->owner_pid, reason, backendId);
+ 	SetLatch(latch);
+ }
+ 
+ /*
   * Clear the latch. Calling WaitLatch after this will sleep, unless
   * the latch is set again before the WaitLatch call.
   */
*** a/src/backend/port/win32_latch.c
--- b/src/backend/port/win32_latch.c
***************
*** 98,103 **** WaitLatchOrSocket(volatile Latch *latch, SOCKET sock, long timeout)
--- 98,106 ----
  	int			numevents;
  	int			result = 0;
  
+ 	if (latch->owner_pid != MyProcPid)
+ 		elog(ERROR, "cannot wait on a latch owned by another process");
+ 
  	latchevent = latch->event;
  
  	events[0] = latchevent;
***************
*** 191,198 **** SetLatch(volatile Latch *latch)
--- 194,214 ----
  	}
  }
  
+ /*
+  * Signal the given reason, in addition to SetLatch.
+  */
+ void
+ SetProcLatch(volatile Latch *latch, ProcSignalReason reason, BackendId backendId)
+ {
+ 	SetProcSignalReason(latch->owner_pid, reason, backendId);
+ 	SetLatch(latch);
+ }
+ 
  void
  ResetLatch(volatile Latch *latch)
  {
+ 	/* Only the owner should reset the latch */
+ 	Assert(latch->owner_pid == MyProcPid);
+ 
  	latch->is_set = false;
  }
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 1063,1069 **** PostmasterMain(int argc, char *argv[])
  	autovac_init();
  
  	/*
! 	 * Load configuration files for client authentication.
  	 */
  	if (!load_hba())
  	{
--- 1063,1069 ----
  	autovac_init();
  
  	/*
! 	 * Load configuration files for client authentication and replication.
  	 */
  	if (!load_hba())
  	{
***************
*** 1075,1080 **** PostmasterMain(int argc, char *argv[])
--- 1075,1085 ----
  				(errmsg("could not load pg_hba.conf")));
  	}
  	load_ident();
+ 	if (max_wal_senders > 0 && !load_standbys())
+ 	{
+ 		ereport(FATAL,
+ 				(errmsg("could not load standbys.conf")));
+ 	}
  
  	/*
  	 * Remember postmaster startup time
***************
*** 1713,1718 **** retry1:
--- 1718,1725 ----
  							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  							 errmsg("invalid value for boolean option \"replication\"")));
  			}
+ 			else if (strcmp(nameptr, "standby_name") == 0)
+ 			    standby_name = pstrdup(valptr);
  			else
  			{
  				/* Assume it's a generic GUC option */
***************
*** 2129,2134 **** SIGHUP_handler(SIGNAL_ARGS)
--- 2136,2146 ----
  
  		load_ident();
  
+ 		/* Reload standbys configuration file too */
+ 		if (max_wal_senders > 0 && !load_standbys())
+ 			ereport(WARNING,
+ 					(errmsg("standbys.conf not reloaded")));
+ 
  #ifdef EXEC_BACKEND
  		/* Update the starting-point file for future children */
  		write_nondefault_variables(PGC_SIGHUP);
*** a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
--- b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
***************
*** 47,55 **** static bool justconnected = false;
  static char *recvBuf = NULL;
  
  /* Prototypes for interface functions */
! static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
  				 char **buffer, int *len);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
--- 47,57 ----
  static char *recvBuf = NULL;
  
  /* Prototypes for interface functions */
! static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint,
! 							 char *standbyName);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
  				 char **buffer, int *len);
+ static void libpqrcv_send(const char *buffer, int nbytes);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
***************
*** 64,73 **** _PG_init(void)
  {
  	/* Tell walreceiver how to reach us */
  	if (walrcv_connect != NULL || walrcv_receive != NULL ||
! 		walrcv_disconnect != NULL)
  		elog(ERROR, "libpqwalreceiver already loaded");
  	walrcv_connect = libpqrcv_connect;
  	walrcv_receive = libpqrcv_receive;
  	walrcv_disconnect = libpqrcv_disconnect;
  }
  
--- 66,76 ----
  {
  	/* Tell walreceiver how to reach us */
  	if (walrcv_connect != NULL || walrcv_receive != NULL ||
! 		walrcv_send != NULL || walrcv_disconnect != NULL)
  		elog(ERROR, "libpqwalreceiver already loaded");
  	walrcv_connect = libpqrcv_connect;
  	walrcv_receive = libpqrcv_receive;
+ 	walrcv_send = libpqrcv_send;
  	walrcv_disconnect = libpqrcv_disconnect;
  }
  
***************
*** 75,98 **** _PG_init(void)
   * Establish the connection to the primary server for XLOG streaming
   */
  static bool
! libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  {
! 	char		conninfo_repl[MAXCONNINFO + 37];
  	char	   *primary_sysid;
  	char		standby_sysid[32];
  	TimeLineID	primary_tli;
  	TimeLineID	standby_tli;
  	PGresult   *res;
  	char		cmd[64];
  
  	/*
! 	 * Connect using deliberately undocumented parameter: replication. The
! 	 * database name is ignored by the server in replication mode, but specify
! 	 * "replication" for .pgpass lookup.
  	 */
! 	snprintf(conninfo_repl, sizeof(conninfo_repl),
! 			 "%s dbname=replication replication=true",
! 			 conninfo);
  
  	streamConn = PQconnectdb(conninfo_repl);
  	if (PQstatus(streamConn) != CONNECTION_OK)
--- 78,107 ----
   * Establish the connection to the primary server for XLOG streaming
   */
  static bool
! libpqrcv_connect(char *conninfo, XLogRecPtr startpoint, char *standbyName)
  {
! 	char		conninfo_repl[MAXCONNINFO + MAXSTANDBYNAME + 37];
  	char	   *primary_sysid;
  	char		standby_sysid[32];
  	TimeLineID	primary_tli;
  	TimeLineID	standby_tli;
+ 	char	   *primary_rplMode;
  	PGresult   *res;
  	char		cmd[64];
  
  	/*
! 	 * Connect using deliberately undocumented parameters: replication
! 	 * and standby_name. The database name is ignored by the server in
! 	 * replication mode, but specify "replication" for .pgpass lookup.
  	 */
! 	if (standbyName && standbyName[0] != '\0')
! 		snprintf(conninfo_repl, sizeof(conninfo_repl),
! 				 "%s dbname=replication replication=true standby_name='%s'",
! 				 conninfo, standbyName);
! 	else
! 		snprintf(conninfo_repl, sizeof(conninfo_repl),
! 				 "%s dbname=replication replication=true",
! 				 conninfo);
  
  	streamConn = PQconnectdb(conninfo_repl);
  	if (PQstatus(streamConn) != CONNECTION_OK)
***************
*** 109,119 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  	{
  		PQclear(res);
  		ereport(ERROR,
! 				(errmsg("could not receive database system identifier and timeline ID from "
! 						"the primary server: %s",
  						PQerrorMessage(streamConn))));
  	}
! 	if (PQnfields(res) != 2 || PQntuples(res) != 1)
  	{
  		int			ntuples = PQntuples(res);
  		int			nfields = PQnfields(res);
--- 118,128 ----
  	{
  		PQclear(res);
  		ereport(ERROR,
! 				(errmsg("could not receive database system identifier, timeline ID and "
! 						"replication mode from the primary server: %s",
  						PQerrorMessage(streamConn))));
  	}
! 	if (PQnfields(res) != 3 || PQntuples(res) != 1)
  	{
  		int			ntuples = PQntuples(res);
  		int			nfields = PQnfields(res);
***************
*** 121,131 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  		PQclear(res);
  		ereport(ERROR,
  				(errmsg("invalid response from primary server"),
! 				 errdetail("Expected 1 tuple with 2 fields, got %d tuples with %d fields.",
  						   ntuples, nfields)));
  	}
  	primary_sysid = PQgetvalue(res, 0, 0);
  	primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);
  
  	/*
  	 * Confirm that the system identifier of the primary is the same as ours.
--- 130,141 ----
  		PQclear(res);
  		ereport(ERROR,
  				(errmsg("invalid response from primary server"),
! 				 errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
  						   ntuples, nfields)));
  	}
  	primary_sysid = PQgetvalue(res, 0, 0);
  	primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0);
+ 	primary_rplMode = PQgetvalue(res, 0, 2);
  
  	/*
  	 * Confirm that the system identifier of the primary is the same as ours.
***************
*** 146,158 **** libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
  	 * recovery target timeline.
  	 */
  	standby_tli = GetRecoveryTargetTLI();
- 	PQclear(res);
  	if (primary_tli != standby_tli)
  		ereport(ERROR,
  				(errmsg("timeline %u of the primary does not match recovery target timeline %u",
  						primary_tli, standby_tli)));
  	ThisTimeLineID = primary_tli;
  
  	/* Start streaming from the point requested by startup process */
  	snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
--- 156,180 ----
  	 * recovery target timeline.
  	 */
  	standby_tli = GetRecoveryTargetTLI();
  	if (primary_tli != standby_tli)
+ 	{
+ 		PQclear(res);
  		ereport(ERROR,
  				(errmsg("timeline %u of the primary does not match recovery target timeline %u",
  						primary_tli, standby_tli)));
+ 	}
  	ThisTimeLineID = primary_tli;
  
+ 	/*
+ 	 * Confirm that the passed replication mode is valid.
+ 	 */
+ 	rplMode = ReplicationModeNameGetValue(primary_rplMode);
+ 	PQclear(res);
+ 	if (rplMode == InvalidReplicationMode)
+ 		ereport(ERROR,
+ 				(errmsg("invalid replication mode \"%s\"",
+ 						primary_rplMode)));
+ 
  	/* Start streaming from the point requested by startup process */
  	snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
***************
*** 398,400 **** libpqrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
--- 420,437 ----
  
  	return true;
  }
+ 
+ /*
+  * Send a message to XLOG stream.
+  *
+  * ereports on error.
+  */
+ static void
+ libpqrcv_send(const char *buffer, int nbytes)
+ {
+ 	if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
+ 		PQflush(streamConn))
+ 		ereport(ERROR,
+ 				(errmsg("could not send data to WAL stream: %s",
+ 						PQerrorMessage(streamConn))));
+ }
*** /dev/null
--- b/src/backend/replication/standbys.conf.sample
***************
*** 0 ****
--- 1,35 ----
+ # PostgreSQL Standbys Configuration File
+ # ===================================================
+ #
+ # Refer to the "Streaming Replication" section in the PostgreSQL
+ # documentation for a complete description of this file.  A short
+ # synopsis follows.
+ #
+ # This file controls which replication mode each standby uses.
+ # Records are of the form:
+ #
+ # STANDBY-NAME  REPLICATION-MODE
+ #
+ # (The uppercase items must be replaced by actual values.)
+ #
+ # STANDBY-NAME can be "all", a standby name, or a comma-separated list
+ # thereof.
+ #
+ # REPLICATION-MODE specifies how long transaction commit waits for
+ # replication before the commit command returns a "success" to a
+ # client. The valid modes are "async", "recv", "fsync" and "replay".
+ #
+ # Standby names containing spaces, commas, quotes and other special
+ # characters must be quoted.  Quoting the keyword "all" makes the
+ # name lose its special meaning, so that it just matches a standby
+ # with that name.
+ #
+ # This file is read on server startup and when the postmaster receives
+ # a SIGHUP signal.  If you edit the file on a running system, you have
+ # to SIGHUP the postmaster for the changes to take effect.  You can
+ # use "pg_ctl reload" to do that.
+ 
+ # Put your actual configuration here
+ # ----------------------------------
+ 
+ # STANDBY-NAME       REPLICATION-MODE
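
For readers following along, here is a minimal sketch of how a record from the sample file above could be resolved to a replication mode. It is not the patch's parser. The Sketch* types and sketch_* functions are hypothetical, and two behaviours are assumptions borrowed from pg_hba.conf rather than anything the patch states: the first matching record wins, and the caller supplies the mode to use when no record matches. Only the four mode names and the quoted-"all" rule come from the sample file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's replication-mode values. */
typedef enum
{
    MODE_INVALID = -1,
    MODE_ASYNC,
    MODE_RECV,
    MODE_FSYNC,
    MODE_REPLAY
} SketchMode;

/* Mode names as listed in standbys.conf.sample, indexed by SketchMode. */
static const char *const sketch_mode_names[] = {
    "async", "recv", "fsync", "replay"
};

/* Map a mode name to its value; returns MODE_INVALID for unknown names. */
static SketchMode
sketch_mode_from_name(const char *name)
{
    size_t      i;

    assert(name != NULL);
    for (i = 0; i < sizeof(sketch_mode_names) / sizeof(sketch_mode_names[0]); i++)
    {
        if (strcmp(name, sketch_mode_names[i]) == 0)
            return (SketchMode) i;
    }
    return MODE_INVALID;
}

/* One already-tokenized record: STANDBY-NAME  REPLICATION-MODE */
typedef struct
{
    const char *standby_name;
    int         quoted;         /* nonzero if the name was quoted in the file */
    const char *mode_name;
} SketchRecord;

/*
 * Return the mode of the first record matching the given standby.
 * An unquoted "all" matches every standby; a quoted "all" only matches
 * a standby literally named "all".  Returns fallback if nothing matches
 * (assumption: the patch may handle this case differently).
 */
static SketchMode
sketch_mode_for_standby(const SketchRecord *records, size_t nrecords,
                        const char *standby, SketchMode fallback)
{
    size_t      i;

    for (i = 0; i < nrecords; i++)
    {
        int         is_all = !records[i].quoted &&
                             strcmp(records[i].standby_name, "all") == 0;

        if (is_all || strcmp(records[i].standby_name, standby) == 0)
            return sketch_mode_from_name(records[i].mode_name);
    }
    return fallback;
}
```

With the records "standby1 fsync" followed by "all async", a standby named standby1 would resolve to fsync and every other standby to async under these assumptions.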
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 57,62 **** bool		am_walreceiver;
--- 57,63 ----
  /* libpqreceiver hooks to these when loaded */
  walrcv_connect_type walrcv_connect = NULL;
  walrcv_receive_type walrcv_receive = NULL;
+ walrcv_send_type walrcv_send = NULL;
  walrcv_disconnect_type walrcv_disconnect = NULL;
  
  #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
***************
*** 113,118 **** static void WalRcvDie(int code, Datum arg);
--- 114,120 ----
  static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
  static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
  static void XLogWalRcvFlush(void);
+ static void XLogWalRcvSendRecPtr(XLogRecPtr recptr);
  
  /* Signal handlers */
  static void WalRcvSigHupHandler(SIGNAL_ARGS);
***************
*** 158,164 **** void
--- 160,168 ----
  WalReceiverMain(void)
  {
  	char		conninfo[MAXCONNINFO];
+ 	char		standbyName[MAXSTANDBYNAME];
  	XLogRecPtr	startpoint;
+ 	XLogRecPtr	ackedpoint = {0, 0};
  
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
***************
*** 206,211 **** WalReceiverMain(void)
--- 210,216 ----
  
  	/* Fetch information required to start streaming */
  	strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
+ 	strlcpy(standbyName, (char *) walrcv->standbyName, MAXSTANDBYNAME);
  	startpoint = walrcv->receivedUpto;
  	SpinLockRelease(&walrcv->mutex);
  
***************
*** 247,253 **** WalReceiverMain(void)
  	/* Load the libpq-specific functions */
  	load_file("libpqwalreceiver", false);
  	if (walrcv_connect == NULL || walrcv_receive == NULL ||
! 		walrcv_disconnect == NULL)
  		elog(ERROR, "libpqwalreceiver didn't initialize correctly");
  
  	/*
--- 252,258 ----
  	/* Load the libpq-specific functions */
  	load_file("libpqwalreceiver", false);
  	if (walrcv_connect == NULL || walrcv_receive == NULL ||
! 		walrcv_send == NULL || walrcv_disconnect == NULL)
  		elog(ERROR, "libpqwalreceiver didn't initialize correctly");
  
  	/*
***************
*** 261,267 **** WalReceiverMain(void)
  
  	/* Establish the connection to the primary for XLOG streaming */
  	EnableWalRcvImmediateExit();
! 	walrcv_connect(conninfo, startpoint);
  	DisableWalRcvImmediateExit();
  
  	/* Loop until end-of-streaming or error */
--- 266,272 ----
  
  	/* Establish the connection to the primary for XLOG streaming */
  	EnableWalRcvImmediateExit();
! 	walrcv_connect(conninfo, startpoint, standbyName);
  	DisableWalRcvImmediateExit();
  
  	/* Loop until end-of-streaming or error */
***************
*** 311,316 **** WalReceiverMain(void)
--- 316,340 ----
  			 */
  			XLogWalRcvFlush();
  		}
+ 
+ 		/*
+ 		 * If replication_mode is "replay", send the last WAL replay location
+ 		 * to the primary, to acknowledge that replication has been completed
+ 		 * up to that. This occurs only when WAL records were replayed since
+ 		 * the last acknowledgement.
+ 		 */
+ 		if (rplMode == REPLICATION_MODE_REPLAY &&
+ 			XLByteLT(ackedpoint, LogstreamResult.Flush))
+ 		{
+ 			XLogRecPtr	recptr;
+ 
+ 			recptr = GetReplayRecPtr();
+ 			if (XLByteLT(ackedpoint, recptr))
+ 			{
+ 				XLogWalRcvSendRecPtr(recptr);
+ 				ackedpoint = recptr;
+ 			}
+ 		}
  	}
  }
  
***************
*** 406,411 **** XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
--- 430,448 ----
  				buf += sizeof(WalDataMessageHeader);
  				len -= sizeof(WalDataMessageHeader);
  
+ 				/*
+ 				 * If replication_mode is "recv", send the last WAL receive
+ 				 * location to the primary, to acknowledge that replication
+ 				 * has been completed up to that.
+ 				 */
+ 				if (rplMode == REPLICATION_MODE_RECV)
+ 				{
+ 					XLogRecPtr	endptr = msghdr.dataStart;
+ 
+ 					XLByteAdvance(endptr, len);
+ 					XLogWalRcvSendRecPtr(endptr);
+ 				}
+ 
  				XLogWalRcvWrite(buf, len, msghdr.dataStart);
  				break;
  			}
***************
*** 523,528 **** XLogWalRcvFlush(void)
--- 560,573 ----
  
  		LogstreamResult.Flush = LogstreamResult.Write;
  
+ 		/*
+ 		 * If replication_mode is "fsync", send the last WAL flush
+ 		 * location to the primary, to acknowledge that replication
+ 		 * has been completed up to that.
+ 		 */
+ 		if (rplMode == REPLICATION_MODE_FSYNC)
+ 			XLogWalRcvSendRecPtr(LogstreamResult.Flush);
+ 
  		/* Update shared-memory status */
  		SpinLockAcquire(&walrcv->mutex);
  		walrcv->latestChunkStart = walrcv->receivedUpto;
***************
*** 544,546 **** XLogWalRcvFlush(void)
--- 589,612 ----
  		}
  	}
  }
+ 
+ /* Send the lsn to the primary server */
+ static void
+ XLogWalRcvSendRecPtr(XLogRecPtr recptr)
+ {
+ 	static char	   *msgbuf = NULL;
+ 	WalAckMessageData	msgdata;
+ 
+ 	/*
+ 	 * Allocate buffer that will be used for each output message if first
+ 	 * time through.  We do this just once to reduce palloc overhead.
+ 	 * The buffer must be made large enough for maximum-sized messages.
+ 	 */
+ 	if (msgbuf == NULL)
+ 		msgbuf = palloc(1 + sizeof(WalAckMessageData));
+ 
+ 	msgbuf[0] = 'l';
+ 	msgdata.ackEnd = recptr;
+ 	memcpy(msgbuf + 1, &msgdata, sizeof(WalAckMessageData));
+ 	walrcv_send(msgbuf, 1 + sizeof(WalAckMessageData));
+ }
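
To make the acknowledgement message built by XLogWalRcvSendRecPtr() above easier to picture, here is a small standalone sketch of its framing: one type byte 'l' followed by the raw bytes of a struct holding the acknowledged location. SketchRecPtr and SketchAckMessage are hypothetical stand-ins for XLogRecPtr and WalAckMessageData (only the ackEnd field is visible in the patch), and the sketch_* helpers are illustrative only. Because the struct is copied as raw bytes, this framing presumes both ends share byte order and struct layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for XLogRecPtr: log file id plus byte offset within it. */
typedef struct
{
    uint32_t    xlogid;
    uint32_t    xrecoff;
} SketchRecPtr;

/* Stand-in for WalAckMessageData; ackEnd is the only field shown. */
typedef struct
{
    SketchRecPtr ackEnd;
} SketchAckMessage;

#define SKETCH_ACK_LEN (1 + sizeof(SketchAckMessage))

/* Build the message: type byte 'l', then the struct. Returns its length. */
static size_t
sketch_build_ack(char *buf, size_t buflen, SketchRecPtr recptr)
{
    SketchAckMessage msg;

    assert(buf != NULL && buflen >= SKETCH_ACK_LEN);
    buf[0] = 'l';
    msg.ackEnd = recptr;
    memcpy(buf + 1, &msg, sizeof(msg));
    return SKETCH_ACK_LEN;
}

/* Parse the message; returns 0 on success, -1 if it is malformed. */
static int
sketch_parse_ack(const char *buf, size_t len, SketchRecPtr *out)
{
    SketchAckMessage msg;

    if (len != SKETCH_ACK_LEN || buf[0] != 'l')
        return -1;
    memcpy(&msg, buf + 1, sizeof(msg));
    *out = msg.ackEnd;
    return 0;
}

/* True if a <= b, comparing file id first and offset second. */
static int
sketch_ptr_le(SketchRecPtr a, SketchRecPtr b)
{
    return a.xlogid < b.xlogid ||
        (a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}
```

A receiver of such a message would typically compare the parsed location against the last one it accepted, using something like sketch_ptr_le(), and reject a location that moves backwards.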
*** a/src/backend/replication/walreceiverfuncs.c
--- b/src/backend/replication/walreceiverfuncs.c
***************
*** 168,178 **** ShutdownWalRcv(void)
  /*
   * Request postmaster to start walreceiver.
   *
!  * recptr indicates the position where streaming should begin, and conninfo
!  * is a libpq connection string to use.
   */
  void
! RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
--- 168,180 ----
  /*
   * Request postmaster to start walreceiver.
   *
!  * recptr indicates the position where streaming should begin, conninfo
!  * is a libpq connection string to use, and standbyName is the name of
!  * this standby.
   */
  void
! RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo,
! 					 const char *standbyName)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
***************
*** 196,201 **** RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
--- 198,207 ----
  		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
  	else
  		walrcv->conninfo[0] = '\0';
+ 	if (standbyName != NULL)
+ 		strlcpy((char *) walrcv->standbyName, standbyName, MAXSTANDBYNAME);
+ 	else
+ 		walrcv->standbyName[0] = '\0';
  	walrcv->walRcvState = WALRCV_STARTING;
  	walrcv->startTime = now;
  
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 39,44 ****
--- 39,45 ----
  
  #include "access/xlog_internal.h"
  #include "catalog/pg_type.h"
+ #include "libpq/hba.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "libpq/pqsignal.h"
***************
*** 48,53 ****
--- 49,55 ----
  #include "storage/fd.h"
  #include "storage/ipc.h"
  #include "storage/pmsignal.h"
+ #include "storage/proc.h"
  #include "tcop/tcopprot.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 60,67 **** WalSndCtlData *WalSndCtl = NULL;
--- 62,73 ----
  /* My slot in the shared memory array */
  static WalSnd *MyWalSnd = NULL;
  
+ /* Array of WalSndWaiter in shared memory */
+ static WalSndWaiter  *WalSndWaiters;
+ 
  /* Global state */
  bool		am_walsender = false;		/* Am I a walsender process ? */
+ char	   *standby_name = NULL;		/* Name of connected standby */
  
  /* User-settable parameters for walsender */
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
***************
*** 82,92 **** static uint32 sendOff = 0;
--- 88,125 ----
   */
  static XLogRecPtr sentPtr = {0, 0};
  
+ /*
+  * How far have we completed replication already? This is also
+  * advertised in MyWalSnd->ackdPtr. This is not used in the
+  * asynchronous replication case.
+  */
+ static XLogRecPtr ackdPtr = {0, 0};
+ 
  /* Flags set by signal handlers for later service in main loop */
  static volatile sig_atomic_t got_SIGHUP = false;
  static volatile sig_atomic_t shutdown_requested = false;
  static volatile sig_atomic_t ready_to_stop = false;
  
+ /* Flag set by signal handler of backends for replication */
+ static volatile sig_atomic_t replication_done = false;
+ 
+ /*
+  * pre-parsed content of standbys configuration file: list of
+  * StandbysLine structs
+  */
+ static List *parsed_standbys_lines = NIL;
+ 
+ static const char *standbys_keywords[] = {"all", NULL};
+ 
+ /*
+  * Path of standbys configuration file (relative to $PGDATA).
+  *
+  * XXX: Should we support a GUC parameter specifying the path of the
+  * standbys configuration file?
+  */
+ #define STANDBYS_FILENAME	"standbys.conf"
+ static char	*StandbysFileName = NULL;
+ 
  /* Signal handlers */
  static void WalSndSigHupHandler(SIGNAL_ARGS);
  static void WalSndShutdownHandler(SIGNAL_ARGS);
***************
*** 101,107 **** static void WalSndHandshake(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
  static bool XLogSend(char *msgbuf, bool *caughtup);
! static void CheckClosedConnection(void);
  
  
  /* Main entry point for walsender process */
--- 134,149 ----
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
  static bool XLogSend(char *msgbuf, bool *caughtup);
! static void ProcessStreamMsgs(StringInfo inMsg);
! 
! static void RegisterWalSndWaiter(BackendId backendId, XLogRecPtr record,
! 								 Latch *latch);
! static void WakeupWalSndWaiters(XLogRecPtr record);
! static XLogRecPtr GetOldestAckdPtr(void);
! 
! static bool parse_standbys_line(List *line, int line_num, StandbysLine *parsedline);
! static void free_standbys_record(StandbysLine *record);
! static void clean_standbys_list(List *lines);
  
  
  /* Main entry point for walsender process */
***************
*** 218,236 **** WalSndHandshake(void)
  						StringInfoData buf;
  						char		sysid[32];
  						char		tli[11];
  
  						/*
! 						 * Reply with a result set with one row, two columns.
! 						 * First col is system ID, and second is timeline ID
  						 */
  
  						snprintf(sysid, sizeof(sysid), UINT64_FORMAT,
  								 GetSystemIdentifier());
  						snprintf(tli, sizeof(tli), "%u", ThisTimeLineID);
  
  						/* Send a RowDescription message */
  						pq_beginmessage(&buf, 'T');
! 						pq_sendint(&buf, 2, 2); /* 2 fields */
  
  						/* first field */
  						pq_sendstring(&buf, "systemid");		/* col name */
--- 260,281 ----
  						StringInfoData buf;
  						char		sysid[32];
  						char		tli[11];
+ 						char		mode[8];
  
  						/*
! 						 * Reply with a result set with one row, three columns.
! 						 * First col is system ID, second is timeline ID, and
! 						 * third is replication mode.
  						 */
  
  						snprintf(sysid, sizeof(sysid), UINT64_FORMAT,
  								 GetSystemIdentifier());
  						snprintf(tli, sizeof(tli), "%u", ThisTimeLineID);
+ 						snprintf(mode, sizeof(mode), "%s", ReplicationModeNames[rplMode]);
  
  						/* Send a RowDescription message */
  						pq_beginmessage(&buf, 'T');
! 						pq_sendint(&buf, 3, 2); /* 3 fields */
  
  						/* first field */
  						pq_sendstring(&buf, "systemid");		/* col name */
***************
*** 249,263 **** WalSndHandshake(void)
  						pq_sendint(&buf, 4, 2); /* typlen */
  						pq_sendint(&buf, 0, 4); /* typmod */
  						pq_sendint(&buf, 0, 2); /* format code */
  						pq_endmessage(&buf);
  
  						/* Send a DataRow message */
  						pq_beginmessage(&buf, 'D');
! 						pq_sendint(&buf, 2, 2); /* # of columns */
  						pq_sendint(&buf, strlen(sysid), 4);		/* col1 len */
  						pq_sendbytes(&buf, (char *) &sysid, strlen(sysid));
  						pq_sendint(&buf, strlen(tli), 4);		/* col2 len */
  						pq_sendbytes(&buf, (char *) tli, strlen(tli));
  						pq_endmessage(&buf);
  
  						/* Send CommandComplete and ReadyForQuery messages */
--- 294,319 ----
  						pq_sendint(&buf, 4, 2); /* typlen */
  						pq_sendint(&buf, 0, 4); /* typmod */
  						pq_sendint(&buf, 0, 2); /* format code */
+ 
+ 						/* third field */
+ 						pq_sendstring(&buf, "replication_mode");	/* col name */
+ 						pq_sendint(&buf, 0, 4); /* table oid */
+ 						pq_sendint(&buf, 0, 2); /* attnum */
+ 						pq_sendint(&buf, TEXTOID, 4);	/* type oid */
+ 						pq_sendint(&buf, -1, 2);		/* typlen */
+ 						pq_sendint(&buf, 0, 4); /* typmod */
+ 						pq_sendint(&buf, 0, 2); /* format code */
  						pq_endmessage(&buf);
  
  						/* Send a DataRow message */
  						pq_beginmessage(&buf, 'D');
! 						pq_sendint(&buf, 3, 2); /* # of columns */
  						pq_sendint(&buf, strlen(sysid), 4);		/* col1 len */
  						pq_sendbytes(&buf, (char *) &sysid, strlen(sysid));
  						pq_sendint(&buf, strlen(tli), 4);		/* col2 len */
  						pq_sendbytes(&buf, (char *) tli, strlen(tli));
+ 						pq_sendint(&buf, strlen(mode), 4);	/* col3 len */
+ 						pq_sendbytes(&buf, (char *) &mode, strlen(mode));
  						pq_endmessage(&buf);
  
  						/* Send CommandComplete and ReadyForQuery messages */
***************
*** 295,304 **** WalSndHandshake(void)
  						pq_flush();
  
  						/*
! 						 * Initialize position to the received one, then the
  						 * xlog records begin to be shipped from that position
  						 */
! 						sentPtr = recptr;
  
  						/* break out of the loop */
  						replication_started = true;
--- 351,360 ----
  						pq_flush();
  
  						/*
! 						 * Initialize positions to the received one, then the
  						 * xlog records begin to be shipped from that position
  						 */
! 						sentPtr = ackdPtr = recptr;
  
  						/* break out of the loop */
  						replication_started = true;
***************
*** 332,384 **** WalSndHandshake(void)
  }
  
  /*
!  * Check if the remote end has closed the connection.
   */
  static void
! CheckClosedConnection(void)
  {
! 	unsigned char firstchar;
! 	int			r;
  
! 	r = pq_getbyte_if_available(&firstchar);
! 	if (r < 0)
! 	{
! 		/* unexpected error or EOF */
! 		ereport(COMMERROR,
! 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 				 errmsg("unexpected EOF on standby connection")));
! 		proc_exit(0);
! 	}
! 	if (r == 0)
  	{
! 		/* no data available without blocking */
! 		return;
! 	}
  
- 	/* Handle the very limited subset of commands expected in this phase */
- 	switch (firstchar)
- 	{
  			/*
  			 * 'X' means that the standby is closing down the socket.
  			 */
! 		case 'X':
! 			proc_exit(0);
  
! 		default:
! 			ereport(FATAL,
! 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 					 errmsg("invalid standby closing message type %d",
! 							firstchar)));
  	}
  }
  
  /* Main loop of walsender process */
  static int
  WalSndLoop(void)
  {
  	char	   *output_message;
  	bool		caughtup = false;
  
  	/*
  	 * Allocate buffer that will be used for each output message.  We do this
  	 * just once to reduce palloc overhead.  The buffer must be made large
--- 388,512 ----
  }
  
  /*
!  * Process messages received from the standby.
!  *
!  * ereports on error.
   */
  static void
! ProcessStreamMsgs(StringInfo inMsg)
  {
! 	bool	acked = false;
  
! 	/* Loop to process successive complete messages available */
! 	for (;;)
  	{
! 		unsigned char firstchar;
! 		int			r;
! 
! 		r = pq_getbyte_if_available(&firstchar);
! 		if (r < 0)
! 		{
! 			/* unexpected error or EOF */
! 			ereport(COMMERROR,
! 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 					 errmsg("unexpected EOF on standby connection")));
! 			proc_exit(0);
! 		}
! 		if (r == 0)
! 		{
! 			/* no data available without blocking */
! 			break;
! 		}
! 
! 		/* Handle the very limited subset of commands expected in this phase */
! 		switch (firstchar)
! 		{
! 			case 'd':       /* CopyData message */
! 			{
! 				unsigned char	rpltype;
! 
! 				/*
! 				 * Read the message contents. This is expected to be done without
! 				 * blocking because we've been able to get message type code.
! 				 */
! 				if (pq_getmessage(inMsg, 0))
! 					proc_exit(0);		/* suitable message already logged */
! 
! 				/* Read the replication message type from CopyData message */
! 				rpltype = pq_getmsgbyte(inMsg);
! 				switch (rpltype)
! 				{
! 					case 'l':
! 					{
! 						WalAckMessageData  *msgdata;
! 
! 						msgdata = (WalAckMessageData *) pq_getmsgbytes(inMsg, sizeof(WalAckMessageData));
! 
! 						/*
! 						 * Update local status.
! 						 *
! 						 * The ackd ptr received from standby should not
! 						 * go backwards.
! 						 */
! 						if (XLByteLE(ackdPtr, msgdata->ackEnd))
! 							ackdPtr = msgdata->ackEnd;
! 						else
! 							ereport(FATAL,
! 									(errmsg("replication completion location went back from "
! 											"%X/%X to %X/%X",
! 											ackdPtr.xlogid, ackdPtr.xrecoff,
! 											msgdata->ackEnd.xlogid, msgdata->ackEnd.xrecoff)));
! 
! 						acked = true;	/* also need to update shared position */
! 						break;
! 					}
! 					default:
! 						ereport(FATAL,
! 								(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 								 errmsg("invalid replication message type %d",
! 										rpltype)));
! 				}
! 				break;
! 			}
  
  			/*
  			 * 'X' means that the standby is closing down the socket.
  			 */
! 			case 'X':
! 				proc_exit(0);
  
! 			default:
! 				ereport(FATAL,
! 						(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 						 errmsg("invalid standby message type %d",
! 								firstchar)));
! 		}
  	}
+ 
+ 	if (acked)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile WalSnd *walsnd = MyWalSnd;
+ 
+ 		SpinLockAcquire(&walsnd->mutex);
+ 		walsnd->ackdPtr = ackdPtr;
+ 		SpinLockRelease(&walsnd->mutex);
+  	}
+ 
+ 	/* Wake up the backends that this walsender had been blocking */
+ 	WakeupWalSndWaiters(GetOldestAckdPtr());
  }
  
  /* Main loop of walsender process */
  static int
  WalSndLoop(void)
  {
+ 	StringInfoData	input_message;
  	char	   *output_message;
  	bool		caughtup = false;
  
+ 	initStringInfo(&input_message);
+ 
  	/*
  	 * Allocate buffer that will be used for each output message.  We do this
  	 * just once to reduce palloc overhead.  The buffer must be made large
***************
*** 455,462 **** WalSndLoop(void)
  								  WalSndDelay * 1000L);
  			}
  
! 			/* Check if the connection was closed */
! 			CheckClosedConnection();
  		}
  		else
  		{
--- 583,590 ----
  								  WalSndDelay * 1000L);
  			}
  
! 			/* Process messages received from the standby */
! 			ProcessStreamMsgs(&input_message);
  		}
  		else
  		{
***************
*** 515,520 **** InitWalSnd(void)
--- 643,650 ----
  			 */
  			walsnd->pid = MyProcPid;
  			MemSet(&walsnd->sentPtr, 0, sizeof(XLogRecPtr));
+ 			MemSet(&walsnd->ackdPtr, 0, sizeof(XLogRecPtr));
+ 			walsnd->rplMode = rplMode;
  			SpinLockRelease(&walsnd->mutex);
  			/* don't need the lock anymore */
  			OwnLatch((Latch *) &walsnd->latch);
***************
*** 540,545 **** WalSndKill(int code, Datum arg)
--- 670,679 ----
  {
  	Assert(MyWalSnd != NULL);
  
+ 	/* Wake up the backends that this walsender had been blocking */
+ 	MyWalSnd->rplMode = InvalidReplicationMode;
+ 	WakeupWalSndWaiters(GetOldestAckdPtr());
+ 
  	/*
  	 * Mark WalSnd struct no longer in use. Assume that no lock is required
  	 * for this.
***************
*** 904,909 **** WalSndShmemSize(void)
--- 1038,1050 ----
  	size = offsetof(WalSndCtlData, walsnds);
  	size = add_size(size, mul_size(max_wal_senders, sizeof(WalSnd)));
  
+ 	/*
+ 	 * If replication is enabled, we have a data structure called
+ 	 * WalSndWaiters, created in shared memory.
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		size = add_size(size, mul_size(MaxBackends, sizeof(WalSndWaiter)));
+ 
  	return size;
  }
  
***************
*** 913,926 **** WalSndShmemInit(void)
  {
  	bool		found;
  	int			i;
  
  	WalSndCtl = (WalSndCtlData *)
! 		ShmemInitStruct("Wal Sender Ctl", WalSndShmemSize(), &found);
  
  	if (!found)
  	{
  		/* First time through, so initialize */
! 		MemSet(WalSndCtl, 0, WalSndShmemSize());
  
  		for (i = 0; i < max_wal_senders; i++)
  		{
--- 1054,1069 ----
  {
  	bool		found;
  	int			i;
+ 	Size		size = add_size(offsetof(WalSndCtlData, walsnds),
+ 								mul_size(max_wal_senders, sizeof(WalSnd)));
  
  	WalSndCtl = (WalSndCtlData *)
! 		ShmemInitStruct("Wal Sender Ctl", size, &found);
  
  	if (!found)
  	{
  		/* First time through, so initialize */
! 		MemSet(WalSndCtl, 0, size);
  
  		for (i = 0; i < max_wal_senders; i++)
  		{
***************
*** 930,935 **** WalSndShmemInit(void)
--- 1073,1088 ----
  			InitSharedLatch(&walsnd->latch);
  		}
  	}
+ 
+ 	/* Create or attach to the WalSndWaiters array too, if needed */
+ 	if (max_wal_senders > 0)
+ 	{
+ 		WalSndWaiters = (WalSndWaiter *)
+ 			ShmemInitStruct("WalSndWaiters",
+ 							mul_size(MaxBackends, sizeof(WalSndWaiter)),
+ 							&found);
+ 		WalSndCtl->maxWaiters = MaxBackends;
+ 	}
  }
  
  /* Wake up all walsenders */
***************
*** 943,977 **** WalSndWakeup(void)
  }
  
  /*
!  * This isn't currently used for anything. Monitoring tools might be
!  * interested in the future, and we'll need something like this in the
!  * future for synchronous replication.
   */
! #ifdef NOT_USED
  /*
!  * Returns the oldest Send position among walsenders. Or InvalidXLogRecPtr
!  * if none.
   */
! XLogRecPtr
! GetOldestWALSendPointer(void)
  {
  	XLogRecPtr	oldest = {0, 0};
! 	int			i;
! 	bool		found = false;
  
  	for (i = 0; i < max_wal_senders; i++)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! 		XLogRecPtr	recptr;
  
! 		if (walsnd->pid == 0)
  			continue;
  
  		SpinLockAcquire(&walsnd->mutex);
! 		recptr = walsnd->sentPtr;
  		SpinLockRelease(&walsnd->mutex);
  
  		if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  			continue;
  
--- 1096,1283 ----
  }
  
  /*
!  * Ensure that replication has been completed up to the given position.
   */
! void
! WaitXLogSend(XLogRecPtr record)
! {
! 	int		i;
! 
! 	Assert(max_wal_senders > 0);
! 
! 	for (i = 0; i < max_wal_senders; i++)
! 	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! 		XLogRecPtr		recptr;
! 
! 		/* Don't need to wait for asynchronous walsender */
! 		if (walsnd->pid == 0 ||
! 			walsnd->rplMode <= REPLICATION_MODE_ASYNC)
! 			continue;
! 
! 		SpinLockAcquire(&walsnd->mutex);
! 		recptr = walsnd->ackdPtr;
! 		SpinLockRelease(&walsnd->mutex);
! 
! 		if (recptr.xlogid == 0 && recptr.xrecoff == 0)
! 			continue;
! 
! 		if (XLByteLT(recptr, record))
! 		{
! 			/*
! 			 * Register myself into the wait list and sleep until
! 			 * replication has been completed up to the given position
! 			 * and the walsender signals me.
! 			 *
! 			 * If replication has been completed up to the latest
! 			 * position before the registration, walsender might be
! 			 * unable to send the signal immediately. We must wake up
! 			 * the walsender after the registration.
! 			 */
! 			ResetLatch(&MyProc->latch);
! 			RegisterWalSndWaiter(MyBackendId, record, &MyProc->latch);
! 			WalSndWakeup();
! 
! 			for (;;)
! 			{
! 				WaitLatch(&MyProc->latch, 1000000L);
! 				if (replication_done)
! 				{
! 					replication_done = false;
! 					return;
! 				}
! 			}
! 		}
! 	}
! }
! 
  /*
!  * Register the given backend into the wait list.
   */
! static void
! RegisterWalSndWaiter(BackendId backendId, XLogRecPtr record, Latch *latch)
! {
! 	/* use volatile pointer to prevent code rearrangement */
! 	volatile WalSndCtlData	*walsndctl = WalSndCtl;
! 	int		i;
! 	int		count = 0;
! 
! 	LWLockAcquire(WalSndWaiterLock, LW_EXCLUSIVE);
! 
! 	/* Out of slots. This should not happen. */
! 	if (walsndctl->numWaiters + 1 > walsndctl->maxWaiters)
! 		elog(PANIC, "out of replication waiters slots");
! 
! 	/*
! 	 * The given position is expected to be relatively new in the
! 	 * wait list. Since the entries in the list are sorted in an
! 	 * increasing order of XLogRecPtr, we can shorten the time it
! 	 * takes to find an insert slot by scanning the list backwards.
! 	 */
! 	for (i = walsndctl->numWaiters; i > 0; i--)
! 	{
! 		if (XLByteLE(WalSndWaiters[i - 1].record, record))
! 			break;
! 		count++;
! 	}
! 
! 	/* Shuffle the list if needed */
! 	if (count > 0)
! 		memmove(&WalSndWaiters[i + 1], &WalSndWaiters[i],
! 				count * sizeof(WalSndWaiter));
! 
! 	WalSndWaiters[i].backendId = backendId;
! 	WalSndWaiters[i].record = record;
! 	WalSndWaiters[i].latch = latch;
! 	walsndctl->numWaiters++;
! 
! 	LWLockRelease(WalSndWaiterLock);
! }
! 
! /*
!  * Wake up the backends waiting for replication to be completed
!  * up to a position older than or equal to the given one.
!  *
!  * Wake up all waiters if InvalidXLogRecPtr is given.
!  */
! static void
! WakeupWalSndWaiters(XLogRecPtr record)
! {
! 	/* use volatile pointer to prevent code rearrangement */
! 	volatile WalSndCtlData	*walsndctl = WalSndCtl;
! 	int		i;
! 	int		count = 0;
! 	bool	all_wakeup = (record.xlogid == 0 && record.xrecoff == 0);
! 
! 	LWLockAcquire(WalSndWaiterLock, LW_EXCLUSIVE);
! 
! 	for (i = 0; i < walsndctl->numWaiters; i++)
! 	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile WalSndWaiter  *waiter = &WalSndWaiters[i];
! 
! 		if (all_wakeup || XLByteLE(waiter->record, record))
! 		{
! 			SetProcLatch(waiter->latch, PROCSIG_REPLICATION_INTERRUPT,
! 						 waiter->backendId);
! 			count++;
! 		}
! 		else
! 		{
! 			/*
! 			 * If the backend waiting for the Ack position newer than
! 			 * the given one is found, we don't need to search the wait
! 			 * list any more. This is because the waiters in the list
! 			 * are guaranteed to be sorted in an increasing order of
! 			 * XLogRecPtr.
! 			 */
! 			break;
! 		}
! 	}
! 
! 	/* If there are still some waiters, left-justify them in the list */
! 	walsndctl->numWaiters -= count;
! 	if (walsndctl->numWaiters > 0 && count > 0)
! 		memmove(&WalSndWaiters[0], &WalSndWaiters[i],
! 				walsndctl->numWaiters * sizeof(WalSndWaiter));
! 
! 	LWLockRelease(WalSndWaiterLock);
! }
! 
! /*
!  * Returns the oldest Ack position in synchronous walsenders. Or
!  * InvalidXLogRecPtr if none.
!  */
! static XLogRecPtr
! GetOldestAckdPtr(void)
  {
  	XLogRecPtr	oldest = {0, 0};
! 	int		i;
! 	bool	found = false;
  
  	for (i = 0; i < max_wal_senders; i++)
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! 		XLogRecPtr		recptr;
  
! 		/*
! 		 * Ignore the Ack position of an asynchronous walsender,
! 		 * since it never receives any Ack.
! 		 */
! 		if (walsnd->pid == 0 ||
! 			walsnd->rplMode <= REPLICATION_MODE_ASYNC)
  			continue;
  
  		SpinLockAcquire(&walsnd->mutex);
! 		recptr = walsnd->ackdPtr;
  		SpinLockRelease(&walsnd->mutex);
  
+ 		/*
+ 		 * Ignore the Ack position of a walsender that has not
+ 		 * received any Ack yet.
+ 		 */
  		if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  			continue;
  
***************
*** 982,985 **** GetOldestWALSendPointer(void)
  	return oldest;
  }
  
! #endif
--- 1288,1500 ----
  	return oldest;
  }
  
! /*
!  * This is called when PROCSIG_REPLICATION_INTERRUPT is received.
!  */
! void
! HandleReplicationInterrupt(void)
! {
! 	replication_done = true;
! }
! 
! 
! /* ----------
!  * Routines to handle standbys configuration file
!  * ----------
!  */
! 
! /*
!  * Scan the (pre-parsed) standbys configuration file line by line,
!  * looking for a match to the standby name passed from the standby.
!  */
! bool
! check_standbys(void)
! {
! 	ListCell   *line;
! 	StandbysLine *standbys;
! 
! 	foreach(line, parsed_standbys_lines)
! 	{
! 		char	   *tok;
! 
! 		standbys = (StandbysLine *) lfirst(line);
! 
! 		/* Check standby name */
! 		for (tok = strtok(standbys->standbyName, MULTI_VALUE_SEP);
! 			 tok != NULL;
! 			 tok = strtok(NULL, MULTI_VALUE_SEP))
! 		{
! 			if (strcmp(tok, "all\n") == 0 ||
! 				(standby_name != NULL &&
! 				 strcmp(tok, standby_name) == 0))
! 			{
! 				rplMode = standbys->rplMode;
! 				return true;
! 			}
! 		}
! 	}
! 	return false;
! }
! 
! /*
!  * Parse one line in the standbys configuration file and store
!  * the result in a StandbysLine structure.
!  */
! static bool
! parse_standbys_line(List *line, int line_num, StandbysLine *parsedline)
! {
! 	char	   *token;
! 	ListCell   *line_item;
! 
! 	line_item = list_head(line);
! 
! 	parsedline->linenumber = line_num;
! 
! 	/* Get the standby name. */
! 	parsedline->standbyName = pstrdup(lfirst(line_item));
! 
! 	/* Get the mode. */
! 	line_item = lnext(line_item);
! 	if (!line_item)
! 	{
! 		ereport(LOG,
! 				(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 				 errmsg("end-of-line before mode specification"),
! 				 errcontext("line %d of configuration file \"%s\"",
! 							line_num, StandbysFileName)));
! 		return false;
! 	}
! 	token = lfirst(line_item);
! 
! 	parsedline->rplMode = ReplicationModeNameGetValue(token);
! 	if (parsedline->rplMode == InvalidReplicationMode)
! 	{
! 		ereport(LOG,
! 				(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 				 errmsg("invalid replication mode \"%s\"",
! 						token),
! 				 errcontext("line %d of configuration file \"%s\"",
! 							line_num, StandbysFileName)));
! 		return false;
! 	}
! 
! 	/* Ignore remaining tokens */
! 
! 	return true;
! }
! 
! /*
!  * Free a StandbysLine structure
!  */
! static void
! free_standbys_record(StandbysLine *record)
! {
! 	if (record->standbyName)
! 		pfree(record->standbyName);
! 	pfree(record);
! }
! 
! /*
!  * Free all records on the parsed Standbys list
!  */
! static void
! clean_standbys_list(List *lines)
! {
! 	ListCell   *line;
! 
! 	foreach(line, lines)
! 	{
! 		StandbysLine    *parsed = (StandbysLine *) lfirst(line);
! 
! 		if (parsed)
! 			free_standbys_record(parsed);
! 	}
! 	list_free(lines);
! }
! 
! /*
!  * Read the config file and create a List of StandbysLine records for the contents.
!  *
!  * The configuration is read into a temporary list, and if any parse error occurs
!  * the old list is kept in place and false is returned. Only if the whole file
!  * parses OK is the list replaced, and the function returns true.
!  */
! bool
! load_standbys(void)
! {
! 	FILE	   *file;
! 	List	   *standbys_lines = NIL;
! 	List	   *standbys_line_nums = NIL;
! 	ListCell   *line,
! 			   *line_num;
! 	List	   *new_parsed_lines = NIL;
! 	bool		ok = true;
! 
! 	/* Ignore standbys.conf if replication is not enabled */
! 	if (max_wal_senders <= 0)
! 		return true;
! 
! 	/* If first time through, convert relative path to absolute */
! 	if (StandbysFileName == NULL)
! 		StandbysFileName = make_absolute_path(STANDBYS_FILENAME);
! 
! 	file = AllocateFile(StandbysFileName, "r");
! 	if (file == NULL)
! 	{
! 		ereport(LOG,
! 				(errcode_for_file_access(),
! 				 errmsg("could not open configuration file \"%s\": %m",
! 						StandbysFileName)));
! 
! 		/*
! 		 * Caller will take care of making this a FATAL error in case this is
! 		 * the initial startup. If it happens on reload, we just keep the old
! 		 * version around.
! 		 */
! 		return false;
! 	}
! 
! 	tokenize_file(StandbysFileName, file, &standbys_lines, &standbys_line_nums,
! 				  standbys_keywords);
! 	FreeFile(file);
! 
! 	/* Now parse all the lines */
! 	forboth(line, standbys_lines, line_num, standbys_line_nums)
! 	{
! 		StandbysLine    *newline;
! 
! 		newline = palloc0(sizeof(StandbysLine));
! 
! 		if (!parse_standbys_line(lfirst(line), lfirst_int(line_num), newline))
! 		{
! 			/* Parse error in the file, so indicate there's a problem */
! 			free_standbys_record(newline);
! 			ok = false;
! 
! 			/*
! 			 * Keep parsing the rest of the file so we can report errors on
! 			 * more than the first row. Error has already been reported in the
! 			 * parsing function, so no need to log it here.
! 			 */
! 			continue;
! 		}
! 
! 		new_parsed_lines = lappend(new_parsed_lines, newline);
! 	}
! 
! 	/* Free the temporary lists */
! 	free_lines(&standbys_lines, &standbys_line_nums);
! 
! 	if (!ok)
! 	{
! 		/* Parsing failed at one or more rows, so bail out */
! 		clean_standbys_list(new_parsed_lines);
! 		return false;
! 	}
! 
! 	/* Loaded new file successfully, replace the one we use */
! 	clean_standbys_list(parsed_standbys_lines);
! 	parsed_standbys_lines = new_parsed_lines;
! 
! 	return true;
! }
*** a/src/backend/storage/ipc/procsignal.c
--- b/src/backend/storage/ipc/procsignal.c
***************
*** 20,25 ****
--- 20,26 ----
  #include "bootstrap/bootstrap.h"
  #include "commands/async.h"
  #include "miscadmin.h"
+ #include "replication/walsender.h"
  #include "storage/ipc.h"
  #include "storage/latch.h"
  #include "storage/procsignal.h"
***************
*** 172,177 **** CleanupProcSignalState(int status, Datum arg)
--- 173,192 ----
  int
  SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
  {
+ 	if (SetProcSignalReason(pid, reason, backendId))
+ 		return kill(pid, SIGUSR1);		/* Send signal */
+ 
+ 	errno = ESRCH;
+ 	return -1;
+ }
+ 
+ /*
+  * SetProcSignalReason
+  *		Set the reason flag
+  */
+ bool
+ SetProcSignalReason(pid_t pid, ProcSignalReason reason, BackendId backendId)
+ {
  	volatile ProcSignalSlot *slot;
  
  	if (backendId != InvalidBackendId)
***************
*** 190,197 **** SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
  		{
  			/* Atomically set the proper flag */
  			slot->pss_signalFlags[reason] = true;
! 			/* Send signal */
! 			return kill(pid, SIGUSR1);
  		}
  	}
  	else
--- 205,211 ----
  		{
  			/* Atomically set the proper flag */
  			slot->pss_signalFlags[reason] = true;
! 			return true;
  		}
  	}
  	else
***************
*** 214,227 **** SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
  
  				/* Atomically set the proper flag */
  				slot->pss_signalFlags[reason] = true;
! 				/* Send signal */
! 				return kill(pid, SIGUSR1);
  			}
  		}
  	}
! 
! 	errno = ESRCH;
! 	return -1;
  }
  
  /*
--- 228,238 ----
  
  				/* Atomically set the proper flag */
  				slot->pss_signalFlags[reason] = true;
! 				return true;
  			}
  		}
  	}
! 	return false;
  }
  
  /*
***************
*** 279,284 **** procsignal_sigusr1_handler(SIGNAL_ARGS)
--- 290,298 ----
  	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
  		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
  
+ 	if (CheckProcSignal(PROCSIG_REPLICATION_INTERRUPT))
+ 		HandleReplicationInterrupt();
+ 
  	latch_sigusr1_handler();
  
  	errno = save_errno;
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 196,201 **** InitProcGlobal(void)
--- 196,202 ----
  		PGSemaphoreCreate(&(procs[i].sem));
  		procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
  		ProcGlobal->freeProcs = &procs[i];
+ 		InitSharedLatch(&procs[i].latch);
  	}
  
  	/*
***************
*** 214,219 **** InitProcGlobal(void)
--- 215,221 ----
  		PGSemaphoreCreate(&(procs[i].sem));
  		procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
  		ProcGlobal->autovacFreeProcs = &procs[i];
+ 		InitSharedLatch(&procs[i].latch);
  	}
  
  	/*
***************
*** 325,330 **** InitProcess(void)
--- 327,333 ----
  	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
  		SHMQueueInit(&(MyProc->myProcLocks[i]));
  	MyProc->recoveryConflictPending = false;
+ 	OwnLatch(&MyProc->latch);
  
  	/*
  	 * We might be reusing a semaphore that belonged to a failed process. So
***************
*** 688,693 **** ProcKill(int code, Datum arg)
--- 691,697 ----
  	}
  
  	/* PGPROC struct isn't mine anymore */
+ 	DisownLatch(&MyProc->latch);
  	MyProc = NULL;
  
  	/* Update shared estimate of spins_per_delay */
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***************
*** 664,669 **** InitPostgres(const char *in_dbname, Oid dboid, const char *username,
--- 664,690 ----
  					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  					 errmsg("must be superuser to start walsender")));
  
+ 		/*
+ 		 * In EXEC_BACKEND case, we didn't inherit the contents of standbys.conf
+ 		 * etcetera from the postmaster, and have to load them ourselves.  Note we
+ 		 * are loading them into the startup transaction's memory context, not
+ 		 * PostmasterContext, but that shouldn't matter.
+ 		 *
+ 		 * FIXME: [fork/exec] Ugh.	Is there a way around this overhead?
+ 		 */
+ #ifdef EXEC_BACKEND
+ 		if (!load_standbys())
+ 		{
+ 			ereport(FATAL,
+ 					(errmsg("could not load standbys.conf")));
+ 		}
+ #endif
+ 
+ 		if (!check_standbys())
+ 			ereport(FATAL,
+ 					(errmsg("no standbys.conf entry for standby name \"%s\"",
+ 							standby_name)));
+ 
  		/* process any options passed in the startup packet */
  		if (MyProcPort != NULL)
  			process_startup_options(MyProcPort, am_superuser);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 189,194 **** typedef enum
--- 189,229 ----
  
  extern XLogRecPtr XactLastRecEnd;
  
+ /*
+  * Replication mode. This is used to identify how long transaction
+  * commit should wait for replication.
+  *
+  * REPLICATION_MODE_ASYNC doesn't make transaction commit wait for
+  * replication, i.e., asynchronous replication.
+  *
+  * REPLICATION_MODE_RECV makes transaction commit wait for XLOG
+  * records to be received on the standby.
+  *
+  * REPLICATION_MODE_FSYNC makes transaction commit wait for XLOG
+  * records to be received and fsync'd on the standby.
+  *
+  * REPLICATION_MODE_REPLAY makes transaction commit wait for XLOG
+  * records to be received, fsync'd and replayed on the standby.
+  */
+ typedef enum ReplicationMode
+ {
+ 	InvalidReplicationMode = -1,
+ 	REPLICATION_MODE_ASYNC = 0,
+ 	REPLICATION_MODE_RECV,
+ 	REPLICATION_MODE_FSYNC,
+ 	REPLICATION_MODE_REPLAY
+ 
+ 	/*
+ 	 * NOTE: if you add a new mode, change MAXREPLICATIONMODE below
+ 	 * and update the ReplicationModeNames array in xlog.c
+ 	 */
+ } ReplicationMode;
+ 
+ #define MAXREPLICATIONMODE		REPLICATION_MODE_REPLAY
+ 
+ extern const char *ReplicationModeNames[];
+ extern ReplicationMode	rplMode;
+ 
  /* these variables are GUC parameters related to XLOG */
  extern int	CheckPointSegments;
  extern int	wal_keep_segments;
***************
*** 298,303 **** extern void XLogPutNextOid(Oid nextOid);
--- 333,339 ----
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);
+ extern XLogRecPtr GetReplayRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  extern TimeLineID GetRecoveryTargetTLI(void);
  
***************
*** 305,308 **** extern void HandleStartupProcInterrupts(void);
--- 341,346 ----
  extern void StartupProcessMain(void);
  extern void WakeupRecovery(void);
  
+ extern ReplicationMode ReplicationModeNameGetValue(char *name);
+ 
  #endif   /* XLOG_H */
*** a/src/include/libpq/hba.h
--- b/src/include/libpq/hba.h
***************
*** 15,20 ****
--- 15,24 ----
  #include "libpq/pqcomm.h"
  
  
+ /* This is used to separate values in multi-valued column strings */
+ #define MULTI_VALUE_SEP "\001"
+ 
+ 
  typedef enum UserAuth
  {
  	uaReject,
***************
*** 89,93 **** extern int check_usermap(const char *usermap_name,
--- 93,100 ----
  			  const char *pg_role, const char *auth_user,
  			  bool case_sensitive);
  extern bool pg_isblank(const char c);
+ extern void tokenize_file(const char *filename, FILE *file,
+ 			  List **lines, List **line_nums, const char **keywords);
+ extern void free_lines(List **lines, List **line_nums);
  
  #endif   /* HBA_H */
*** a/src/include/replication/walprotocol.h
--- b/src/include/replication/walprotocol.h
***************
*** 50,53 **** typedef struct
--- 50,63 ----
   */
  #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
  
+ /*
+  * Body for a WAL acknowledgment message (message type 'l'). This is wrapped
+  * within a CopyData message at the FE/BE protocol level.
+  */
+ typedef struct
+ {
+ 	/* End of WAL replicated to the standby */
+ 	XLogRecPtr	ackEnd;
+ } WalAckMessageData;
+ 
  #endif   /* _WALPROTOCOL_H */
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 26,31 **** extern bool am_walreceiver;
--- 26,38 ----
  #define MAXCONNINFO		1024
  
  /*
+  * MAXSTANDBYNAME: maximum size of standby name.
+  *
+  * XXX: Should this move to pg_config_manual.h?
+  */
+ #define MAXSTANDBYNAME	64
+ 
+ /*
   * Values for WalRcv->walRcvState.
   */
  typedef enum
***************
*** 71,89 **** typedef struct
  	 */
  	char		conninfo[MAXCONNINFO];
  
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
  
  extern WalRcvData *WalRcv;
  
  /* libpqwalreceiver hooks */
! typedef bool (*walrcv_connect_type) (char *conninfo, XLogRecPtr startpoint);
  extern PGDLLIMPORT walrcv_connect_type walrcv_connect;
  
  typedef bool (*walrcv_receive_type) (int timeout, unsigned char *type,
  												 char **buffer, int *len);
  extern PGDLLIMPORT walrcv_receive_type walrcv_receive;
  
  typedef void (*walrcv_disconnect_type) (void);
  extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
  
--- 78,106 ----
  	 */
  	char		conninfo[MAXCONNINFO];
  
+ 	/*
+ 	 * Standby name; used by the master to determine the replication mode
+ 	 * from the standbys configuration file.
+ 	 */
+ 	char		standbyName[MAXSTANDBYNAME];
+ 
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
  
  extern WalRcvData *WalRcv;
  
  /* libpqwalreceiver hooks */
! typedef bool (*walrcv_connect_type) (char *conninfo, XLogRecPtr startpoint,
! 									 char *standbyName);
  extern PGDLLIMPORT walrcv_connect_type walrcv_connect;
  
  typedef bool (*walrcv_receive_type) (int timeout, unsigned char *type,
  												 char **buffer, int *len);
  extern PGDLLIMPORT walrcv_receive_type walrcv_receive;
  
+ typedef void (*walrcv_send_type) (const char *buffer, int nbytes);
+ extern PGDLLIMPORT walrcv_send_type walrcv_send;
+ 
  typedef void (*walrcv_disconnect_type) (void);
  extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
  
***************
*** 95,101 **** extern Size WalRcvShmemSize(void);
  extern void WalRcvShmemInit(void);
  extern void ShutdownWalRcv(void);
  extern bool WalRcvInProgress(void);
! extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo);
  extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
  
  #endif   /* _WALRECEIVER_H */
--- 112,119 ----
  extern void WalRcvShmemInit(void);
  extern void ShutdownWalRcv(void);
  extern bool WalRcvInProgress(void);
! extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo,
! 								 const char *standbyName);
  extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
  
  #endif   /* _WALRECEIVER_H */
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 23,28 **** typedef struct WalSnd
--- 23,31 ----
  {
  	pid_t		pid;			/* this walsender's process id, or 0 */
  	XLogRecPtr	sentPtr;		/* WAL has been sent up to this point */
+ 	XLogRecPtr	ackdPtr;		/* WAL has been replicated up to this point */
+ 
+ 	ReplicationMode	rplMode;	/* replication mode */
  
  	slock_t		mutex;			/* locks shared variables shown above */
  
***************
*** 36,57 **** typedef struct WalSnd
--- 39,91 ----
  /* There is one WalSndCtl struct for the whole database cluster */
  typedef struct
  {
+ 	/* Protected by WalSndWaiterLock */
+ 	int			numWaiters;	/* current # of WalSndWaiters */
+ 	int			maxWaiters;	/* allocated size of WalSndWaiters */
+ 
  	WalSnd		walsnds[1];		/* VARIABLE LENGTH ARRAY */
  } WalSndCtlData;
  
  extern WalSndCtlData *WalSndCtl;
  
+ /*
+  * Each waiter has a WalSndWaiter struct in shared memory.
+  */
+ typedef struct WalSndWaiter
+ {
+ 	BackendId	backendId;	/* this waiter's backend ID */
+ 	XLogRecPtr	record;		/* this waiter waits for replication to be
+ 							 * acked up to this point */
+ 	Latch	   *latch;		/* pointer to the latch used to wake up this
+ 							 * waiter */
+ } WalSndWaiter;
+ 
  /* global state */
  extern bool am_walsender;
+ extern char *standby_name;
  
  /* user-settable parameters */
  extern int	WalSndDelay;
  extern int	max_wal_senders;
  
+ /* struct definition for standbys configuration file */
+ typedef struct
+ {
+ 	int			linenumber;
+ 	char	   *standbyName;
+ 	ReplicationMode	rplMode;
+ } StandbysLine;
+ 
  extern int	WalSenderMain(void);
  extern void WalSndSignals(void);
  extern Size WalSndShmemSize(void);
  extern void WalSndShmemInit(void);
  extern void WalSndWakeup(void);
+ extern void WaitXLogSend(XLogRecPtr record);
+ 
+ extern void HandleReplicationInterrupt(void);
+ 
+ extern bool check_standbys(void);
+ extern bool load_standbys(void);
  
  #endif   /* _WALSENDER_H */
*** a/src/include/storage/latch.h
--- b/src/include/storage/latch.h
***************
*** 16,21 ****
--- 16,23 ----
  
  #include <signal.h>
  
+ #include "storage/procsignal.h"
+ 
  /*
   * Latch structure should be treated as opaque and only accessed through
   * the public functions. It is defined here to allow embedding Latches as
***************
*** 42,47 **** extern bool WaitLatch(volatile Latch *latch, long timeout);
--- 44,51 ----
  extern int	WaitLatchOrSocket(volatile Latch *latch, pgsocket sock,
  				  long timeout);
  extern void SetLatch(volatile Latch *latch);
+ extern void SetProcLatch(volatile Latch *latch,
+ 				  ProcSignalReason reason, BackendId backendId);
  extern void ResetLatch(volatile Latch *latch);
  #define TestLatch(latch) (((volatile Latch *) latch)->is_set)
  
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 70,75 **** typedef enum LWLockId
--- 70,76 ----
  	RelationMappingLock,
  	AsyncCtlLock,
  	AsyncQueueLock,
+ 	WalSndWaiterLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 14,19 ****
--- 14,20 ----
  #ifndef _PROC_H_
  #define _PROC_H_
  
+ #include "storage/latch.h"
  #include "storage/lock.h"
  #include "storage/pg_sema.h"
  #include "utils/timestamp.h"
***************
*** 116,121 **** struct PGPROC
--- 117,128 ----
  								 * lock object by this backend */
  
  	/*
+ 	 * Latch used by walsenders to wake up this backend when replication
+ 	 * has been done.
+ 	 */
+ 	Latch		latch;
+ 
+ 	/*
  	 * All PROCLOCK objects for locks held or awaited by this backend are
  	 * linked into one of these lists, according to the partition number of
  	 * their lock.
*** a/src/include/storage/procsignal.h
--- b/src/include/storage/procsignal.h
***************
*** 40,45 **** typedef enum
--- 40,47 ----
  	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
  	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
  
+ 	PROCSIG_REPLICATION_INTERRUPT,	/* replication interrupt */
+ 
  	NUM_PROCSIGNALS				/* Must be last! */
  } ProcSignalReason;
  
***************
*** 52,57 **** extern void ProcSignalShmemInit(void);
--- 54,61 ----
  extern void ProcSignalInit(int pss_idx);
  extern int SendProcSignal(pid_t pid, ProcSignalReason reason,
  			   BackendId backendId);
+ extern bool SetProcSignalReason(pid_t pid, ProcSignalReason reason,
+ 			   BackendId backendId);
  
  extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
  
*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
***************
*** 254,259 **** static const PQconninfoOption PQconninfoOptions[] = {
--- 254,262 ----
  	{"replication", NULL, NULL, NULL,
  	"Replication", "D", 5},
  
+ 	{"standby_name", NULL, NULL, NULL,
+ 	"Standby-Name", "D", 64},
+ 
  	/* Terminating entry --- MUST BE LAST */
  	{NULL, NULL, NULL, NULL,
  	NULL, NULL, 0}
***************
*** 613,618 **** fillPGconn(PGconn *conn, PQconninfoOption *connOptions)
--- 616,623 ----
  #endif
  	tmp = conninfo_getval(connOptions, "replication");
  	conn->replication = tmp ? strdup(tmp) : NULL;
+ 	tmp = conninfo_getval(connOptions, "standby_name");
+ 	conn->standbyName = tmp ? strdup(tmp) : NULL;
  }
  
  /*
***************
*** 2622,2627 **** freePGconn(PGconn *conn)
--- 2627,2634 ----
  		free(conn->dbName);
  	if (conn->replication)
  		free(conn->replication);
+ 	if (conn->standbyName)
+ 		free(conn->standbyName);
  	if (conn->pguser)
  		free(conn->pguser);
  	if (conn->pgpass)
*** a/src/interfaces/libpq/fe-exec.c
--- b/src/interfaces/libpq/fe-exec.c
***************
*** 2002,2007 **** PQnotifies(PGconn *conn)
--- 2002,2010 ----
  /*
   * PQputCopyData - send some data to the backend during COPY IN
   *
+  * This function can be called by walreceiver even during COPY OUT
+  * to send a message to the master.
+  *
   * Returns 1 if successful, 0 if data could not be sent (only possible
   * in nonblock mode), or -1 if an error occurs.
   */
***************
*** 2010,2016 **** PQputCopyData(PGconn *conn, const char *buffer, int nbytes)
  {
  	if (!conn)
  		return -1;
! 	if (conn->asyncStatus != PGASYNC_COPY_IN)
  	{
  		printfPQExpBuffer(&conn->errorMessage,
  						  libpq_gettext("no COPY in progress\n"));
--- 2013,2020 ----
  {
  	if (!conn)
  		return -1;
! 	if (conn->asyncStatus != PGASYNC_COPY_IN &&
! 		conn->asyncStatus != PGASYNC_COPY_OUT)
  	{
  		printfPQExpBuffer(&conn->errorMessage,
  						  libpq_gettext("no COPY in progress\n"));
*** a/src/interfaces/libpq/fe-protocol3.c
--- b/src/interfaces/libpq/fe-protocol3.c
***************
*** 1911,1916 **** build_startup_packet(const PGconn *conn, char *packet,
--- 1911,1918 ----
  		ADD_STARTUP_OPTION("database", conn->dbName);
  	if (conn->replication && conn->replication[0])
  		ADD_STARTUP_OPTION("replication", conn->replication);
+ 	if (conn->standbyName && conn->standbyName[0])
+ 		ADD_STARTUP_OPTION("standby_name", conn->standbyName);
  	if (conn->pgoptions && conn->pgoptions[0])
  		ADD_STARTUP_OPTION("options", conn->pgoptions);
  	if (conn->send_appname)
*** a/src/interfaces/libpq/libpq-int.h
--- b/src/interfaces/libpq/libpq-int.h
***************
*** 297,302 **** struct pg_conn
--- 297,303 ----
  	char	   *fbappname;		/* fallback application name */
  	char	   *dbName;			/* database name */
  	char	   *replication;	/* connect as the replication standby? */
+ 	char	   *standbyName;	/* standby name */
  	char	   *pguser;			/* Postgres username and password, if any */
  	char	   *pgpass;
  	char	   *keepalives;		/* use TCP keepalives? */
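
For readers following the libpq side of the patch, the handling of the new standby_name option can be sketched in isolation. This is a minimal stand-alone illustration, not code from the patch: MockConn, dup_or_null, add_startup_option, build_options and free_mock_conn are hypothetical names standing in for PGconn, the strdup() calls in fillPGconn, the ADD_STARTUP_OPTION macro, build_startup_packet and freePGconn. It assumes only the option-list layout of the version 3 startup message (null-terminated name/value pairs followed by one terminating zero byte) and leaves out the length word and protocol version.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the two PGconn fields this sketch needs. */
typedef struct MockConn
{
	char	   *replication;	/* "replication" option, or NULL */
	char	   *standbyName;	/* "standby_name" option, or NULL */
} MockConn;

/* Copy an optional value; mirrors the "tmp ? strdup(tmp) : NULL" idiom. */
static char *
dup_or_null(const char *value)
{
	size_t		len;
	char	   *copy;

	if (value == NULL)
		return NULL;
	len = strlen(value) + 1;
	copy = malloc(len);
	if (copy != NULL)
		memcpy(copy, value, len);
	return copy;
}

/*
 * Append one "name\0value\0" pair at offset pos and return the new offset.
 * With buf == NULL nothing is written; only the size is accumulated.
 */
static size_t
add_startup_option(char *buf, size_t pos, const char *name, const char *value)
{
	size_t		nlen = strlen(name) + 1;
	size_t		vlen = strlen(value) + 1;

	if (buf != NULL)
	{
		memcpy(buf + pos, name, nlen);
		memcpy(buf + pos + nlen, value, vlen);
	}
	return pos + nlen + vlen;
}

/*
 * Emit the option list, skipping options that are unset or empty.
 * Returns the number of bytes needed, including the final terminator.
 */
static size_t
build_options(const MockConn *conn, char *buf)
{
	size_t		pos = 0;

	if (conn->replication && conn->replication[0])
		pos = add_startup_option(buf, pos, "replication", conn->replication);
	if (conn->standbyName && conn->standbyName[0])
		pos = add_startup_option(buf, pos, "standby_name", conn->standbyName);
	if (buf != NULL)
		buf[pos] = '\0';		/* an empty name ends the list */
	return pos + 1;
}

/* Release both strings; free(NULL) is a no-op, so no guards are needed. */
static void
free_mock_conn(MockConn *conn)
{
	free(conn->replication);
	free(conn->standbyName);
	conn->replication = NULL;
	conn->standbyName = NULL;
}
```

A caller would first call build_options(&conn, NULL) to learn the required size, allocate a buffer of that size, and then call build_options again to fill it. The point of the sketch is that an unset or empty standby_name adds nothing to the packet, so a connection that does not use the option sends the same option list as before.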

#57

Erik Rijkers

er@xs4all.nl

over 15 years ago

In reply to: Fujii Masao (#55)

Re: Synchronous replication - patch status inquiry

On Wed, September 15, 2010 11:58, Fujii Masao wrote:

On Wed, Sep 15, 2010 at 6:38 AM, David Fetter <david@fetter.org> wrote:

Now that the latch patch is in, when do you think you'll be able to use it
instead of the poll loop?

Here is the updated version, which uses a latch in communication from
walsender to backend. I've not changed the others, because walsender
already uses a latch in HEAD, and Heikki has already proposed a patch
that replaces the poll loop between walreceiver and the startup process
with a latch.

(synchrep_0915-2.patch: the patch applies cleanly; compile, check and
install all complete without problems.)

How does one enable synchronous replication with this patch?
With previous versions I could set this in the standby's recovery.conf:

replication_mode = 'recv'

but apparently that no longer works.

(Sorry, I have probably overlooked part of the discussion;
-hackers is getting too high-volume for me...)

Thanks,

Erik Rijkers

#58

Erik Rijkers

er@xs4all.nl

over 15 years ago

In reply to: Erik Rijkers (#57)

Re: Synchronous replication - patch status inquiry

Never mind... I see that standbys.conf is now used.

Sorry for the noise...

Erik Rijkers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers