Streaming Replication patch for CommitFest 2009-09

Started by Fujii Masao, over 16 years ago. 41 messages, pgsql-hackers.
#1Fujii Masao
masao.fujii@gmail.com

Hi,

Here is the latest version of Streaming Replication (SR) patch.

There were four major problems in the SR patch which was submitted for
the last CommitFest. The latest patch has overcome those problems:

1. Change the way synchronization is done when standby connects to
primary. After authentication, standby should send a message to primary,
stating the <begin> point (where <begin> is an XLogRecPtr, not a WAL
segment name). Primary starts streaming WAL starting from that point,
and keeps streaming forever. pg_read_xlogfile() needs to be removed.

In the latest version, the standby first attempts archive recovery for as long
as WAL records are available in pg_xlog or the archival area (the latter only
possible if restore_command is supplied). When it hits a recovery error
(e.g., no WAL file is available), it starts the walreceiver process, which
requests the primary server to ship the WAL records following the last applied
record. The primary then sends the WAL records continuously. OTOH, the
standby continuously receives, writes and replays them.
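As a toy illustration of this handshake, the sketch below (made-up names, nothing from the actual patch; integer positions stand in for XLogRecPtr values) shows the key property: the standby requests an exact start position, not a WAL segment name, and the primary streams everything from that point on:

```python
# Toy model of the connection handshake described above. All names here
# (WAL, primary_stream_from, standby_connect) are illustrative.

WAL = {i: f"record-{i}" for i in range(10)}  # primary's WAL, keyed by position

def primary_stream_from(begin):
    """Primary side: stream every record at or after the requested point."""
    return [WAL[pos] for pos in sorted(WAL) if pos >= begin]

def standby_connect(last_applied):
    """Standby side: archive recovery replayed up to `last_applied`, so
    request streaming to resume at the very next record."""
    return primary_stream_from(last_applied + 1)

# The standby replayed records 0-6 from pg_xlog/the archive, so the
# primary ships 7, 8 and 9, then keeps streaming as new WAL appears.
print(standby_connect(6))  # ['record-7', 'record-8', 'record-9']
```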

2. The primary should have no business reading back from the archive.
The standby can read from the archive, as it can today.

I got rid of the capability to restore archived files from the primary. Also,
so that a WAL file still required by the standby is not lost from pg_xlog
before it has been sent, I tweaked the recycling policy of checkpoints.

3. Need to support multiple WALSenders. While multiple slave support
isn't 1st priority right now, it's not acceptable that a new WALSender
can't connect while one is active already. That can cause trouble in
case of network problems etc.

In the latest version, more than one standby can establish a connection to
the primary, and the WAL is shipped to each of those standbys concurrently.
The maximum number of standbys can be specified as a GUC variable
(max_wal_senders: better name?).

4. It is not acceptable that normal backends have to wait for walsender
to send data. That means that connecting a standby behind a slow
connection to the primary can grind the primary to a halt. walsender
needs to be able to read data from disk, not just from shared memory. (I
raised this back in December
http://archives.postgresql.org/message-id/495106FA.1050605@enterprisedb.com)

In the latest version, the walsender reads the WAL records from disk
instead of from wal_buffers. So when a backend attempts to evict old data
from wal_buffers to insert new records, it doesn't need to wait until
walsender has read that data.
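A minimal sketch of that decoupling (all names assumed, not the actual PostgreSQL data structures): the backend can recycle wal_buffers pages freely because walsender only ever reads WAL that has already been flushed to disk:

```python
# Toy model of point 4: backends never wait for walsender, because
# walsender reads the on-disk WAL, not shared wal_buffers.
from collections import deque

disk = []              # WAL records already flushed to disk
wal_buffers = deque()  # small in-memory ring of recent records
BUFFER_PAGES = 4

def backend_insert(record):
    """Backend appends to wal_buffers; when full, the oldest record is
    flushed to disk immediately -- no waiting on walsender."""
    if len(wal_buffers) == BUFFER_PAGES:
        disk.append(wal_buffers.popleft())
    wal_buffers.append(record)

def walsender_read(sent_upto):
    """Walsender ships whatever is durable on disk past `sent_upto`."""
    return disk[sent_upto:]

for rec in range(10):
    backend_insert(rec)
print(walsender_read(0))  # [0, 1, 2, 3, 4, 5] -- flushed so far
```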

As a hint, I think you'll find it a lot easier if you implement only
asynchronous replication at first. That reduces the amount of
inter-process communication a lot. You can then add synchronous
capability in a later commitfest. I would also suggest that for point 4,
you implement WAL sender so that it *only* reads from disk at first, and
only add the capability to send from wal_buffers later on, and only if
performance testing shows that it's needed.

I am advancing development of SR in stages, as Heikki suggested.
So note that the current patch provides only the core part of *asynchronous*
log-shipping. There are many TODO items for later CommitFests:
synchronous capability, more useful statistics for SR, some features for
admins, and so on.

The attached tarball contains several files. A description of each file,
a brief procedure to set up SR and a functional overview of it are on the wiki.
And I'm going to add a description of the design of SR to the wiki as much
as possible.
http://wiki.postgresql.org/wiki/Streaming_Replication

If you notice anything, please feel free to comment!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

SR_0914.tgz (application/x-gzip)
#2Greg Smith
gsmith@gregsmith.com
In reply to: Fujii Masao (#1)
Re: Streaming Replication patch for CommitFest 2009-09

This is looking really neat now, making async replication really solid
first before even trying to move on to sync is the right way to go here
IMHO. I just cleaned up the docs on the Wiki page, when this patch is
closer to being committed I officially volunteer to do the same on the
internal SGML docs; someone should nudge me when the patch is at that
point if I don't take care of it before then.

Putting on my DBA hat for a minute, the first question I see people asking
is "how do I measure how far behind the slaves are?". Presumably you can
get that out of pg_controldata; my first question is whether that's
complete enough information? If not, what else should be monitored?

I don't think running that program is going to fly for a production-quality
integrated replication setup, though. The UI that admins are going to want
would allow querying this easily via a standard database query. Most
monitoring systems can issue psql queries but not necessarily run a remote
binary. I think that parts of pg_controldata need to get exposed via
some number of built-in UDFs instead, and whatever new internal state
makes sense too. I could help out writing those, if someone more familiar
with the replication internals can help me nail down a spec on what to
watch.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Greg Smith (#2)
Re: Streaming Replication patch for CommitFest 2009-09

Greg Smith wrote:

Putting on my DBA hat for a minute, the first question I see people
asking is "how do I measure how far behind the slaves are?". Presumably
you can get that out of pg_controldata; my first question is whether
that's complete enough information? If not, what else should be monitored?

I don't think running that program is going to fly for a production-quality
integrated replication setup, though. The UI that admins are going to want
would allow querying this easily via a standard database query. Most
monitoring systems can issue psql queries but not necessarily run a
remote binary. I think that parts of pg_controldata need to get
exposed via some number of built-in UDFs instead, and whatever new
internal state makes sense too. I could help out writing those, if
someone more familiar with the replication internals can help me nail
down a spec on what to watch.

Yep, assuming for a moment that hot standby goes into 8.5, status
functions that return such information are the natural interface. It
should be trivial to write them as soon as hot standby and streaming
replication are in place.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Greg Smith (#2)
Re: Streaming Replication patch for CommitFest 2009-09

Greg Smith wrote:

This is looking really neat now, making async replication really solid
first before even trying to move on to sync is the right way to go
here IMHO.

I agree with both of those sentiments.

One question I have is what level of traffic is involved between the
master and the slave. I know a number of people have found the traffic
involved in shipping log files to be a pain, and thus we get things
like pglesslog.

cheers

andrew

#5Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#2)
Re: Streaming Replication patch for CommitFest 2009-09

Greg Smith <gsmith@gregsmith.com> wrote:

Putting on my DBA hat for a minute, the first question I see people
asking is "how do I measure how far behind the slaves are?".
Presumably you can get that out of pg_controldata; my first question
is whether that's complete enough information? If not, what else
should be monitored?

I don't think running that program is going to fly for a production-quality
integrated replication setup, though. The UI that admins are
going to want would allow querying this easily via a standard
database query. Most monitoring systems can issue psql queries but
not necessarily run a remote binary. I think that parts of
pg_controldata need to get exposed via some number of built-in UDFs
instead, and whatever new internal state makes sense too. I could
help out writing those, if someone more familiar with the
replication internals can help me nail down a spec on what to watch.

IMO, it would be best if the status could be sent via NOTIFY. In my
experience, this results in monitoring which both has less overhead
and is more current. We tend to be almost as interested in metrics on
throughput as lag. Backlogged volume can be interesting, too, if it's
available.

-Kevin

#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kevin Grittner (#5)
Re: Streaming Replication patch for CommitFest 2009-09

Kevin Grittner wrote:

Greg Smith <gsmith@gregsmith.com> wrote:

I don't think running that program is going to fly for a production-quality
integrated replication setup, though. The UI that admins are
going to want would allow querying this easily via a standard
database query. Most monitoring systems can issue psql queries but
not necessarily run a remote binary. I think that parts of
pg_controldata need to get exposed via some number of built-in UDFs
instead, and whatever new internal state makes sense too. I could
help out writing those, if someone more familiar with the
replication internals can help me nail down a spec on what to watch.

IMO, it would be best if the status could be sent via NOTIFY.

To where?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Heikki Linnakangas (#6)
Re: Streaming Replication patch for CommitFest 2009-09

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

Kevin Grittner wrote:

IMO, it would be best if the status could be sent via NOTIFY.

To where?

To registered listeners?

I guess I should have worded that as "it would be best if a change in
replication status could be signaled via NOTIFY" -- does that satisfy,
or am I missing your point entirely?

-Kevin

#8Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#1)
Re: Streaming Replication patch for CommitFest 2009-09

Fujii Masao wrote:

Here is the latest version of Streaming Replication (SR) patch.

The first thing that caught my eye is that I don't think "replication"
should be a real database. Rather, it should be a keyword in
pg_hba.conf, like the existing "all", "sameuser", "samerole" keywords
that you can put into the database column.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#1)
Re: Streaming Replication patch for CommitFest 2009-09

On Mon, 2009-09-14 at 20:24 +0900, Fujii Masao wrote:

The latest patch has overcome those problems:

Well done. I hope to look at it myself in a few days time.

--
Simon Riggs www.2ndQuadrant.com

#10Fujii Masao
masao.fujii@gmail.com
In reply to: Greg Smith (#2)
Re: Streaming Replication patch for CommitFest 2009-09

Hi,

On Tue, Sep 15, 2009 at 12:47 AM, Greg Smith <gsmith@gregsmith.com> wrote:

Putting on my DBA hat for a minute, the first question I see people asking
is "how do I measure how far behind the slaves are?".  Presumably you can
get that out of pg_controldata; my first question is whether that's complete
enough information?  If not, what else should be monitored?

Currently the progress of replication is shown only in the ps display. So the
following three steps are necessary to measure the gap between the servers.

1. Execute pg_current_xlog_location() to check how far the primary has
written WAL.
2. Execute 'ps' to check how far the standby has written WAL.
3. Compare the above results.

This is very messy. A more user-friendly monitoring feature is necessary,
and developing one is a TODO item for a later CommitFest.
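The comparison in step 3 is just a subtraction of two WAL locations of the form 'hi/lo' (two hex halves of a 64-bit position). A sketch of the arithmetic (the helper names are assumptions, not built-ins):

```python
# Computing replication lag in bytes from two WAL locations, e.g. as
# returned by pg_current_xlog_location() on the primary and whatever
# the standby reports. Helper names are illustrative only.

def xlog_location_to_bytes(loc):
    """Convert a 'hi/lo' hex WAL location to an absolute byte position."""
    hi, lo = loc.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(primary_loc, standby_loc):
    """Bytes of WAL the standby has yet to catch up on."""
    return xlog_location_to_bytes(primary_loc) - xlog_location_to_bytes(standby_loc)

print(replication_lag_bytes('0/38000000', '0/37FD0000'))  # 196608
```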

I'm thinking of something like pg_standbys_xlog_location(), which returns
one row per standby server, showing the pid of the walsender, the host name/
port number/user OID of the standby, and the location up to which the standby
has written/flushed WAL. A DBA can measure the gap with one query on the
primary, combining pg_current_xlog_location() and pg_standbys_xlog_location().
Thoughts?

But the problem might be what happens after the primary has fallen
down. The current write location of the primary can then no longer be
checked via pg_current_xlog_location(), and might need to be calculated
from the WAL files on the primary. Is a tool which performs such a
calculation necessary?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#11Fujii Masao
masao.fujii@gmail.com
In reply to: Andrew Dunstan (#4)
Re: Streaming Replication patch for CommitFest 2009-09

Hi,

On Tue, Sep 15, 2009 at 1:06 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

One question I have is what level of traffic is involved between the
master and the slave. I know a number of people have found the traffic
involved in shipping log files to be a pain, and thus we get things like
pglesslog.

It is almost the same as the WAL write traffic on the primary. In fact,
the contents of the WAL files written on the standby are exactly the same as
those on the primary. Currently SR provides no capability to compress
the traffic. Should we introduce something like
walsender_hook/walreceiver_hook to cooperate with an add-on compression
program like pglesslog?

If you always use PITR instead of normal recovery, full_page_writes = off
might be another solution.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#12Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#8)
Re: Streaming Replication patch for CommitFest 2009-09

Hi,

On Tue, Sep 15, 2009 at 2:54 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

The first thing that caught my eye is that I don't think "replication"
should be a real database. Rather, it should be a keyword in
pg_hba.conf, like the existing "all", "sameuser", "samerole" keywords
that you can put into the database column.

I'll try that! It might only be necessary to prevent walsender from accessing
pg_database and checking whether the target database is present, in
InitPostgres().

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kevin Grittner (#7)
Re: Streaming Replication patch for CommitFest 2009-09

Kevin Grittner wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

Kevin Grittner wrote:

IMO, it would be best if the status could be sent via NOTIFY.

To where?

To registered listeners?

I guess I should have worded that as "it would be best if a change in
replication status could be signaled via NOTIFY" -- does that satisfy,
or am I missing your point entirely?

Ok, makes more sense now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#1)
Re: Streaming Replication patch for CommitFest 2009-09

After playing with this a little bit, I think we need logic in the slave
to reconnect to the master if the connection is broken for some reason,
or can't be established in the first place. At the moment, that is
considered as the end of recovery, and the slave starts up. You have the
trigger file mechanism to stop that, but it only gives you a chance to
manually kill and restart the slave before it chooses a new timeline and
starts up, it doesn't reconnect automatically.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#14)
Re: Streaming Replication patch for CommitFest 2009-09

Hi,

On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

After playing with this a little bit, I think we need logic in the slave
to reconnect to the master if the connection is broken for some reason,
or can't be established in the first place. At the moment, that is
considered as the end of recovery, and the slave starts up. You have the
trigger file mechanism to stop that, but it only gives you a chance to
manually kill and restart the slave before it chooses a new timeline and
starts up, it doesn't reconnect automatically.

I was thinking that the automatic reconnection capability is a TODO item
for a later CF. The infrastructure for it has already been introduced in the
current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
postmaster/walreceiver.c). This is the maximum number of times to retry
starting walreceiver. In the current version this is a fixed value, but we can
make it user-configurable (a recovery.conf parameter is suitable, I think).

Also, a parameter like retries_interval might be necessary, indicating the
interval between reconnection attempts.
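The retry policy those two knobs would control can be sketched like this (connect_with_retries and the parameter names are working assumptions, not code from the patch):

```python
# Sketch of walreceiver reconnection with a bounded retry count
# (cf. MAX_WALRCV_RETRIES) and a configurable pause between attempts.
import time

def connect_with_retries(try_connect, max_retries=3, retries_interval=0.0):
    """Try to (re)establish the connection to the primary, giving up
    only after max_retries attempts, sleeping retries_interval seconds
    between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return try_connect()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: report the failure
            time.sleep(retries_interval)

# A primary that refuses the first two attempts, then accepts:
attempts = []
def flaky_primary():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("primary not reachable")
    return "connected"

print(connect_with_retries(flaky_primary))  # connected
```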

Do you think that these parameters should be introduced right now, or in
a later CF?

BTW, these parameters are provided in MySQL replication.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#16Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#15)
Re: Streaming Replication patch for CommitFest 2009-09

Hi,

On Wed, Sep 16, 2009 at 11:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I was thinking that the automatic reconnection capability is a TODO item
for a later CF. The infrastructure for it has already been introduced in the
current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
postmaster/walreceiver.c). This is the maximum number of times to retry
starting walreceiver. In the current version this is a fixed value, but we can
make it user-configurable (a recovery.conf parameter is suitable, I think).

Also, a parameter like retries_interval might be necessary, indicating the
interval between reconnection attempts.

Do you think that these parameters should be introduced right now, or in
a later CF?

I updated the TODO list on the wiki, and marked the items that I'm going to
develop for later CommitFests.
http://wiki.postgresql.org/wiki/Streaming_Replication#Todo_and_Claim

Do you have any other TODO items? How high a priority are they?
And is there an already-listed TODO item which should be developed right
now (CommitFest 2009-09)?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#15)
Re: Streaming Replication patch for CommitFest 2009-09

Fujii Masao wrote:

On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

After playing with this a little bit, I think we need logic in the slave
to reconnect to the master if the connection is broken for some reason,
or can't be established in the first place. At the moment, that is
considered as the end of recovery, and the slave starts up. You have the
trigger file mechanism to stop that, but it only gives you a chance to
manually kill and restart the slave before it chooses a new timeline and
starts up, it doesn't reconnect automatically.

I was thinking that the automatic reconnection capability is a TODO item
for a later CF. The infrastructure for it has already been introduced in the
current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
postmaster/walreceiver.c). This is the maximum number of times to retry
starting walreceiver. In the current version this is a fixed value, but we can
make it user-configurable (a recovery.conf parameter is suitable, I think).

Ah, I see.

Robert Haas suggested a while ago that walreceiver could be a
stand-alone utility, not requiring postmaster at all. That would allow
you to set up streaming replication as another way to implement WAL
archiving. Looking at how the processes interact, there really isn't
much communication between walreceiver and the rest of the system, so
that sounds pretty attractive.

Walreceiver only needs access to shared memory so that it can tell the
startup process how far it has replicated already. Even when we add the
synchronous capability, I don't think we need any more inter-process
communication. Only if we wanted to acknowledge to the master when a
piece of WAL log has been successfully replayed, the startup process
would need to tell walreceiver about it, but I think we're going to
settle for acknowledging when a piece of log has been fsync'd to disk.

Walreceiver is really a slave to the startup process. The startup
process decides when it's launched, and it's the startup process that
then waits for it to advance. But the way it's set up at the moment, the
startup process needs to ask the postmaster to start it up, and it
doesn't look very robust to me. For example, if launching walreceiver
fails for some reason, the startup process will just hang waiting for it.

I'm thinking that walreceiver should be a stand-alone program that the
startup process launches, similar to how it invokes restore_command in
PITR recovery. Instead of using system(), though, it would use
fork+exec, and a pipe to communicate.
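A rough sketch of that launch sequence (the walreceiver command line here is a stand-in, not a real binary): the startup process spawns the program with fork+exec and reads its progress reports back over a pipe:

```python
# Startup-process side of the proposed design: launch walreceiver as a
# separate program and talk to it over a pipe. subprocess.Popen does
# fork+exec under the hood on Unix.
import subprocess
import sys

def launch_walreceiver(argv):
    """fork+exec the walreceiver program, with its stdout connected to a
    pipe so the startup process can read replication progress from it."""
    return subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)

# Stand-in child process that reports how far it has replicated:
child = launch_walreceiver(
    [sys.executable, "-c", "print('replayed 0/38000000')"])
progress = child.stdout.readline().strip()
child.wait()
print(progress)  # replayed 0/38000000
```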

Also, when we get around to implement the "fetch base backup
automatically via the TCP connection" feature, we can't use walreceiver
as it is now for that, because there's no hope of starting up the system
that far without a base backup. I'm not sure if it can or should be
merged with the walreceiver program, but it can't be a postmaster child
process, that's for sure.

Thoughts?

Also a parameter like retries_interval might be necessary. This parameter
indicates the interval between each reconnection attempt.

Yeah, maybe, although a hard-coded interval of a few seconds should be
enough to get us started.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#18Magnus Hagander
magnus@hagander.net
In reply to: Heikki Linnakangas (#17)
Re: Streaming Replication patch for CommitFest 2009-09

On Thu, Sep 17, 2009 at 10:08, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Fujii Masao wrote:

On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

After playing with this a little bit, I think we need logic in the slave
to reconnect to the master if the connection is broken for some reason,
or can't be established in the first place. At the moment, that is
considered as the end of recovery, and the slave starts up. You have the
trigger file mechanism to stop that, but it only gives you a chance to
manually kill and restart the slave before it chooses a new timeline and
starts up, it doesn't reconnect automatically.

I was thinking that the automatic reconnection capability is a TODO item
for a later CF. The infrastructure for it has already been introduced in the
current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
postmaster/walreceiver.c). This is the maximum number of times to retry
starting walreceiver. In the current version this is a fixed value, but we can
make it user-configurable (a recovery.conf parameter is suitable, I think).

Ah, I see.

Robert Haas suggested a while ago that walreceiver could be a
stand-alone utility, not requiring postmaster at all. That would allow
you to set up streaming replication as another way to implement WAL
archiving. Looking at how the processes interact, there really isn't
much communication between walreceiver and the rest of the system, so
that sounds pretty attractive.

Yes, that would be very very useful.

Walreceiver is really a slave to the startup process. The startup
process decides when it's launched, and it's the startup process that
then waits for it to advance. But the way it's set up at the moment, the
startup process needs to ask the postmaster to start it up, and it
doesn't look very robust to me. For example, if launching walreceiver
fails for some reason, the startup process will just hang waiting for it.

I'm thinking that walreceiver should be a stand-alone program that the
startup process launches, similar to how it invokes restore_command in
PITR recovery. Instead of using system(), though, it would use
fork+exec, and a pipe to communicate.

Not having looked at all into the details, that sounds like a nice
improvement :-)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#19Csaba Nagy
nagy@ecircle-ag.com
In reply to: Heikki Linnakangas (#17)
Re: Streaming Replication patch for CommitFest 2009-09

On Thu, 2009-09-17 at 10:08 +0200, Heikki Linnakangas wrote:

Robert Haas suggested a while ago that walreceiver could be a
stand-alone utility, not requiring postmaster at all. That would allow
you to set up streaming replication as another way to implement WAL
archiving. Looking at how the processes interact, there really isn't
much communication between walreceiver and the rest of the system, so
that sounds pretty attractive.

Just a small comment in this direction: what if the archive were
itself a postgres DB, which would collect the WALs in some special
place (together with some metadata, snapshots, etc.), and then a slave
could connect to it just like to any other master? (Except maybe it
could specify which snapshot to start with and possibly choose
between different archived WAL streams.)

Maybe what I'm saying is completely stupid, but I see the archive as
just another form of postgres server, with the same protocol from the
POV of a slave. While I don't have the knowledge to implement such a thing, I
thought it might be interesting as an idea while discussing the
walsender/receiver interface...

Cheers,
Csaba.

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#1)
Re: Streaming Replication patch for CommitFest 2009-09

Some random comments:

I don't think we need the new PM_SHUTDOWN_3 postmaster state. We can
treat walsenders the same as the archive process, and kill and wait for
both of them to die in PM_SHUTDOWN_2 state.

I think there's something wrong with the napping in walsender. When I
perform pg_switch_xlog(), it takes surprisingly long for the WAL to trickle
to the standby. When I put a little proxy program in between the master
and slave that delays all messages from the slave to the master by one
second, it got worse, even though I would expect the master to still
keep sending WAL at full speed. I get logs like this:

2009-09-17 14:13:16.876 EEST LOG: xlog send request 0/38000000; send
0/3700006C; write 0/3700006C
2009-09-17 14:13:16.877 EEST LOG: xlog read request 0/37010000; send
0/37010000; write 0/3700006C
2009-09-17 14:13:17.077 EEST LOG: xlog send request 0/38000000; send
0/37010000; write 0/3700006C
2009-09-17 14:13:17.077 EEST LOG: xlog read request 0/37020000; send
0/37020000; write 0/3700006C
2009-09-17 14:13:17.078 EEST LOG: xlog read request 0/37030000; send
0/37030000; write 0/3700006C
2009-09-17 14:13:17.278 EEST LOG: xlog send request 0/38000000; send
0/37030000; write 0/3700006C
2009-09-17 14:13:17.279 EEST LOG: xlog read request 0/37040000; send
0/37040000; write 0/3700006C
...
2009-09-17 14:13:22.796 EEST LOG: xlog read request 0/37FD0000; send
0/37FD0000; write 0/376D0000
2009-09-17 14:13:22.896 EEST LOG: xlog send request 0/38000000; send
0/37FD0000; write 0/376D0000
2009-09-17 14:13:22.896 EEST LOG: xlog read request 0/37FE0000; send
0/37FE0000; write 0/376D0000
2009-09-17 14:13:22.896 EEST LOG: xlog read request 0/37FF0000; send
0/37FF0000; write 0/376D0000
2009-09-17 14:13:22.897 EEST LOG: xlog read request 0/38000000; send
0/38000000; write 0/376D0000
2009-09-17 14:14:09.932 EEST LOG: xlog send request 0/38000428; send
0/38000000; write 0/38000000
2009-09-17 14:14:09.932 EEST LOG: xlog read request 0/38000428; send
0/38000428; write 0/38000000

It looks like it's taking 100 or 200 ms naps in between. Also, I
wouldn't expect to see so many "read request" acknowledgments from the
slave. The master doesn't really need to know how far the slave is,
except in synchronous replication when it has requested a flush to the
slave. Another reason the master needs to know is so that it can
recycle old log files, but for that we'd really only need an
acknowledgment once per WAL file or even less.

Why does XLogSend() care about page boundaries? Perhaps it's a leftover
from the old approach that read from wal_buffers?

Do we really need the support for asynchronous backend libpq commands?
Could walsender just keep blasting WAL to the slave, and only try to
read an acknowledgment after it has requested one by setting the
XLOGSTREAM_FLUSH flag? Or maybe we should be putting the socket into
non-blocking mode.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#17)
#22Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#20)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#21)
#24Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#21)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#24)
#26Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#17)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#25)
#28Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#27)
#29Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#27)
#30Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#29)
#31Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#30)
#32Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#28)
#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#31)
#34Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#32)
#35Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#34)
#36Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#35)
#37Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#28)
#38Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#17)
#39Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#27)
#40Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fujii Masao (#38)
#41Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#40)