Streaming Replication patch for CommitFest 2009-09

Started by Fujii Masaoalmost 17 years ago41 messageshackers

masao.fujii@gmail.com

almost 17 years ago

Hi,

Here is the latest version of Streaming Replication (SR) patch.

There were four major problems in the SR patch which was submitted for
the last CommitFest. The latest patch has overcome those problems:

1. Change the way synchronization is done when standby connects to
primary. After authentication, standby should send a message to primary,
stating the <begin> point (where <begin> is an XLogRecPtr, not a WAL
segment name). Primary starts streaming WAL starting from that point,
and keeps streaming forever. pg_read_xlogfile() needs to be removed.

In the latest version, at first, the standby attempts to do an archive recovery
as long as there is WAL record available in pg_xlog or archival area (only
possible if restore_command is supplied). When it finds the recovery error
(e.g., there is no WAL file available), it starts walreceiver process, and
requests the primary server to ship the WAL records following the last applied
record. Then the primary continuously sends the WAL records. OTOH, the
standby continuously receives, writes and replays them.

2. The primary should have no business reading back from the archive.
The standby can read from the archive, as it can today.

I got rid of the capability to restore the archived file, from the
primary. Also in
order not to lose the WAL file (required for the standby) from pg_xlog before
sending it, I tweaked the recycling policy of checkpoint.

3. Need to support multiple WALSenders. While multiple slave support
isn't 1st priority right now, it's not acceptable that a new WALSender
can't connect while one is active already. That can cause trouble in
case of network problems etc.

In the latest version, more than one standbys can establish a connection to
the primary. The WAL is concurrently shipped to those standbys, respectively.
The maximum number of standbys can be specified as a GUC variable
(max_wal_senders: better name?).

4. It is not acceptable that normal backends have to wait for walsender
to send data. That means that connecting a standby behind a slow
connection to the primary can grind the primary to a halt. walsender
needs to be able to read data from disk, not just from shared memory. (I
raised this back in December
http://archives.postgresql.org/message-id/495106FA.1050605@enterprisedb.com)

In the latest version, the walsender reads the WAL records from disk
instead of wal_buffers. So when the backend attempts to delete old data
from wal_buffer to insert new one, it doesn't need to wait until walsender
has read that data from wal_buffers.

As a hint, I think you'll find it a lot easier if you implement only
asynchronous replication at first. That reduces the amount of
inter-process communication a lot. You can then add synchronous
capability in a later commitfest. I would also suggest that for point 4,
you implement WAL sender so that it *only* reads from disk at first, and
only add the capability send from wal_buffers later on, and only if
performance testing shows that it's needed.

I advance development of SR in stages as Heikki suggested.
So note that the current patch provides only core part of *asynchronous*
log-shipping. There are many TODO items for later CommitFests:
synchronous capability, more useful statistics for SR, some feature for
admin, and so on.

The attached tarball contains some files. Description of each files,
a brief procedure to set up SR and the functional overview of it are in wiki.
And, I'm going to add the description of design of SR into wiki as much
as possible.
http://wiki.postgresql.org/wiki/Streaming_Replication

If you notice anything, please feel free to comment!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Greg Smith

gsmith@gregsmith.com

almost 17 years ago

In reply to: Fujii Masao (#1)

Re: Streaming Replication patch for CommitFest 2009-09

This is looking really neat now, making async replication really solid
first before even trying to move on to sync is the right way to go here
IMHO. I just cleaned up the docs on the Wiki page, when this patch is
closer to being committed I officially volunteer to do the same on the
internal SGML docs; someone should nudge me when the patch is at that
point if I don't take care of it before then.

Putting on my DBA hat for a minute, the first question I see people asking
is "how do I measure how far behind the slaves are?". Presumably you can
get that out of pg_controldata; my first question is whether that's
complete enough information? If not, what else should be monitored?

I don't think running that program going to fly for a production quality
integrated replication setup though. The UI admins are going to want
would allow querying this easily via a standard database query. Most
monitoring systems can issue psql queries but not necessarily run a remote
binary. I think that parts of pg_controldata needs to get exposed via
some number of built-in UDFs instead, and whatever new internal state
makes sense too. I could help out writing those, if someone more familiar
with the replication internals can help me nail down a spec on what to
watch.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 17 years ago

In reply to: Greg Smith (#2)

Re: Streaming Replication patch for CommitFest 2009-09

Greg Smith wrote:

Putting on my DBA hat for a minute, the first question I see people
asking is "how do I measure how far behind the slaves are?". Presumably
you can get that out of pg_controldata; my first question is whether
that's complete enough information? If not, what else should be monitored?

I don't think running that program going to fly for a production quality
integrated replication setup though. The UI admins are going to want
would allow querying this easily via a standard database query. Most
monitoring systems can issue psql queries but not necessarily run a
remote binary. I think that parts of pg_controldata needs to get
exposed via some number of built-in UDFs instead, and whatever new
internal state makes sense too. I could help out writing those, if
someone more familiar with the replication internals can help me nail
down a spec on what to watch.

Yep, assuming for a moment that hot standby goes into 8.5, status
functions that return such information is the natural interface. It
should be trivial to write them as soon as hot standby and streaming
replication are in place.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Andrew Dunstan

andrew@dunslane.net

almost 17 years ago

In reply to: Greg Smith (#2)

Re: Streaming Replication patch for CommitFest 2009-09

Greg Smith wrote:

This is looking really neat now, making async replication really solid
first before even trying to move on to sync is the right way to go
here IMHO.

I agree with both of those sentiments.

One question I have is what is the level of traffic involved between the
master and the slave. I know numbers of people have found the traffic
involved in shipping of log files to be a pain, and thus we get things
like pglesslog.

cheers

andrew

Kevin Grittner

Kevin.Grittner@wicourts.gov

almost 17 years ago

In reply to: Greg Smith (#2)

Re: Streaming Replication patch for CommitFest 2009-09

Greg Smith <gsmith@gregsmith.com> wrote:

Putting on my DBA hat for a minute, the first question I see people
asking is "how do I measure how far behind the slaves are?".
Presumably you can get that out of pg_controldata; my first question
is whether that's complete enough information? If not, what else
should be monitored?

I don't think running that program going to fly for a production
quality integrated replication setup though. The UI admins are
going to want would allow querying this easily via a standard
database query. Most monitoring systems can issue psql queries but
not necessarily run a remote binary. I think that parts of
pg_controldata needs to get exposed via some number of built-in UDFs
instead, and whatever new internal state makes sense too. I could
help out writing those, if someone more familiar with the
replication internals can help me nail down a spec on what to watch.

IMO, it would be best if the status could be sent via NOTIFY. In my
experience, this results in monitoring which both has less overhead
and is more current. We tend to be almost as interested in metrics on
throughput as lag. Backlogged volume can be interesting, too, if it's
available.

-Kevin

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 17 years ago

In reply to: Kevin Grittner (#5)

Re: Streaming Replication patch for CommitFest 2009-09

Kevin Grittner wrote:

Greg Smith <gsmith@gregsmith.com> wrote:

I don't think running that program going to fly for a production
quality integrated replication setup though. The UI admins are
going to want would allow querying this easily via a standard
database query. Most monitoring systems can issue psql queries but
not necessarily run a remote binary. I think that parts of
pg_controldata needs to get exposed via some number of built-in UDFs
instead, and whatever new internal state makes sense too. I could
help out writing those, if someone more familiar with the
replication internals can help me nail down a spec on what to watch.

IMO, it would be best if the status could be sent via NOTIFY.

To where?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Kevin Grittner

Kevin.Grittner@wicourts.gov

almost 17 years ago

In reply to: Heikki Linnakangas (#6)

Re: Streaming Replication patch for CommitFest 2009-09

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

Kevin Grittner wrote:

IMO, it would be best if the status could be sent via NOTIFY.

To where?

To registered listeners?

I guess I should have worded that as "it would be best if a change is
replication status could be signaled via NOTIFY" -- does that satisfy,
or am I missing your point entirely?

-Kevin

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 17 years ago

In reply to: Fujii Masao (#1)

Re: Streaming Replication patch for CommitFest 2009-09

Fujii Masao wrote:

Here is the latest version of Streaming Replication (SR) patch.

The first thing that caught my eye is that I don't think "replication"
should be a real database. Rather, it should by a keyword in
pg_hba.conf, like the existing "all", "sameuser", "samerole" keywords
that you can put into the database-column.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Simon Riggs

simon@2ndQuadrant.com

almost 17 years ago

In reply to: Fujii Masao (#1)

Re: Streaming Replication patch for CommitFest 2009-09

On Mon, 2009-09-14 at 20:24 +0900, Fujii Masao wrote:

The latest patch has overcome those problems:

Well done. I hope to look at it myself in a few days time.

--
Simon Riggs www.2ndQuadrant.com

Streaming Replication patch for CommitFest 2009-09

Attachments: