Simplifying replication

Started by Josh Berkus, about 15 years ago, 50 messages
#1 Josh Berkus
josh@agliodbs.com

Robert asked me to write this up, so here it is.

It is critical that we make replication easier to set up, administrate
and monitor than it currently is. In my conversations with people, this
is more important to our users and the adoption of PostgreSQL than
synchronous replication is.

First, I'm finding myself constantly needing to tutor people on how to
set up replication. The mere fact that it requires a minimum 1-hour
class to explain how to use it, or a 10-page tutorial, tells us it's too
complex. As further evidence, Bruce and I explained binary replication
to several MySQL geeks at OpenSQLCamp last weekend, and they were
horrified at the number and complexity of the steps required. As it
currently is, binary replication is not going to win us a lot of new
users from the web development or virtualization world.

I had to write it up a couple of times; I started with a critique of the
various current commands and options, but that seemed to miss the point.
So instead, let me lay out how I think replication should work in my
dream world 9.1:

1. Any postgresql standalone server can become a replication master
simply by enabling replication connections in pg_hba.conf. No other
configuration is required, and no server restart is required.

2. Should I choose to adjust master configuration, for say performance
reasons, most replication variables (including ones like
wal_keep_segments) should be changeable without a server restart.

3. I can configure a standby by copying the same postgresql.conf on the
master. I only have to change a single configuration variable (the
primary_conninfo, or maybe a replication_mode setting) in order to start
the server in standby mode. GUCs which apply only to masters are ignored.

4. I can start a new replica off the master by running a single
command-line utility on the standby and giving it connection information
to the master. Using this connection, it should be able to start a
backup snapshot, copy the entire database and any required logs, and
then come up in standby mode. All that should be required for this is
one or two high-port connections to the master. No recovery.conf file is
required, or exists. (A sketch of the manual steps this would replace
appears after this list.)

5. I can cause the standby to fail over with a single command to the
failover server. If this is a trigger file, then postgresql.conf already
has a default path to the trigger file, so that this does not require
reconfiguration and a restart of the standby at crisis time. Ideally, I
use a "pg_failover" command or something similar.

6. Should I decide to make the standby the new master, this should also
be possible with a single command and a one-line configuration on the
other standbys. To aid this, we have an easy way to tell which standbys
in a group are most "caught up". If I try to promote the wrong standby
(it's behind or somehow incompatible), it should fail with an
appropriate message.

7. Should I choose to use archive files as well as streaming
replication, the utilities to manage them (such as pg_archivecleanup and
pg_standby) are built and installed with PostgreSQL by default, and do
not require complex settings with escape codes.
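
For contrast, here is a rough sketch of the kind of manual procedure
item 4 would replace on a 9.0-era system. It is only an illustration,
assuming psycopg2 on the standby: the hostnames, paths, and rsync
invocation are placeholders, and a real clone needs more care
(tablespaces, error handling, and so on).

# Hypothetical sketch of today's manual standby-clone procedure.
# Assumes the master already has wal_level=hot_standby, max_wal_senders>0,
# wal_keep_segments set high enough, and a replication entry in pg_hba.conf.
import subprocess
import psycopg2

MASTER = "master.example.com"             # placeholder hostname
PGDATA = "/var/lib/postgresql/9.0/main"   # placeholder data directory

conn = psycopg2.connect(host=MASTER, dbname="postgres", user="postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_start_backup('clone', true)")   # begin backup mode
subprocess.check_call(["rsync", "-a", "--delete",
                       "--exclude=pg_xlog/*", "--exclude=postmaster.pid",
                       "%s:%s/" % (MASTER, PGDATA), PGDATA + "/"])
cur.execute("SELECT pg_stop_backup()")                  # end backup mode

# Hand-write recovery.conf so the copy comes up as a streaming standby.
with open(PGDATA + "/recovery.conf", "w") as f:
    f.write("standby_mode = 'on'\n")
    f.write("primary_conninfo = 'host=%s user=replication'\n" % MASTER)

subprocess.check_call(["pg_ctl", "-D", PGDATA, "start"])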

That's my vision of "simple replication". It is also 100% achievable.
We just have to prioritize ease of use over having, and requiring the
user to set, 1,000 little knobs.

Speaking of knobs .... (next message)

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#2 Brendan Jurd
direvus@gmail.com
In reply to: Josh Berkus (#1)
Re: Simplifying replication

On 19 October 2010 11:16, Josh Berkus <josh@agliodbs.com> wrote:

4. I can start a new replica off the master by running a single command-line
utility on the standby and giving it connection information to the master.
 Using this connection, it should be able to start a backup snapshot, copy
the entire database and any required logs, and then come up in standby mode.
 All that should be required for this is one or two highport connections to
the master.  No recovery.conf file is required, or exists.

Having just configured replication for the first time with 9.0, I
agree in general with your whole message, but in particular, I want to
give a double-thumbs-up to the above.

Resolving this one item would subtract a great deal of pain -- and
potential for error -- from the process.

Cheers,
BJ

#3 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Josh Berkus (#1)
Re: Simplifying replication

Hi,

Josh Berkus <josh@agliodbs.com> writes:

It is critical that we make replication easier to set up, administrate and
monitor than it currently is. In my conversations with people, this is more
important to our users and the adoption of PostgreSQL than synchronous
replication is.

I want to say a big big +1 here. The way replication and PITR setup are
implemented now is a very good prototype; it's time to consolidate and
get to something usable by normal people, as opposed to full-time
PostgreSQL geeks.

Well, the current setup offers lots of flexibility which we'd better not
lose in the process, but the simple setup simply does not exist yet.

1. Any postgresql standalone server can become a replication master simply
by enabling replication connections in pg_hba.conf. No other configuration
is required, and no server restart is required.

That sounds as simple as changing the default wal_level to hot_standby,
and the default max_wal_senders to non-zero.

2. Should I choose to adjust master configuration, for say performance
reasons, most replication variables (including ones like wal_keep_segments)
should be changeable without a server restart.

Anybody know how difficult that is without having to spend lots of time
studying the source code with the question in mind?

3. I can configure a standby by copying the same postgresql.conf on the
master. I only have to change a single configuration variable (the
primary_conninfo, or maybe a replication_mode setting) in order to start the
server in standby mode. GUCs which apply only to masters are ignored.

4. I can start a new replica off the master by running a single command-line
utility on the standby and giving it connection information to the master.
Using this connection, it should be able to start a backup snapshot, copy
the entire database and any required logs, and then come up in standby mode.
All that should be required for this is one or two highport connections to
the master. No recovery.conf file is required, or exists.

There's a prototype to stream a base backup from a libpq connection; I
think someone here wanted to integrate that into the replication
protocol itself. It should be doable with a simple libpq connection and
all automated.

The pg_basebackup python client software is 100 lines of code. It's
mainly a recursive query to get the list of files within the master,
then two server-side functions to get binary file chunks, compressed.
Then, client side, a loop to decompress and write the chunks in the
right place. That's it.

http://github.com/dimitri/pg_basebackup/blob/master/pg_basebackup.py
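
As a very rough sketch of that shape, assuming hypothetical server-side
helpers equivalent to the ones the script installs
(basebackup_file_list() and basebackup_read_chunk() are not built-in
functions), the client loop is essentially:

# Minimal sketch of the client-side loop described above; the two
# server-side functions are assumed to exist on the master, as in the
# script, and are not part of PostgreSQL itself.
import os
import zlib
import psycopg2

CHUNK = 8 * 1024 * 1024   # 8 MB chunks, an arbitrary choice

conn = psycopg2.connect(host="master", dbname="postgres", user="postgres")
cur = conn.cursor()

cur.execute("SELECT path, size FROM basebackup_file_list()")  # recursive listing
for path, size in cur.fetchall():
    dest = os.path.join("/var/lib/postgresql/standby", path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as out:
        offset = 0
        while offset < size:
            cur.execute("SELECT basebackup_read_chunk(%s, %s, %s)",
                        (path, offset, CHUNK))
            compressed = cur.fetchone()[0]
            out.write(zlib.decompress(bytes(compressed)))  # write chunk in place
            offset += CHUNK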

I could prepare a patch given some advice on the replication protocol
integration. For one, is streaming a base backup something that
walsender should care about?

5. I can to cause the standby to fail over with a single command to the
failover server. If this is a trigger file, then it already has a default
path to the trigger file in postgresql.conf, so that this does not require
reconfiguration and restart of the standby at crisis time. Ideally, I use a
"pg_failover" command or something similar.

This feature is in walmgr.py from Skytools and it's something necessary
to have in -core now that we have failover standby capacity. Much
agreed, and the pg_failover command is a good idea.

BTW, do we have a clear idea of how to implement pg_ping, and should it
report the current WAL location(s) of a standby?

6. Should I decide to make the standby the new master, this should also be
possible with a single command and a one-line configuration on the other
standbys. To aid this, we have an easy way to tell which standby in a group
are most "caught up". If I try to promote the wrong standby (it's behind or
somehow incompatible), it should fail with an appropriate message.

That needs a way to define a group of standbys. There's nothing there
that makes them know about each other. That could fall out of the
automated registration of them in a shared catalog on the master, with
this shared catalog spread over (hard-coded) asynchronous replication
(sync == disaster here). But there's no agreement on this feature yet.

Then you need a way to organise them in groups in this shared catalog,
and you need to ask your network admins to make it so that they can
communicate with each other.

Now say we have pg_ping (or another tool) returning the current recv,
applied and synced LSNs; it would be possible for any standby to figure
out which other ones must be shot in case you fail over here. The
failover command could list those other standbys in the group that
you're behind of, and with a force command allow you to still fail over
to this one. Now you have to STONITH the ones listed, but that's your
problem after all.
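
For instance, a minimal sketch of how an external tool could rank
standbys by replay position today, assuming it can connect to each of
them (pg_last_xlog_replay_location() is the real 9.0 function; the
hostnames and everything else here are illustrative):

# Sketch: pick the standby with the highest replayed LSN.
import psycopg2

STANDBYS = ["standby1", "standby2", "standby3"]   # placeholder hostnames

def replay_lsn(host):
    conn = psycopg2.connect(host=host, dbname="postgres", user="postgres")
    cur = conn.cursor()
    cur.execute("SELECT pg_last_xlog_replay_location()")
    hi, lo = cur.fetchone()[0].split("/")   # LSN text form is 'X/Y'
    conn.close()
    return (int(hi, 16), int(lo, 16))       # compare as a tuple

best = max(STANDBYS, key=replay_lsn)
print("most caught-up standby:", best)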

Then, of course, any standby that's not in the same group as the one
that you failed over to has to be checked and resynced.

7. Should I choose to use archive files as well as streaming replication,
the utilities to manage them (such as pg_archivecleanup and pg_standby) are
built and installed with PostgreSQL by default, and do not require complex
settings with escape codes.

Now that PITR has been in for a long enough time, we *need* to take it
to the next step integration-wise. By that I mean that we have to
support internal commands and provide reasonable default implementations
of the different scripts needed (in portable C, hence "internal").

There are too many pitfalls in this part of the setup for us to
seriously document them all and expect people to come up with bash or
perl implementations that avoid them all. That used to be good enough,
but Josh is right, we need to get even better!

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#4 Greg Smith
greg@2ndquadrant.com
In reply to: Josh Berkus (#1)
Re: Simplifying replication

Josh Berkus wrote:

It is critical that we make replication easier to set up, administrate
and monitor than it currently is. In my conversations with people,
this is more important to our users and the adoption of PostgreSQL
than synchronous replication is.

You should enjoy one of the patches we're furiously working on then,
which is aiming at some of the administration and monitoring pieces
here. I have my own grand vision of how easy replication should be to
setup too. Visions and plans are nice, but building functional pieces
of them and delivering them to the community is what actually moves
PostgreSQL forward. So far, multiple people have done that for sync
rep, and what we're supposed to be focused on at this stage in the
development cycle is finishing the work related to the open CommitFest
item that includes that.

I find this launch into a new round of bike-shedding a bit distracting.
If you want this to be easier to use, which it obviously should be to
any observer because what's delivered in 9.0 is way too complicated,
please work on finding development resources to assign to that problem.
Because that's the bottleneck on simplifying things, not ideas about
what to do. I would recommend finding or assigning a developer to work
on integrating base backup into the streaming protocol as the biggest
single thing that would improve the built-in replication. All of the
rest of the trivia about what knobs to set and such are tiny details
that make for only a minor improvement until that's taken care of.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

#5 Josh Berkus
josh@agliodbs.com
In reply to: Greg Smith (#4)
Re: Simplifying replication

You should enjoy one of the patches we're furiously working on then,
which is aiming at some of the administration and monitoring pieces
here.

Great, glad to hear it! Would you be willing to go into detail?

I have my own grand vision of how easy replication should be to
setup too.

So, share it. I'd look forward to hearing it, especially since your
vision probably takes synch rep and quorum commit into account, which
mine doesn't. If not here, then on your blog.

Visions and plans are nice, but building functional pieces of
them and delivering them to the community is what actually moves
PostgreSQL forward.

*shrug*. Robert asked me to write it up for the list based on the
discussions around synch rep. Now you're going to bash me for doing so?

Many of the goals I described will mean removing knobs and changing
defaults, or even foregoing fine-grained control entirely. If we don't
have agreement that simplifying replication is a high-priority goal,
then it won't happen; anyone submitting a patch will be
this-or-that-use-cased to death and will give up.

For that matter, I'm not sure that everyone agrees that simplification
is a worthwhile goal. For example, somewhere between 9.0beta4 and final
release, someone changed the defaults for max_wal_senders and
hot_standby to "0" and "off". I don't remember there even being
discussion about it.

The discussion around synch rep certainly showed that the "natural"
tendency of this list is to add complexity with each incarnation of a
feature. It's the easiest way to accommodate conflicting use cases, but
it's not the best way.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#6 Greg Smith
greg@2ndquadrant.com
In reply to: Josh Berkus (#5)
Re: Simplifying replication

Josh Berkus wrote:

*shrug*. Robert asked me to write it up for the list based on the
discussions around synch rep. Now you're going to bash me for doing so?

Sorry, next time I'll make sure to bash Robert too. I don't have any
problems with the basic ideas you're proposing, just concerns about when
the right time to get into that whole giant subject is and who is going
to work on it.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

#7 Josh Berkus
josh@agliodbs.com
In reply to: Dimitri Fontaine (#3)
Re: Simplifying replication

Dimitri, Greg,

I want to say a big big +1 here. The way replication and PITR setup are
implemented now are a very good prototype, it's time to consolidate and
get to something usable by normal people, as opposed to PostgreSQL full
time geeks.

Well, one thing to be addressed is separating the PITR functionality
from replication. PITR needs a lot of features -- timelines, recovery
stop points, etc. -- which replication doesn't need or want. I think
that focussing on streaming replication functionality and ignoring the
archive logs case is probably the best way to logically separate these
two. Presumably anyone who needs archive logs as well will be a
professional DBA.

I could prepare a patch given some advice on the replication protocol
integration. For one, is streaming a base backup something that
walsender should care about?

Yeah, I thought there was a prototype for this somewhere. From a user
perspective, using a 2nd pgport connection for the initial clone is
fine. I don't know if we want to worry about it otherwise from a
resource management perspective; presumably the cloning process is going
to be a pretty big performance hit on the master.

BTW, do we have a clear idea of how to implement pg_ping, and should it
reports current WAL location(s) of a standby?

pg_ping?

That needs a way to define a group of standby. There's nothing there
that makes them know about each other.

Let me clarify. I meant that if I try to make a *single* standby point
to a new master, and that new master was behind the standby when it
failed over, then the attempt to remaster should fail with an error.

I do *not* want to get into standby groups. That way lies madness. ;-)

Now say we have pg_ping (or another tool) returning the current recv,
applied and synced LSNs, it would be possible for any standby to figure
out which other ones must be shot in case you failover here. The
failover command could list those other standby in the group that you're
behind of, and with a force command allow you to still failover to this
one. Now you have to STONITH the one listed, but that's your problem
after all.

The LSN isn't enough; as others have pointed out, we have a fairly
serious failure case if a standby comes up as a master, accepts
transactions, and then we try to remaster a 2nd standby which was
actually ahead of the first standby at the time of master failure. I
haven't seen a solution posted to that yet; maybe I missed it?

Sorry, next time I'll make sure to bash Robert too. I don't have any
problems with the basic ideas you're proposing, just concerns about when
the right time to get into that whole giant subject is and who is going
to work on it.

If not now, when? The 2nd CommitFest is almost complete. If we're
going to make any substantial changes, we need to have patches for the
3rd commitfest. And I didn't see anyone discussing simplification until
I brought it up.

I don't realistically think that we're going to get 100% simplification
for 9.1. But it would be nice to at least get some components, which
means getting agreement on how things should work, at least roughly.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#8 Greg Stark
gsstark@mit.edu
In reply to: Josh Berkus (#7)
Re: Simplifying replication

On Tue, Oct 19, 2010 at 9:16 AM, Josh Berkus <josh@agliodbs.com> wrote:

Well, one thing to be addressed is separating the PITR functionality from
replication.  PITR needs a lot of features -- timelines, recovery stop
points, etc. -- which replication doesn't need or want.  I think that
focussing on streaming replication functionality and ignoring the archive
logs case is probably the best way to logically separate these two.
 Presumably anyone who needs archive logs as well will be a professional
DBA.

The way things stand you *always* need archived logs. Even if you have
streaming set up it might try to use archived logs if it falls too far
behind.

Also all the features PITR needs are needed by replication as well.
Recovery stop points are absolutely critical. Otherwise if your
replica crashed it would have to start over from the original clone
time and replay all logs since then.

Timelines are not as obvious but perhaps that's our own mistake. When
you fail over to your replica shouldn't the new master get a new
timelineid? Isn't that the answer to the failure case when a slave
finds it's ahead of the master? If it has already replayed logs from a
different timelineid in the same lsn range then it can't switch
timelines to follow the new master. But if it hasn't then it can.
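
Conceptually, the rule would be something like the following sketch;
none of this bookkeeping is exposed by the server today, it is only
meant to state the condition Greg describes:

# Can this standby follow a newly promoted master on timeline new_tli?
# It can only if it has not already replayed WAL past the point where
# the new timeline branched off the old one.
def can_follow(new_tli, branch_point_lsn, standby_tli, standby_replay_lsn):
    if standby_tli == new_tli:
        return True                       # already on the new timeline
    # Still on the old timeline: OK only if we have not replayed beyond
    # the LSN at which the new master diverged.
    return standby_replay_lsn <= branch_point_lsn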

--
greg

#9 Josh Berkus
josh@agliodbs.com
In reply to: Greg Stark (#8)
Re: Simplifying replication

Greg,

The way things stand you *always* need archived logs. Even if you have
streaming set up it might try to use archived logs if it falls too far
behind.

Actually, you don't. If you're willing to accept possible
desynchronization and recloning of the standbys, then you can skip the
archive logs.

Timelines are not as obvious but perhaps that's our own mistake. When
you fail over to your replica shouldn't the new master get a new
timelineid? Isn't that the answer to the failure case when a slave
finds it's ahead of the master? If it has already replayed logs from a
different timelineid in the same lsn range then it can't switch
timelines to follow the new master. But if it hasn't then it can.

Oh? Do we have this information (i.e. what LSNs are associated with
which timeline)?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#10 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Josh Berkus (#7)
Re: Simplifying replication

Josh Berkus <josh@agliodbs.com> writes:

Well, one thing to be addressed is separating the PITR functionality from
replication. PITR needs a lot of features -- timelines, recovery stop
points, etc. -- which replication doesn't need or want. I think that
focussing on streaming replication functionality and ignoring the archive
logs case is probably the best way to logically separate these two.
Presumably anyone who needs archive logs as well will be a professional DBA.

So, I've been thinking some more about this.

We now have two modes of operation when starting up a PostgreSQL
cluster: either it's a primary/master or it's a standby (in recovery or
hot_standby). What I think would make sense here would be to add another
mode of operation, archiving.

A cluster that is archiving will accept replication connections and will
accept WAL files streamed to it, which it will put in its archive
directory, defaulting to $PGDATA/pg_xlog_archive. It should also be able
to take a base backup from its primary server, maybe with a new pg_ctl
command. The base backup location defaults to
$PGDATA/pg_basebackup/$label.

Then, it would also accept replication connections the other way around,
starting a walsender to publish its archive or its base backup, so that
you could prepare a new standby against the archive server, then switch
the primary_conninfo to the master and that's it.

Now, the archive and restore commands in this setup would be internal
commands pg_walsender and pg_walreceiver. That would mean we need to
extend those special backends to be able to send and receive a full WAL
file, but it sounds simple enough, right?

Then we could add support for those to compress the WAL files before
streaming, as we're talking "internal commands" here it sounds easy
enough (we're already linking against gzip I think).

The base backup streaming support would be there for the archiving mode
but also for the normal standby "empty $PGDATA" initial start up.

Comments?
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#11 Robert Treat
rob@xzilla.net
In reply to: Greg Smith (#4)
Re: Simplifying replication

On Tue, Oct 19, 2010 at 11:16 AM, Greg Smith <greg@2ndquadrant.com> wrote:

Josh Berkus wrote:

It is critical that we make replication easier to set up, administrate and
monitor than it currently is. In my conversations with people, this is more
important to our users and the adoption of PostgreSQL than synchronous
replication is.

<snip>

I find this launch into a new round of bike-shedding a bit distracting. If
you want this to be easier to use, which it's obvious to any observer it
should be because what's delivered in 9.0 is way too complicated, please
work on finding development resources to assign to that problem. Because
that's the bottleneck on simplifying things, not ideas about what to do. I
would recommend finding or assigning a developer to work on integrating base
backup in to the streaming protocol as the biggest single thing that would
improve the built-in replication. All of the rest of the trivia about what
knobs to set and such are tiny details that make for only a minor
improvement until that's taken care of.

Yeah, I'm sure we all think it should be easier, but figuring out what that
means is certainly a moving target. The idea of being able to create a base
backup automagically sounds good, but comparatively it's
not significantly more difficult than what many other systems make you do,
and actually if done incorrectly could be something rather limiting. On the
whole the customers we are talking with are far more concerned about things
like managing failover scenarios when you have multiple slaves, and it's the
lack of capabilities around those kinds of things that hurt postgres
adoption much more than it being hard to set up.

Robert Treat
play: http://www.xzilla.net
work: http://omniti.com/is/hiring

#12 Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Josh Berkus (#1)
Re: Simplifying replication

On 19/10/10 13:16, Josh Berkus wrote:

Robert asked me to write this up, so here it is.

It is critical that we make replication easier to set up, administrate
and monitor than it currently is. In my conversations with people,
this is more important to our users and the adoption of PostgreSQL
than synchronous replication is.

First, I'm finding myself constantly needing to tutor people on how to
set up replication. The mere fact that it requires a minimum 1-hour
class to explain how to use it, or a 10-page tutorial, tells us it's
too complex. As further evidence, Bruce and I explained binary
replication to several MySQL geeks at OpenSQLCamp last weekend, and
they were horrified at the number and complexity of the steps
required. As it currently is, binary replication is not going to win
us a lot of new users from the web development or virtualization world.

+1

I've been having the same experience - how to set this up and do
failover and failback etc. occupies quite a bit of time in courses I've
been teaching here in NZ and Australia. Making this whole replication
business much simpler is definitely the way to go.

A good example of how simple it can be is mongodb, where it is
essentially one command to set up a two-replica system with a voting
arbiter:

$ mongo

rs.initiate(
  {
    _id : "replication_set0",
    members : [
      { _id : 0, host : "192.168.2.100" },
      { _id : 1, host : "192.168.2.101" },
      { _id : 2, host : "192.168.2.103", arbiterOnly : true }
    ]
  }
)

#13 Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#9)
Re: Simplifying replication

Josh Berkus wrote:

Greg,

The way things stand you *always* need archived logs. Even if you have
streaming set up it might try to use archived logs if it falls too far
behind.

Actually, you don't. If you're willing to accept possible
desynchronization and recloning of the standbys, then you can skip the
archive logs.

Agreed, but as a reality check: when I proposed that wal_keep_segments
= -1 would keep all WAL segments (for use while the file system was
being backed up), I was told administrators should compute how much free
disk space they had. Obviously ease of use is not our #1 priority.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#13)
Re: Simplifying replication

On Thu, Oct 21, 2010 at 8:22 PM, Bruce Momjian <bruce@momjian.us> wrote:

Josh Berkus wrote:

Greg,

The way things stand you *always* need archived logs. Even if you have
streaming set up it might try to use archived logs if it falls too far
behind.

Actually, you don't.  If you're willing to accept possible
desynchronization and recloning of the standbys, then you can skip the
archive logs.

Agreed, but as a reality check:  when I proposed that wal_keep_segments
= -1 would keep all WAL segments (for use while the file system was
being backed up), I was told administrators should compute how much free
disk space they had.  Obviously ease of use is not our #1 priority.

Amen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#13)
Re: Simplifying replication

Agreed, but as a reality check: when I proposed that wal_keep_segments
= -1 would keep all WAL segments (for use while the file system was
being backed up), I was told administrators should compute how much free
disk space they had. Obviously ease of use is not our #1 priority.

Depends. Running out of disk space isn't exactly user-friendly either.
And detecting how much free space is available would be a painful bit
of platform-dependent code ...

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#16 Greg Stark
gsstark@mit.edu
In reply to: Josh Berkus (#15)
Re: Simplifying replication

On Thu, Oct 21, 2010 at 5:46 PM, Josh Berkus <josh@agliodbs.com> wrote:

Agreed, but as a reality check:  when I proposed that wal_keep_segments
= -1 would keep all WAL segments (for use while the file system was
being backed up), I was told administrators should compute how much free
disk space they had.  Obviously ease of use is not our #1 priority.

Depends.  Running out of disk space isn't exactly user-friendly either.
 And detecting how much free space is available would be a painful bit
of platform-dependent code ...

Nor can we assume we're the only thing using disk space.

However the user-unfriendliness isn't the fact that administrators
need to determine how much disk they're willing to dedicate to
Postgres. The user-unfriendliness is that they then have to specify
this in terms of WAL log files and also have to know that we sometimes
keep more than that and so on.

We've done a good job in the past of converting GUC variables to
meaningful units for administrators and users but it's an ongoing
effort. If we need a GUC to control the amount of disk space we use it
should be in units of MB/GB/TB. If we need a GUC for controlling how
much WAL history to keep for recovering standbys or replicas then it
should be specified in units of time.

Units like "number of wal files" or worse in the case of
checkpoint_segments "number of wal files / 2 - 1" or something like
that.... are terrible. They require arcane knowledge for the
administrator to have a clue how to set.

--
greg

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#16)
Re: Simplifying replication

On Thu, Oct 21, 2010 at 8:52 PM, Greg Stark <gsstark@mit.edu> wrote:

On Thu, Oct 21, 2010 at 5:46 PM, Josh Berkus <josh@agliodbs.com> wrote:

Agreed, but as a reality check:  when I proposed that wal_keep_segments
= -1 would keep all WAL segments (for use while the file system was
being backed up), I was told administrators should compute how much free
disk space they had.  Obviously ease of use is not our #1 priority.

Depends.  Running out of disk space isn't exactly user-friendly either.
 And detecting how much free space is available would be a painful bit
of platform-dependant code ...

Nor can we assume we're the only thing using disk space.

However the user-unfriendliness isn't the fact that administrators
need to determine how much disk they're willing to dedicate to
Postgres. The user-unfriendliness is that they then have to specify
this in terms of WAL log files and also have to know that we sometimes
keep more than that and so on.

We've done a good job in the past of converting GUC variables to
meaningful units for administrators and users but it's an ongoing
effort. If we need a GUC to control the amount of disk space we use it
should be in units of MB/GB/TB. If we need a GUC for controlling how
much WAL history to keep for recovering standbys or replicas then it
should be specified in units of time.

Units like "number of wal files" or worse in the case of
checkpoint_segments "number of wal files / 2 - 1" or something like
that.... are terrible. They require arcane knowledge for the
administrator to have a clue how to set.

Very true. But the lack of a -1 setting for wal_keep_segments means
that if you would like to take a backup without archiving, you must
set wal_keep_segments to a value greater than or equal to the rate at
which you generate WAL segments multiplied by the time it takes you to
run a backup. If that doesn't qualify as requiring arcane knowledge,
I'm mystified as to what ever could.
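
To make the arithmetic concrete, with invented numbers for the WAL rate
and backup duration:

# Back-of-the-envelope sizing for wal_keep_segments without archiving.
import math

wal_mb_per_minute = 48    # assumed WAL generation rate
backup_minutes = 90       # assumed time to copy the data directory
segment_mb = 16           # WAL segment size

needed = math.ceil(wal_mb_per_minute * backup_minutes / segment_mb)
print("wal_keep_segments >=", needed)   # 270 segments, about 4.2 GB of pg_xlog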

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#17)
Re: Simplifying replication

Robert Haas wrote:

However the user-unfriendliness isn't the fact that administrators
need to determine how much disk they're willing to dedicate to
Postgres. The user-unfriendliness is that they then have to specify
this in terms of WAL log files and also have to know that we sometimes
keep more than that and so on.

We've done a good job in the past of converting GUC variables to
meaningful units for administrators and users but it's an ongoing
effort. If we need a GUC to control the amount of disk space we use it
should be in units of MB/GB/TB. If we need a GUC for controlling how
much WAL history to keep for recovering standbys or replicas then it
should be specified in units of time.

Units like "number of wal files" or worse in the case of
checkpoint_segments "number of wal files / 2 - 1" or something like
that.... are terrible. They require arcane knowledge for the
administrator to have a clue how to set.

Very true. But the lack of a -1 setting for wal_keep_segments means
that if you would like to take a backup without archiving, you must
set wal_keep_segments to a value greater than or equal to the rate at
which you generate WAL segments multiplied by the time it takes you to
run a backup. If that doesn't qualify as requiring arcane knowledge,
I'm mystified as to what ever could.

LOL. Time machine required (both forward and backward time options).

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#19 Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#18)
Re: Simplifying replication

Very true. But the lack of a -1 setting for wal_keep_segments means
that if you would like to take a backup without archiving, you must
set wal_keep_segments to a value greater than or equal to the rate at
which you generate WAL segments multiplied by the time it takes you to
run a backup. If that doesn't qualify as requiring arcane knowledge,
I'm mystified as to what ever could.

Speaking of which, what's the relationship between checkpoint_segments
and wal_keep_segments? PG seems perfectly willing to let me set the
latter higher than the former, and it's not documented.

If checkpoint_segments were a hard limit, then we could let admins set
wal_keep_segments to -1, knowing that they'd set checkpoint_segments to
the max space they had available.

Although we might want to rename those.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#19)
Re: Simplifying replication

On Thu, Oct 21, 2010 at 9:09 PM, Josh Berkus <josh@agliodbs.com> wrote:

Very true.  But the lack of a -1 setting for wal_keep_segments means
that if you would like to take a backup without archiving, you must
set wal_keep_segments to a value greater than or equal to the rate at
which you generate WAL segments multiplied by the time it takes you to
run a backup.  If that doesn't qualify as requiring arcane knowledge,
I'm mystified as to what ever could.

Speaking of which, what's the relationship between checkpoint_segments
and wal_keep_segments?  PG seems perfectly willing to let me set the
latter higher than the former, and it's not documented.

I think it's pretty well explained in the fine manual.

http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS

If checkpoint_segments were a hard limit, then we could let admins set
wal_keep_segments to -1, knowing that they'd set checkpoint_segments to
the max space they had available.

This assumes that more checkpoint segments is always better, which
isn't true. I might have 100 GB of disk space free, but not want to
replay WAL for 4 days if I have a crash.

I do think that the current default of checkpoint_segments=3 is
pathologically insane, but that's another can of worms.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21 Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#20)
Re: Simplifying replication

I think it's pretty well explained in the fine manual.

http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS

Nope. No relationship to checkpoint_segments is explained there. Try
again?

If checkpoint_segments were a hard limit, then we could let admins set
wal_keep_segments to -1, knowing that they'd set checkpoint_segments to
the max space they had available.

This assumes that more checkpoint segments is always better, which
isn't true. I might have 100 GB of disk space free, but not want to
replay WAL for 4 days if I have a crash.

No, it assumes no such thing.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#22 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#21)
Re: Simplifying replication

On Thu, Oct 21, 2010 at 10:03 PM, Josh Berkus <josh@agliodbs.com> wrote:

I think it's pretty well explained in the fine manual.

http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS

Nope.  No relationship to checkpoint_segments is explained there.  Try
again?

Well, it says "This sets only the minimum number of segments retained
in pg_xlog; the system might need to retain more segments for WAL
archival or to recover from a checkpoint." So in other words, the
relationship with checkpoint segments is that whichever one currently
requires retaining a larger number of segments applies. That's all
the relationship there is. I'm not sure I understand the question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#23 Fujii Masao
masao.fujii@gmail.com
In reply to: Josh Berkus (#21)
Re: Simplifying replication

On Fri, Oct 22, 2010 at 11:03 AM, Josh Berkus <josh@agliodbs.com> wrote:

I think it's pretty well explained in the fine manual.

http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-KEEP-SEGMENTS

Nope.  No relationship to checkpoint_segments is explained there.  Try
again?

Please see
http://archives.postgresql.org/pgsql-docs/2010-10/msg00038.php

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#24 Josh Berkus
josh@agliodbs.com
In reply to: Fujii Masao (#23)
Re: Simplifying replication

Please see
http://archives.postgresql.org/pgsql-docs/2010-10/msg00038.php

Ye gods and little fishes!

You really want to talk arcane formulas. I've re-read that
three times, and am still not sure that I could tell someone
definitively how much disk space WAL needs for a given group of
settings. I'll also point out that that formula is not in our docs --
what's an appropriate location?

I think this needs to be corrected in 9.1, *even if it means breaking
backwards compatibility*.

What would be sensible for DBAs is to have two settings:

max_wal_size
min_wal_size

These would be expressed in MB or GB and would be simple direct
quantities, which our formulas would work backwards from. max_wal_size
would be a hard limit (i.e. Postgres would stop accepting writes if we
hit it), and admins would not be allowed to set min_wal_size to more
than max_wal_size - 2.

Even better would be to replace min_wal_size with min_wal_time, which
would set a time span for the oldest WAL segment to be kept (up to
max_wal_size - 2). Hmmm. That doesn't seem that hard to implement.
Is it?

(BTW, Robert, that e-mail is what I meant by "relationship")

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#25 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#24)
Re: Simplifying replication

Josh Berkus <josh@agliodbs.com> writes:

What would be sensible for DBAs is to have two settings:

max_wal_size
min_wal_size

[ scratches head... ] What's the functional effect of min_wal_size, exactly?

Even better would be to replace min_wal_size with min_wal_time, which
would set a time span for the oldest WAL segment to be kept (up to
max_wal_size - 2). Hmmm. That doesn't seem that hard to implement.
Is it?

Um, what happens when honoring min_wal_time conflicts with honoring
max_wal_size?

regards, tom lane

#26 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#25)
Re: Simplifying replication

max_wal_size
min_wal_size

[ scratches head... ] What's the functional effect of min_wal_size, exactly?

Replaces wal_keep_segments. The rename is to make the GUCs obviously
symmetrical, and to make it clear that the *meaning* of the variable has
changed.

Even better would be to replace min_wal_size with min_wal_time, which
would set a time span for the oldest WAL segment to be kept (up to
max_wal_size - 2). Hmmm. That doesn't seem that hard to implement.
Is it?

Um, what happens when honoring min_wal_time conflicts with honoring
max_wal_size?

When we get close enough to max_wal_size (we'll need a couple segments
of leeway, I think), we start recycling WAL segments even if they are
less than min_wal_time old. This is under the presumption that most
DBAs will prefer having the standby desynchronize to having the master
lock up due to running out of disk space. Presumably if such recycling
happens we'd also write a WARNING to the logs.
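
Roughly, the recycling decision being proposed might look like this
sketch; the names and the policy are only a restatement of the paragraph
above, and nothing here exists in PostgreSQL:

# Proposed recycling decision, per the description above.
LEEWAY_SEGMENTS = 2   # keep a couple of segments of headroom

def should_recycle(segment_age, total_wal_size,
                   min_wal_time, max_wal_size, segment_size=16 * 1024 * 1024):
    if segment_age < min_wal_time:
        # Normally keep young segments around for the standbys...
        if total_wal_size + LEEWAY_SEGMENTS * segment_size < max_wal_size:
            return False
        # ...but recycle anyway (and log a WARNING) rather than let the
        # master run out of disk space.
        return True
    return True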

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#27 Fujii Masao
masao.fujii@gmail.com
In reply to: Josh Berkus (#26)
Re: Simplifying replication

On Sat, Oct 23, 2010 at 2:34 AM, Josh Berkus <josh@agliodbs.com> wrote:

When we get close enough to max_wal_size (we'll need a couple segments of
leeway, I think), we start recycling WAL segments even if they are less that
min_wal_time old.

What happens if max_wal_size is less than checkpoint_segments?
Currently a checkpoint tries to leave the WAL files which were generated
from the prior ckpt start to the current ckpt end, because those WAL
files are required for crash recovery. But should we delete some of them
according to max_wal_size?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#28 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#17)
Re: Simplifying replication

On Thu, 2010-10-21 at 20:57 -0400, Robert Haas wrote:

Very true. But the lack of a -1 setting for wal_keep_segments means
that if you would like to take a backup without archiving, you must
set wal_keep_segments to a value greater than or equal to the rate at
which you generate WAL segments multiplied by the time it takes you to
run a backup. If that doesn't qualify as requiring arcane knowledge,
I'm mystified as to what ever could.

People are missing the point here:

You have to put the WAL files *somewhere* while you do the base backup.
PostgreSQL can't itself work out where that is, nor can it work out
ahead of time how big it will need to be, since it is up to you how you
do your base backup. Setting a parameter to -1 doesn't make the problem
go away; it just pretends and hopes it doesn't exist, but screws you
badly if you do hit the wall.

My view is that this is irresponsible, even if I share people's wish
that the problem did not exist.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#29 Josh Berkus
josh@agliodbs.com
In reply to: Fujii Masao (#27)
Re: Simplifying replication

What happens if max_wal_size is less than checkpoint_segments?
Currently a checkpoint tries to leave WAL files which were generated
from the prior ckpt start to current ckpt end. Because those WAL files
are required for crash recovery. But we should delete some of them
according to max_wal_size?

The idea is that max_wal_size would *replace* checkpoint_segments. The
checkpoint_segments setting is baffling to most PG DBAs.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#30 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#28)
Re: Simplifying replication

On Tue, Oct 26, 2010 at 8:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-10-21 at 20:57 -0400, Robert Haas wrote:

Very true.  But the lack of a -1 setting for wal_keep_segments means
that if you would like to take a backup without archiving, you must
set wal_keep_segments to a value greater than or equal to the rate at
which you generate WAL segments multiplied by the time it takes you to
run a backup.  If that doesn't qualify as requiring arcane knowledge,
I'm mystified as to what ever could.

People are missing the point here:

You have to put the WAL files *somewhere* while you do the base backup.
PostgreSQL can't itself work out where that is, nor can it work out
ahead of time how big it will need to be, since it is up to you how you
do your base backup. Setting a parameter to -1 doesn't make the problem
go away, it just pretends and hopes it doesn't exist, but screws you
badly if you do hit the wall.

If you set wal_keep_segments=0, archive_mode=on, and
archive_command=<something>, you might run out of disk space.

If you set wal_keep_segments=-1, you might run out of disk space.

Are you any more screwed in the second case than you are in the first
case? Why?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31 Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#30)
Re: Simplifying replication

If you set wal_keep_segments=0, archive_mode=on, and
archive_command=<something>, you might run out of disk space.

If you set wal_keep_segments=-1, you might run out of disk space.

Are you any more screwed in the second case than you are in the first
case?

It is the same to the user either way. In either case you have to
change some settings and restart the master.

Well, for the archive case, you could conceivably mass-delete the
archive files.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#32 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#31)
Re: Simplifying replication

On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote:

If you set wal_keep_segments=0, archive_mode=on, and
archive_command=<something>, you might run out of disk space.

If you set wal_keep_segments=-1, you might run out of disk space.

Are you any more screwed in the second case than you are in the first
case?

It is the same to the user either way.  In either case you have to
change some settings and restart the master.

Except that changing wal_keep_segments doesn't require restarting the master.

The point of allowing -1 was to allow someone to set it to that value
temporarily, to be able to do a hot backup without having to guess how
large to set it. If you don't have enough disk space for a backup to
complete, you're kind of hosed either way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#32)
Re: Simplifying replication

On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote:

On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote:

If you set wal_keep_segments=0, archive_mode=on, and
archive_command=<something>, you might run out of disk space.

If you set wal_keep_segments=-1, you might run out of disk space.

Are you any more screwed in the second case than you are in the first
case?

It is the same to the user either way. In either case you have to
change some settings and restart the master.

Except that changing wal_keep_segments doesn't require restarting the master.

The point of allowing -1 was to allow someone to set it to that value
temporarily, to be able to do a hot backup without having to guess how
large to set it. If you don't have enough disk space for a backup to
complete, you're kind of hosed either way.

You're not hosed either way. Fujii designed this carefully to avoid that
and it works. The case of archive_command failing isn't comparable
because that is a failure case, not a normal working server.

You don't need to guess the setting of wal_keep_segments. It's a safety
net that has been deliberately created to avoid the crash that would
otherwise happen. I've not heard a better proposal, yet, though I too am
hopeful there is a better one.

This is all described in my new book on PostgreSQL Administration,
available from the link below. I'm told that everything you need is also
in the docs.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#34 Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#32)
Re: Simplifying replication

It is the same to the user either way. In either case you have to
change some settings and restart the master.

Except that changing wal_keep_segments doesn't require restarting the master.

Our docs say that it does:
This parameter can only be set in the postgresql.conf file or on the
server command line.

http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#35 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Josh Berkus (#34)
Re: Simplifying replication

Josh Berkus <josh@agliodbs.com> wrote:

Except that changing wal_keep_segments doesn't require restarting
the master.

Our docs say that it does:
This parameter can only be set in the postgresql.conf file or on
the server command line.

That sounds as though a reload would do it; I don't see that
indicating that a restart is needed.

-Kevin

#36 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#35)
Re: Simplifying replication

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Josh Berkus <josh@agliodbs.com> wrote:

Except that changing wal_keep_segments doesn't require restarting
the master.

Our docs say that it does:
This parameter can only be set in the postgresql.conf file or on
the server command line.

That sounds as though a reload would do it; I don't see that
indicating that a restart is needed.

That is, in fact, our standard boilerplate wording for SIGHUP
parameters.

regards, tom lane

#37 Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#28)
Re: Simplifying replication

You have to put the WAL files *somewhere* while you do the base backup.
PostgreSQL can't itself work out where that is, nor can it work out
ahead of time how big it will need to be, since it is up to you how you
do your base backup. Setting a parameter to -1 doesn't make the problem
go away, it just pretends and hopes it doesn't exist, but screws you
badly if you do hit the wall.

Agreed. That's why I like the idea of having a
max_wal_size/min_wal_time instead of keep_wal_segments or
checkpoint_segments. It's relatively simple for a DBA to know how much
disk space s/he has for WAL, total, before locking up the system.

And to answer Robert's question, because now I understand what he was
getting at. The reason we want a min_wal_time is because we don't want
to keep a larger WAL around always. If more WAL were always better,
then we'd only need max_wal_size and we'd only recycle when we hit it.
Instead, we'd recycle whenever we passed min_wal_time. That's why I
said that I was assuming nothing of the sort.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#38 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#33)
Re: Simplifying replication

On Wed, Oct 27, 2010 at 3:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote:

On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote:

If you set wal_keep_segments=0, archive_mode=on, and
archive_command=<something>, you might run out of disk space.

If you set wal_keep_segments=-1, you might run out of disk space.

Are you any more screwed in the second case than you are in the first
case?

It is the same to the user either way.  In either case you have to
change some settings and restart the master.

Except that changing wal_keep_segments doesn't require restarting the master.

The point of allowing -1 was to allow someone to set it to that value
temporarily, to be able to do a hot backup without having to guess how
large to set it.  If you don't have enough disk space for a backup to
complete, you're kind of hosed either way.

You're not hosed either way. Fujii designed this carefully to avoid that
and it works. The case of archive_command failing isn't comparable
because that is a failure case, not a normal working server.

You don't need to guess the setting of wal_keep_segments. It's a safety
net that has been deliberately created to avoid the crash that would
otherwise happen. I've not heard a better proposal, yet, though I too am
hopeful there is a better one.

I think you might be confused about the use case Bruce and I are
imagining, because this doesn't make any sense at all in that context.
The specific use case is that you have archive_mode=off,
wal_level=archive or wal_level=hot_standby, and you want to take a hot
backup. If you do pg_start_backup(), copy the data directory, and do
pg_stop_backup(), you won't necessarily end up with enough xlog to
reach a consistent state. To do that, you must copy all the WAL files
that exist after pg_stop_backup() has completed, but you may not
actually be able to do that, because they might get recycled before
you can copy them. You can fix this by temporarily increasing
max_wal_segments to a sufficiently large value, and then dropping it
back down to whatever you like after you've copied the files out of
pg_xlog. If -1 isn't an option, you can always use a million, which
is almost certainly large enough to be safe. But what you definitely
DON'T want to do is temporarily set wal_keep_segments to a value like
7, because if you should manage to go through 8 segments while running
the backup, you may find that you now have an unusable backup. And
you may not realize this until you try to restore it.

I wonder if we should document this procedure. I see that it is not
in the docs at present, and it might be useful to somebody.
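
A sketch of that procedure, assuming wal_keep_segments has already been
raised by editing postgresql.conf and reloading (paths are placeholders,
and the data-directory copy itself is elided):

# Sketch of the backup-without-archiving procedure described above.
import shutil
import psycopg2

conn = psycopg2.connect(dbname="postgres", user="postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_start_backup('standalone')")
# ... copy the data directory, excluding pg_xlog, by whatever means ...
cur.execute("SELECT pg_stop_backup()")

# Only now copy the contents of pg_xlog: every segment written between
# pg_start_backup() and this point is needed to reach consistency, and
# the raised wal_keep_segments is what keeps them from being recycled.
shutil.copytree("/var/lib/postgresql/9.0/main/pg_xlog", "/backups/pg_xlog")
# Afterwards, drop wal_keep_segments back down and reload.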

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#39 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#37)
Re: Simplifying replication

On Wed, Oct 27, 2010 at 5:01 PM, Josh Berkus <josh@agliodbs.com> wrote:

You have to put the WAL files *somewhere* while you do the base backup.
PostgreSQL can't itself work out where that is, nor can it work out
ahead of time how big it will need to be, since it is up to you how you
do your base backup. Setting a parameter to -1 doesn't make the problem
go away, it just pretends and hopes it doesn't exist, but screws you
badly if you do hit the wall.

Agreed.  That's why I like the idea of having a
max_wal_size/min_wal_time instead of keep_wal_segments or
checkpoint_segments.  It's relatively simple for a DBA to know how much
disk space s/he has for WAL, total, before locking up the system.

And to answer Robert's question, because now I understand what he was
getting at.  The reason we want a min_wal_time is because we don't want
to keep a larger WAL around always.  If more WAL were always better,
then we'd only need max_wal_size and we'd only recycle when we hit it.
Instead, we'd recycle whenever we passed max_wal_time.  That's why I
said that I was assuming nothing of the sort.

I sort of agree with you that the current checkpoint_segments
parameter is a bit hard to tune, at least if your goal is to control
the amount of disk space that will be used by WAL files. But I'm not
sure your proposal is better. Instead of having a complicated formula
for predicting how much disk space would get used by a given value for
checkpoint_segments, we'd have a complicated formula for the amount of
WAL that would force a checkpoint based on max_wal_size.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#40 Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#39)
Re: Simplifying replication

I sort of agree with you that the current checkpoint_segments
parameter is a bit hard to tune, at least if your goal is to control
the amount of disk space that will be used by WAL files. But I'm not
sure your proposal is better. Instead of having a complicated formula
for predicting how much disk space would get used by a given value for
checkpoint_segments, we'd have a complicated formula for the amount of
WAL that would force a checkpoint based on max_wal_size.

Yes, but the complicated formula would then be *in our code* instead of
being inflicted on the user, as it now is.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#41 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#40)
Re: Simplifying replication

On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote:

I sort of agree with you that the current checkpoint_segments
parameter is a bit hard to tune, at least if your goal is to control
the amount of disk space that will be used by WAL files.  But I'm not
sure your proposal is better.  Instead of having a complicated formula
for predicting how much disk space would get used by a given value for
checkpoint_segments, we'd have a complicated formula for the amount of
WAL that would force a checkpoint based on max_wal_size.

Yes, but the complicated formula would then be *in our code* instead of
being inflicted on the user, as it now is.

I don't think so - I think it will just be inflicted on the user in a
different way. We'd still have to document what the formula is,
because people will want to understand how often a checkpoint is going
to get forced.

So here's an example of how this could happen. Someone sets
max_wal_size = 480MB. Then, they hear about the
checkpoint_completion_target parameter, and say, ooh, goody, let me
boost that. So they raise it from 0.5 to 0.9. Now, all of a sudden,
they're getting more frequent checkpoints. Performance may get worse
rather than better. To figure out what value for max_wal_size forces
a checkpoint after the same amount of WAL that forced a checkpoint
before, they need to work backwards from max_wal_size to
checkpoint_segments, and then work forward again to figure out the new
value for the max_wal_size parameter.

Here's the math. max_wal_size = 480MB = 30 segments. With
checkpoint_completion_target = 0.5, that means that
checkpoint_segments is (30 - 1) / (2 + 0.5) = 11 (rounded down). With
checkpoint_completion_target = 0.9, that means they'll need to set
max_wal_size to (2 + 0.9) * 11 + 1 = 33 (rounded up) * 16MB = 528MB.
Whew!
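
(A few lines of throwaway Python make that round trip easier to check; it's
just the documented cap of (2 + checkpoint_completion_target) *
checkpoint_segments + 1 segments of 16MB each, nothing more:)

    import math

    SEG_MB = 16

    def max_wal_mb(checkpoint_segments, target):
        # documented cap on WAL files, expressed in MB
        files = math.ceil((2 + target) * checkpoint_segments + 1)
        return int(files) * SEG_MB

    def segments_for_budget(budget_mb, target):
        # invert the cap: the largest checkpoint_segments that fits the budget
        return int((budget_mb / float(SEG_MB) - 1) / (2 + target))

    segs = segments_for_budget(480, 0.5)
    print("checkpoint_segments = %d" % segs)                     # 11
    print("cap at target 0.9 = %d MB" % max_wal_mb(segs, 0.9))   # 528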

My theory is that most tuning of checkpoint_segments is based on a
worry about recovery time or performance, not disk consumption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#42Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Josh Berkus (#37)
Re: Simplifying replication

Josh Berkus <josh@agliodbs.com> writes:

You have to put the WAL files *somewhere* while you do the base backup.

Agreed. That's why I like the idea of having a
max_wal_size/min_wal_time instead of wal_keep_segments or
checkpoint_segments. It's relatively simple for a DBA to know how much
disk space s/he has for WAL, total, before locking up the system.

What if that somewhere were as easy to set up as a PostgreSQL archive
cluster: set a GUC or two, start the server, then on the production
server have archive_mode = on and use some internal archive and restore
commands, like 'pg_archivewal -h host -p port …'?
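
To make the contrast concrete: today the docs have you wire this up with
external commands, roughly like the first block below, while the second block
is the hypothetical built-in (pg_archivewal does not exist; the syntax is just
the idea):

    # today, in postgresql.conf on the master, external tools do the work
    archive_mode = on
    archive_command = 'cp %p /mnt/server/archivedir/%f'   # or scp/rsync to another host

    # and in recovery.conf on the standby
    restore_command = 'cp /mnt/server/archivedir/%f %p'

    # the idea above: a built-in command talking to an archive cluster (hypothetical)
    archive_mode = on
    archive_command = 'pg_archivewal -h host -p port …'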

It's only pushing the problem away, but in my mind the only reason why
we're still talking about the problem is *not* the WAL-related settings
but the current complexity of setting up a trustworthy archive server,
and the number of external tools required in the operation (shell, scp,
rsync, rm, etc…).

Or is it just me?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#43Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Dimitri Fontaine (#3)
Re: Simplifying replication

On Tue, 2010-10-19 at 10:31 +0200, Dimitri Fontaine wrote:

4. I can start a new replica off the master by running a single command-line
utility on the standby and giving it connection information to the master.
Using this connection, it should be able to start a backup snapshot, copy
the entire database and any required logs, and then come up in standby mode.
All that should be required for this is one or two highport connections to
the master. No recovery.conf file is required, or exists.

There's a prototype to stream a base backup from a libpq connection; I
think someone here wanted to integrate that into the replication
protocol itself. It should be doable with a simple libpq connection,
all automated.

The pg_basebackup Python client software is 100 lines of code. It's
mainly a recursive query to get the list of files on the master, then
two server-side functions to fetch binary file chunks, compressed.
Then, on the client side, a loop decompresses the chunks and writes
them to the right place. That's it.

http://github.com/dimitri/pg_basebackup/blob/master/pg_basebackup.py
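
For readers who don't want to open the script, the client-side loop has
roughly this shape. It is only a sketch: the column list of
pg_bb_list_files() and the chunk-reading function used below are guesses,
not the actual pg_basebackup.py API:

    # rough sketch only -- see pg_basebackup.py for the real helpers
    import os
    import zlib
    import psycopg2

    conn = psycopg2.connect("host=master port=5432 user=postgres dbname=postgres")
    cur = conn.cursor()

    # recursive listing of the files to copy (columns guessed here)
    cur.execute("SELECT path, size FROM pg_bb_list_files()")
    files = cur.fetchall()

    CHUNK = 8 * 1024 * 1024
    DEST = "/var/lib/pgsql/standby"

    for path, size in files:
        dest = os.path.join(DEST, path)
        destdir = os.path.dirname(dest)
        if not os.path.isdir(destdir):
            os.makedirs(destdir)
        out = open(dest, "wb")
        offset = 0
        while offset < size:
            # fetch one compressed chunk (function name invented for this sketch)
            cur.execute("SELECT pg_bb_read_file(%s, %s, %s)", (path, offset, CHUNK))
            out.write(zlib.decompress(bytes(cur.fetchone()[0])))
            offset += CHUNK
        out.close()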

I could prepare a patch given some advice on the replication protocol
integration. For one, is streaming a base backup something that
walsender should care about?

To make pg_basebackup.py self-sufficient, it should also open a second
connection to the same master and make sure that all WAL files are
copied for the duration of the base copy.

This way you don't need to do anything extra to make sure you have
enough WAL files. And if you fail because of filling up disks, you fail
on the slave side, where it is less of a problem.

--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/

#44Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Hannu Krosing (#43)
Re: Simplifying replication

Hannu Krosing <hannu@2ndQuadrant.com> writes:

To make pg_basebackup.py self-sufficient, it should also open a second
connection to the same master and make sure that all WAL files are
copied for the duration of the base copy.

Excellent idea, will make that happen soon'ish.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#45Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Dimitri Fontaine (#44)
Re: Simplifying replication

On Sat, 2010-11-06 at 18:02 +0100, Dimitri Fontaine wrote:

Hannu Krosing <hannu@2ndQuadrant.com> writes:

To make pg_basebackup.py self-sufficient, it should also open a second
connection to the same master and make sure that all WAL files are
copied for the duration of the base copy.

Excellent idea, will make that happen soon'ish.

Until I learned better, I thought that this is how SR is supposed to work ;)

btw, as a next step you could backport this to 8.x and have most of the
benefits of SR. It should not be very hard to keep track of the WAL position
inside a pl/pythonu function and send one or more records back in the form of

(walfile_name text, start_pos int, data bytea)

and then call this function from the client every second to keep possible data
loss down to 1 sec.

This lets you set up a warm standby with nothing more than PostgreSQL
superuser access to the master (assuming the right defaults in postgresql.conf).
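
An untested sketch of what such a function might look like (the names are
invented; it assumes plpythonu is installed and a superuser connection, and
bytea handling through pre-9.0 plpythonu may need extra escaping):

    -- untested sketch of the idea above; names are invented
    CREATE TYPE wal_chunk AS (walfile_name text, start_pos int, data bytea);

    CREATE OR REPLACE FUNCTION poll_wal(last_file text, last_pos int)
    RETURNS SETOF wal_chunk AS $$
        import os
        xlogdir = 'pg_xlog'                  # relative to the data directory
        rows = []
        for name in sorted(os.listdir(xlogdir)):
            if len(name) != 24 or name < last_file:
                continue                     # not a WAL segment, or already shipped
            if name == last_file:
                start = last_pos
            else:
                start = 0
            f = open(os.path.join(xlogdir, name), 'rb')
            f.seek(start)
            chunk = f.read()
            f.close()
            if chunk:
                rows.append((name, start, chunk))
        return rows
    $$ LANGUAGE plpythonu;

    -- the client then calls something like this once a second and appends
    -- the returned chunks to its local copies of the segments:
    --   SELECT * FROM poll_wal('000000010000000000000003', 1048576);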

--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/

#46Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Hannu Krosing (#45)
Re: Simplifying replication

Hannu Krosing <hannu@2ndQuadrant.com> writes:

To make pg_basebackup.py self-sufficient, it should also open a second
connection to the same master and make sure that all WAL files are
copied for the duration of the base copy.

Done now, please have a look and try it if possible:

https://github.com/dimitri/pg_basebackup

btw, as a next step you could backport this to 8.x and have most of the
benefits of SR. It should not be very hard to keep track of the WAL position
inside a pl/pythonu function and send one or more records back in the form of

You now have the -x and -D options to set that up, but I didn't check
the backport part: it still depends on the pg_bb_list_files() function
to get the recursive listing of the pg_xlog directory, and it does that
using a WITH RECURSIVE query.

The way I did it is to only copy the (whole) WAL segment again if its ctime
has changed since the last loop. Also, pg_basebackup won't start a backup if
you run it on its own; I don't think that's what you want here.
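
Purely to illustrate the change-detection part (the real script lists and
copies over the libpq connection, not on a local filesystem, and only loops
while the base copy is running):

    # change-detection sketch only -- not the actual pg_basebackup.py code
    import os
    import shutil
    import time

    xlog_src = "/var/lib/pgsql/data/pg_xlog"      # paths assumed for illustration
    xlog_dst = "/var/lib/pgsql/standby/pg_xlog"

    seen = {}                                     # filename -> ctime at last copy
    while True:                                   # the real tool stops with the base copy
        for name in sorted(os.listdir(xlog_src)):
            path = os.path.join(xlog_src, name)
            if not os.path.isfile(path):
                continue                          # skip archive_status/ and friends
            ctime = os.stat(path).st_ctime
            if seen.get(name) != ctime:           # new segment, or a recycled one
                shutil.copy2(path, os.path.join(xlog_dst, name))
                seen[name] = ctime
        time.sleep(1)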

Oh, as I needed to fork() a process to take care of pg_xlog in a loop
while the base backup is still ongoing, I added a -j/--jobs option so
that you can hammer the master some more by having more than one process
doing the copying.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#47Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#41)
Re: Simplifying replication

Robert Haas wrote:

On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote:

I sort of agree with you that the current checkpoint_segments
parameter is a bit hard to tune, at least if your goal is to control
the amount of disk space that will be used by WAL files.  But I'm not
sure your proposal is better.  Instead of having a complicated formula
for predicting how much disk space would get used by a given value for
checkpoint_segments, we'd have a complicated formula for the amount of
WAL that would force a checkpoint based on max_wal_size.

Yes, but the complicated formula would then be *in our code* instead of
being inflicted on the user, as it now is.

I don't think so - I think it will just be inflicted on the user in a
different way. We'd still have to document what the formula is,
because people will want to understand how often a checkpoint is going
to get forced.

So here's an example of how this could happen. Someone sets
max_wal_size = 480MB. Then, they hear about the
checkpoint_completion_target parameter, and say, ooh, goody, let me
boost that. So they raise it from 0.5 to 0.9. Now, all of a sudden,
they're getting more frequent checkpoints. Performance may get worse

Uh, checkpoint_completion_target only controls flushing of buffers
between checkpoints, not the frequency of checkpoints.

It is hard to believe that, for tuning, the number of 16MB files is more
meaningful than the raw file size.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#48Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#38)
Re: Simplifying replication

Robert Haas wrote:

On Wed, Oct 27, 2010 at 3:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2010-10-26 at 22:03 -0400, Robert Haas wrote:

On Tue, Oct 26, 2010 at 9:59 PM, Josh Berkus <josh@agliodbs.com> wrote:

If you set wal_keep_segments=0, archive_mode=on, and
archive_command=<something>, you might run out of disk space.

If you set wal_keep_segments=-1, you might run out of disk space.

Are you any more screwed in the second case than you are in the first
case?

It is the same to the user either way.  In either case you have to
change some settings and restart the master.

Except that changing wal_keep_segments doesn't require restarting the master.

The point of allowing -1 was to allow someone to set it to that value
temporarily, to be able to do a hot backup without having to guess how
large to set it.  If you don't have enough disk space for a backup to
complete, you're kind of hosed either way.

You're not hosed either way. Fujii designed this carefully to avoid that
and it works. The case of archive_command failing isn't comparable
because that is a failure case, not a normal working server.

You don't need to guess the setting of wal_keep_segments. It's a safety
net that has been deliberately created to avoid the crash that would
otherwise happen. I've not heard a better proposal, yet, though I too am
hopeful there is a better one.

I think you might be confused about the use case Bruce and I are
imagining, because this doesn't make any sense at all in that context.
The specific use case is that you have archive_mode=off,
wal_level=archive or wal_level=hot_standby, and you want to take a hot
backup. If you do pg_start_backup(), copy the data directory, and do
pg_stop_backup(), you won't necessarily end up with enough xlog to

This is a clear case of protecting people from themselves (making them
specify a max WAL size) versus making the feature easy to use. We can't
have both, folks. For 9.0, we picked the former. The same tradeoff
often exists between flexibility and ease of use.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#49Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#47)
Re: Simplifying replication

On Thu, Nov 11, 2010 at 10:13 PM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote:

I sort of agree with you that the current checkpoint_segments
parameter is a bit hard to tune, at least if your goal is to control
the amount of disk space that will be used by WAL files.  But I'm not
sure your proposal is better.  Instead of having a complicated formula
for predicting how much disk space would get used by a given value for
checkpoint_segments, we'd have a complicated formula for the amount of
WAL that would force a checkpoint based on max_wal_size.

Yes, but the complicated formula would then be *in our code* instead of
being inflicted on the user, as it now is.

I don't think so - I think it will just be inflicted on the user in a
different way.  We'd still have to document what the formula is,
because people will want to understand how often a checkpoint is going
to get forced.

So here's an example of how this could happen.  Someone sets
max_wal_size = 480MB.  Then, they hear about the
checkpoint_completion_target parameter, and say, ooh, goody, let me
boost that.  So they raise it from 0.5 to 0.9.  Now, all of a sudden,
they're getting more frequent checkpoints.  Performance may get worse

Uh, checkpoint_completion_target only controls flushing of buffers
between checkpoints, not the frequency of checkpoints.

According to the formula in our fine documentation, if you increase
checkpoint_completion_target, the maximum number of WAL files also
increases. This makes sense: the files from the last checkpoint can't
be removed until further along into the next cycle. Therefore, if you
wanted to increase the checkpoint_completion_target while keeping the
maximum amount of WAL on disk the same, you'd need to trigger
checkpoints more frequently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#50Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#49)
Re: Simplifying replication

Robert Haas wrote:

On Thu, Nov 11, 2010 at 10:13 PM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

On Thu, Oct 28, 2010 at 1:13 AM, Josh Berkus <josh@agliodbs.com> wrote:

I sort of agree with you that the current checkpoint_segments
parameter is a bit hard to tune, at least if your goal is to control
the amount of disk space that will be used by WAL files.  But I'm not
sure your proposal is better.  Instead of having a complicated formula
for predicting how much disk space would get used by a given value for
checkpoint_segments, we'd have a complicated formula for the amount of
WAL that would force a checkpoint based on max_wal_size.

Yes, but the complicated formula would then be *in our code* instead of
being inflicted on the user, as it now is.

I don't think so - I think it will just be inflicted on the user in a
different way.  We'd still have to document what the formula is,
because people will want to understand how often a checkpoint is going
to get forced.

So here's an example of how this could happen.  Someone sets
max_wal_size = 480MB.  Then, they hear about the
checkpoint_completion_target parameter, and say, ooh, goody, let me
boost that.  So they raise it from 0.5 to 0.9.  Now, all of a sudden,
they're getting more frequent checkpoints.  Performance may get worse

Uh, checkpoint_completion_target only controls flushing of buffers
between checkpoints, not the frequency of checkpoints.

According to the formula in our fine documentation, if you increase
checkpoint_completion_target, the maximum number of WAL files also
increases. This makes sense: the files from the last checkpoint can't
be removed until further along into the next cycle. Therefore, if you
wanted to increase the checkpoint_completion_target while keeping the
maximum amount of WAL on disk the same, you'd need to trigger
checkpoints more frequently.

Do we recycle WAL files between checkpoints or just at checkpoint time?
I thought it was only at checkpoint time.

Also, there was talk that a larger WAL directory would slow recovery,
but I thought it was only the time since the last checkpoint that
controlled that.

[ Again, sorry for my late reading of this and other threads. ]

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +