Redesigning checkpoint_segments
checkpoint_segments is awkward. From an admin's point of view, it
controls two things:
1. it limits the amount of disk space needed for pg_xlog. (it's a soft
limit, but still)
2. it limits the time required to recover after a crash.
For limiting the disk space needed for pg_xlog, checkpoint_segments is
awkward because it's defined in terms of 16MB segments between
checkpoints. It takes a fair amount of arithmetic to calculate the disk
space required to hold the specified number of segments. The manual
gives the formula: (2 + checkpoint_completion_target) *
checkpoint_segments + 1, which amounts to about 1GB per 20 segments as a
rule of thumb. We shouldn't impose that calculation on the user. It
should be possible to just specify "checkpoint_segments=512MB", and the
system would initiate checkpoints so that the total size of WAL in
pg_xlog stays below 512MB.
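Spelled out, the arithmetic looks like this (a throwaway sketch using the
default values; nothing here is read from a real server):

#include <stdio.h>

int
main(void)
{
	/* illustrative defaults, not read from any server */
	double	checkpoint_completion_target = 0.5;
	int		checkpoint_segments = 20;
	double	segment_size_mb = 16.0;

	/* the manual's formula, in segments */
	double	segments_on_disk =
		(2 + checkpoint_completion_target) * checkpoint_segments + 1;

	printf("pg_xlog needs roughly %.0f MB for checkpoint_segments = %d\n",
		   segments_on_disk * segment_size_mb, checkpoint_segments);
	/* prints ~816 MB, i.e. close to the "1GB per 20 segments" rule of thumb */
	return 0;
}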
For limiting the time required to recover after crash,
checkpoint_segments is awkward because it's difficult to calculate how
long recovery will take, given checkpoint_segments=X. A bulk load can
use up segments really fast, and recovery will be fast, while segments
full of random deletions can need a lot of random I/O to replay, and
take a long time. IMO checkpoint_timeout is a much better way to control
that, although it's not perfect either.
A third point is that even if you have 10 GB of disk space reserved for
WAL, you don't want to actually consume all that 10 GB, if it's not
required to run the database smoothly. There are several reasons for
that: backups based on a filesystem-level snapshot are larger than
necessary if there are a lot of preallocated WAL segments, and in a
virtualized or shared system, there might be other VMs or applications
that could make use of the disk space. On the other hand, you don't want
to run out of disk space while writing WAL - that can lead to a PANIC in
the worst case.
In VMware's vPostgres fork, we've hacked the way that works, so that
there is a new setting, checkpoint_segments_max that can be set by the
user, but checkpoint_segments is adjusted automatically, on the fly. The
system counts how many segments were consumed during the last checkpoint
cycle, and that becomes the checkpoint_segments setting for the next
cycle. That means that in a system with a steady load, checkpoints are
triggered by checkpoint_timeout, and the effective checkpoint_segments
value converges at the exact number of segments needed for that. That's
simple but very effective. It doesn't behave too well with bursty load,
however; during quiet times, checkpoint_segments is dialed way down, and
when the next burst comes along, you get several checkpoints in quick
succession, until checkpoint_segments is dialed back up again.
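As a sketch of that feedback loop (the names checkpoint_segments_max and
adjust_checkpoint_segments are invented here; this is not the actual
vPostgres code):

/*
 * Sketch of the vPostgres-style adjustment described above.
 */
static int	checkpoint_segments_max = 256;	/* set by the user */
static int	checkpoint_segments = 3;		/* adjusted on the fly */

/* called once per completed checkpoint cycle */
static void
adjust_checkpoint_segments(int segments_used_last_cycle)
{
	/* next cycle gets whatever the previous cycle actually consumed */
	checkpoint_segments = segments_used_last_cycle;

	if (checkpoint_segments > checkpoint_segments_max)
		checkpoint_segments = checkpoint_segments_max;
	if (checkpoint_segments < 1)
		checkpoint_segments = 1;
}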
I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the max. disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are
still some segments left to consume while the checkpoint runs), a
checkpoint is triggered.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles.
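Roughly, the shape of that trigger would be something like the following
(a sketch only, with invented names; in particular, using the auto-tuned
preallocation estimate as the headroom is just an assumption for
illustration, not a decided detail):

#include <stdbool.h>

#define WAL_SEGMENT_MB 16

static int	max_wal_size_mb = 512;		/* ceiling on pg_xlog, set by the user */
static int	wal_prealloc_estimate = 3;	/* auto-tuned from previous cycles */

static bool
wal_based_checkpoint_needed(int segments_since_last_checkpoint)
{
	int		max_segments = max_wal_size_mb / WAL_SEGMENT_MB;

	/* leave some segments free for WAL written while the checkpoint runs */
	return segments_since_last_checkpoint >= max_segments - wal_prealloc_estimate;
}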
I'll write up a patch to do that, but before I do, does anyone disagree
on those tuning principles? How do you typically tune
checkpoint_segments on your servers? If the system was to tune it
automatically, what formula should it use?
- Heikki
On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
checkpoint_segments is awkward. From an admin's point of view, it controls
two things:
1. it limits the amount of disk space needed for pg_xlog. (it's a soft
limit, but still)
2. it limits the time required to recover after a crash.
For limiting the disk space needed for pg_xlog, checkpoint_segments is
awkward because it's defined in terms of 16MB segments between checkpoints.
It takes a fair amount of arithmetic to calculate the disk space required to
hold the specified number of segments. The manual gives the formula: (2 +
checkpoint_completion_target) * checkpoint_segments + 1, which amounts to
about 1GB per 20 segments as a rule of thumb. We shouldn't impose that
calculation on the user. It should be possible to just specify
"checkpoint_segments=512MB", and the system would initiate checkpoints so
that the total size of WAL in pg_xlog stays below 512MB.
For limiting the time required to recover after crash, checkpoint_segments
is awkward because it's difficult to calculate how long recovery will take,
given checkpoint_segments=X. A bulk load can use up segments really fast,
and recovery will be fast, while segments full of random deletions can need
a lot of random I/O to replay, and take a long time. IMO checkpoint_timeout
is a much better way to control that, although it's not perfect either.
A third point is that even if you have 10 GB of disk space reserved for WAL,
you don't want to actually consume all that 10 GB, if it's not required to
run the database smoothly. There are several reasons for that: backups based
on a filesystem-level snapshot are larger than necessary, if there are a lot
of preallocated WAL segments and in a virtualized or shared system, there
might be other VMs or applications that could make use of the disk space. On
the other hand, you don't want to run out of disk space while writing WAL -
that can lead to a PANIC in the worst case.
In VMware's vPostgres fork, we've hacked the way that works, so that there
is a new setting, checkpoint_segments_max that can be set by the user, but
checkpoint_segments is adjusted automatically, on the fly. The system counts
how many segments were consumed during the last checkpoint cycle, and that
becomes the checkpoint_segments setting for the next cycle. That means that
in a system with a steady load, checkpoints are triggered by
checkpoint_timeout, and the effective checkpoint_segments value converges at
the exact number of segments needed for that. That's simple but very
effective. It doesn't behave too well with bursty load, however; during
quiet times, checkpoint_segments is dialed way down, and when the next burst
comes along, you get several checkpoints in quick succession, until
checkpoint_segments is dialed back up again.
I propose that we do something similar, but not exactly the same. Let's have
a setting, max_wal_size, to control the max. disk space reserved for WAL.
Once that's reached (or you get close enough, so that there are still some
segments left to consume while the checkpoint runs), a checkpoint is
triggered.
What if max_wal_size is reached while the checkpoint is running? Should we
change the checkpoint from spread mode to fast mode? Or, if max_wal_size
is a hard limit, should we make the allocation of new WAL files wait until
the checkpoint has finished and removed some old WAL files?
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high, without
actually consuming that much space in normal operation. It's just a
backstop, to avoid completely filling the disk, if there's a sudden burst of
activity. The number of segments preallocated is auto-tuned, based on the
number of segments used in previous checkpoint cycles.
How is wal_keep_segments handled in your approach?
I'll write up a patch to do that, but before I do, does anyone disagree on
those tuning principles?
No, at least not from me. I like your idea.
Regards,
--
Fujii Masao
On 05.06.2013 21:16, Fujii Masao wrote:
On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I propose that we do something similar, but not exactly the same. Let's have
a setting, max_wal_size, to control the max. disk space reserved for WAL.
Once that's reached (or you get close enough, so that there are still some
segments left to consume while the checkpoint runs), a checkpoint is
triggered.
What if max_wal_size is reached while the checkpoint is running? Should we
change the checkpoint from spread mode to fast mode?
The checkpoint spreading code already tracks if the checkpoint is "on
schedule", and it takes into account both checkpoint_timeout and
checkpoint_segments. Ie. if you consume segments faster than expected,
the checkpoint will speed up as well. Once checkpoint_segments is
reached, the checkpoint will complete ASAP, with no delays to spread it out.
This would still work the same with max_wal_size. A new checkpoint would
be started well before reaching max_wal_size, so that it has enough time
to complete. If the checkpoint "falls behind", it will hurry up until
it's back on schedule. If max_wal_size is reached anyway, it will
complete ASAP.
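Conceptually, the scheduling check is something like this (a simplification
with made-up names, not the actual checkpointer code):

#include <stdbool.h>

/*
 * Conceptual sketch of the "on schedule" test.  'progress' is the fraction
 * of checkpoint work already done; the other two are elapsed time and WAL
 * consumed, each as a fraction of its budget (checkpoint_timeout, and the
 * segment/size limit, scaled by checkpoint_completion_target).
 */
static bool
checkpoint_on_schedule(double progress, double time_fraction, double wal_fraction)
{
	/* behind on either measure: stop sleeping between writes and catch up */
	if (progress < time_fraction || progress < wal_fraction)
		return false;
	return true;
}

With the proposal, only the source of wal_fraction changes - it would be
computed against max_wal_size instead of checkpoint_segments.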
Or, if max_wal_size
is a hard limit, should we make the allocation of new WAL files wait until
the checkpoint has finished and removed some old WAL files?
I was not thinking of making it a hard limit. It would be just like
checkpoint_segments from that point of view - if a checkpoint takes a
long time, max_wal_size might still be exceeded.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high, without
actually consuming that much space in normal operation. It's just a
backstop, to avoid completely filling the disk, if there's a sudden burst of
activity. The number of segments preallocated is auto-tuned, based on the
number of segments used in previous checkpoint cycles.
How is wal_keep_segments handled in your approach?
Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments. Perhaps
max_wal_size should include WAL retained for wal_keep_segments, leaving
less room for checkpoints. Ie. when you set wal_keep_segments
higher, an xlog-based checkpoint would be triggered earlier, because the
old segments kept for replication would leave less room for new
segments. And setting wal_keep_segments higher than max_wal_size would
be an error.
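A sketch of that accounting (invented names, not a patch):

/*
 * Sketch only: if wal_keep_segments is carved out of max_wal_size, the
 * WAL-based checkpoint trigger fires after the remaining room is used up.
 */
static int
segments_available_for_checkpoints(int max_wal_size_segments,
								   int wal_keep_segments)
{
	/* setting wal_keep_segments >= max_wal_size would be rejected as an error */
	if (wal_keep_segments >= max_wal_size_segments)
		return -1;

	/* room left for new WAL between checkpoints */
	return max_wal_size_segments - wal_keep_segments;
}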
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I was not thinking of making it a hard limit. It would be just
like checkpoint_segments from that point of view - if a
checkpoint takes a long time, max_wal_size might still be
exceeded.
Then I suggest we not use exactly that name. I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 6, 2013 at 3:35 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05.06.2013 21:16, Fujii Masao wrote:
On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the max. disk space reserved for WAL.
Once that's reached (or you get close enough, so that there are still some
segments left to consume while the checkpoint runs), a checkpoint is
triggered.
What if max_wal_size is reached while the checkpoint is running? Should we
change the checkpoint from spread mode to fast mode?
The checkpoint spreading code already tracks if the checkpoint is "on
schedule", and it takes into account both checkpoint_timeout and
checkpoint_segments. Ie. if you consume segments faster than expected, the
checkpoint will speed up as well. Once checkpoint_segments is reached, the
checkpoint will complete ASAP, with no delays to spread it out.
Yep, right. One problem is that this mechanism doesn't work in the standby.
So, are you planning to 'fix' that so that max_wal_size works well even in
the standby? Or just leave that as it is? According to the remaining part of
your email, you seem to choose the latter, though.
This would still work the same with max_wal_size. A new checkpoint would be
started well before reaching max_wal_size, so that it has enough time to
complete. If the checkpoint "falls behind", it will hurry up until it's back
on schedule. If max_wal_size is reached anyway, it will complete ASAP.
Or, if max_wal_size is a hard limit, should we make the allocation of new
WAL files wait until the checkpoint has finished and removed some old WAL files?
I was not thinking of making it a hard limit. It would be just like
checkpoint_segments from that point of view - if a checkpoint takes a long
time, max_wal_size might still be exceeded.
So, if the archive command keeps failing or its speed is very slow
(e.g., because it uses a compression tool), max_wal_size can still be
exceeded by a large margin. Right?
I'm wondering if it's worth exposing an option specifying whether to use
max_wal_size as a hard limit or not. If it's not a hard limit, the disk space
can be filled up with WAL files and a PANIC can happen. In this case, in order
to restart the database service, we need to enlarge the disk space or relocate
some WAL files to another disk, and then start the server up again.
The normal crash recovery needs to be done. This would lead to a lot of
service downtime.
OTOH, if we use max_wal_size as a hard limit, we can avoid such a PANIC
error and the long downtime. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks down from a client's point of view, but the downtime is shorter
than in the PANIC case. So I'm thinking that some users might want a hard
limit on pg_xlog size.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without
actually consuming that much space in normal operation. It's just a
backstop, to avoid completely filling the disk, if there's a sudden burst
of
activity. The number of segments preallocated is auto-tuned, based on the
number of segments used in previous checkpoint cycles.
How is wal_keep_segments handled in your approach?
Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments.
+1
Regards,
--
Fujii Masao
Heikki,
We shouldn't impose that calculation on the user. It
should be possible to just specify "checkpoint_segments=512MB", and the
system would initiate checkpoints so that the total size of WAL in
pg_xlog stays below 512MB.
Agreed.
For limiting the time required to recover after crash,
checkpoint_segments is awkward because it's difficult to calculate how
long recovery will take, given checkpoint_segments=X. A bulk load can
use up segments really fast, and recovery will be fast, while segments
full of random deletions can need a lot of random I/O to replay, and
take a long time. IMO checkpoint_timeout is a much better way to control
that, although it's not perfect either.
This is true, but I don't see that your proposal changes this at all
(for the better or for the worse).
A third point is that even if you have 10 GB of disk space reserved for
WAL, you don't want to actually consume all that 10 GB, if it's not
required to run the database smoothly.
Agreed.
I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the max. disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are
still some segments left to consume while the checkpoint runs), a
checkpoint is triggered.
Refinement of the proposal:
1. max_wal_size is a hard limit
2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )
to avoid lockup if checkpoint takes longer than expected.
3. wal_keep_segments is taken out of max_wal_size.
a. it automatically defaults to 20% of max_wal_size if
max_wal_senders > 0
b. for that reason, we don't allow it to be larger
than 80% of max_wal_size
4. preallocated WAL isn't allowed to shrink smaller than
wal_keep_segements + (max_wal_size * 0.1).
This would mean that I could set my server to:
max_wal_size = 2GB
and ...
* by default, 26 segments (416MB) would be kept for wal_keep_segments.
* checkpoint target would be 77 segments (1.2GB)
* preallocated WAL will always be at least 39 segments (624MB),
including keep_segments.
now, if I had a fairly low transaction database, but wanted to make sure
I could recover from an 8-hour break in replication, I might bump up
wal_keep_segments to 1GB. In that case:
* 64 segments (1GB) would be kept.
* checkpoints would target 96 segments (1.5GB)
* preallocated WAL would always be at least 77 segments (1.2GB)
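Spelling out the arithmetic behind those figures (nothing here beyond the
percentages already given above):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	const double seg_mb = 16.0;
	double	max_segs = 2048.0 / seg_mb;			/* max_wal_size = 2GB -> 128 segments */
	double	keep_segs = ceil(0.20 * max_segs);	/* default wal_keep_segments: 20% -> 26 */

	/* checkpoint target: keep_segments plus half of what's left */
	double	target_segs = keep_segs + 0.5 * (max_segs - keep_segs);

	/* preallocation floor: keep_segments plus 10% of max_wal_size */
	double	prealloc_segs = ceil(keep_segs + 0.1 * max_segs);

	printf("keep %.0f segs (%.0f MB), target %.0f segs (%.0f MB), floor %.0f segs (%.0f MB)\n",
		   keep_segs, keep_segs * seg_mb,
		   target_segs, target_segs * seg_mb,
		   prealloc_segs, prealloc_segs * seg_mb);
	/* -> 26 (416 MB), 77 (1232 MB), 39 (624 MB), matching the figures above */
	return 0;
}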
Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments.
Well, the ideal unit from the user's point of view is *time*, not space.
That is, the user wants the master to keep, say, "8 hours of
transaction logs", not any amount of MB. I don't want to complicate
this proposal by trying to deliver that, though.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles.
"based on"; can you give me your algorithmic thinking here? I'm
thinking we should have some calculation of last cycle size and peak
cycle size so that bursty workloads aren't compromised.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
OTOH, if we use max_wal_size as a hard limit, we can avoid such PANIC
error and long down time. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks like down from a client, but its down time is shorter than the
PANIC error case. So I'm thinking that some users might want the hard
limit of pg_xlog size.
I wonder if we could tie this in with the recent proposal from the
Heroku guys to have a way to slow down WAL writing. Maybe we have
several limits:
- When limit #1 is passed (or checkpoint_timeout elapses), we start a
spread checkpoint.
- If it looks like we're going to exceed limit #2 before the
checkpoint completes, we attempt to perform the checkpoint more
quickly, by reducing the delay between buffer writes. If we actually
exceed limit #2, we try to complete the checkpoint as fast as
possible.
- If it looks like we're going to exceed limit #3 before the
checkpoint completes, we start exerting back-pressure on writers by
making them wait every time they write WAL, probably in proportion to
the number of bytes written. We keep ratcheting up the wait until
we've slowed down writers enough that the checkpoint will finish within limit #3. As
we reach limit #3, the wait goes to infinity; only read-only
operations can proceed until the checkpoint finishes.
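A rough sketch of that staged response (the limit names are invented, and
the "looks like we're going to exceed" prediction is reduced to a plain
comparison here):

/*
 * Sketch of the staged response described above.
 */
typedef enum
{
	WAL_OK,				/* below limit #1: nothing to do */
	WAL_CHECKPOINT,		/* limit #1 passed: start a spread checkpoint */
	WAL_HURRY,			/* limit #2 threatened: shrink delays between buffer writes */
	WAL_THROTTLE		/* limit #3 threatened: make WAL writers wait */
} WalPressure;

static WalPressure
wal_pressure(double wal_bytes, double limit1, double limit2, double limit3)
{
	if (wal_bytes >= limit3)
		return WAL_THROTTLE;	/* wait tends to infinity: effectively read-only */
	if (wal_bytes >= limit2)
		return WAL_HURRY;
	if (wal_bytes >= limit1)
		return WAL_CHECKPOINT;
	return WAL_OK;
}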
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 06/05/2013 05:37 PM, Robert Haas wrote:
On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
OTOH, if we use max_wal_size as a hard limit, we can avoid such PANIC
error and long down time. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks like down from a client, but its down time is shorter than the
PANIC error case. So I'm thinking that some users might want the hard
limit of pg_xlog size.
I wonder if we could tie this in with the recent proposal from the
Heroku guys to have a way to slow down WAL writing. Maybe we have
several limits:
I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.
JD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats
On Thu, Jun 6, 2013 at 10:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 06/05/2013 05:37 PM, Robert Haas wrote:
On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:
OTOH, if we use max_wal_size as a hard limit, we can avoid such PANIC
error and long down time. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks like down from a client, but its down time is shorter than
the
PANIC error case. So I'm thinking that some users might want the hard
limit of pg_xlog size.
I wonder if we could tie this in with the recent proposal from the
Heroku guys to have a way to slow down WAL writing. Maybe we have
several limits:
I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.
Here it is:
/messages/by-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com
--
Michael
On Wed, Jun 5, 2013 at 6:00 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.
It's not as insane as introducing an archiving gap, PANICing and
crashing, or running this hunk o junk I wrote
http://github.com/fdr/ratchet
On 06/05/2013 05:37 PM, Robert Haas wrote:
- If it looks like we're going to exceed limit #3 before the
checkpoint completes, we start exerting back-pressure on writers by
making them wait every time they write WAL, probably in proportion to
the number of bytes written. We keep ratcheting up the wait until
we've slowed down writers enough that will finish within limit #3. As
we reach limit #3, the wait goes to infinity; only read-only
operations can proceed until the checkpoint finishes.
Alright, perhaps I am dense. I have read both this thread and the other
one on better handling of archive command
(/messages/by-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com).
I recognize there are brighter minds than mine on this thread but I just
honestly don't get it.
1. WAL writes are already fast. They are the fastest write we have
because it is sequential.
2. We don't want them to be slow. We want data written to disk as
quickly as possible without adversely affecting production. That's the
point.
3. The spread checkpoints have always confused me. If anything we want a
checkpoint to be fast and short because:
4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to be in the range of
never noticed.
Now perhaps my customers workloads are different but for us:
1. Checkpoint timeout is set as high as reasonable, usually 30 minutes
to an hour. I wish I could set them even further out.
2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
based on an actual amount of IO bandwidth it may take per second based
on their IO constraints. (Note I know that wal_writer comes into play
here but I honestly don't remember where and am reading up on it to
refresh my memory).
3. The biggest issue we see with checkpoint segments is not running out
of space because really.... 10GB is how many checkpoint segments? It is
with wal_keep_segments. If we don't want to fill up the pg_xlog
directory, put the wal logs that are for keep_segments elsewhere.
Other oddities:
Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone. Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.
Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter we should be able to rather indefinitely not worry
about checkpoints (with a few exceptions such as pg_start_backup()).
Perhaps a setting that causes a checkpoint to happen based on some
non-artificial threshold (timeout) such as amount of data currently in
need of a checkpoint?
Heikki said, "I propose that we do something similar, but not exactly
the same. Let's have a setting, max_wal_size, to control the max. disk
space reserved for WAL. Once that's reached (or you get close enough, so
that there are still some segments left to consume while the checkpoint
runs), a checkpoint is triggered.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles. "
This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC? Perhaps at any given point we save 10% of available
space (within a 16MB calculation) for pg_xlog, you hit it, we checkpoint
and LOG EXACTLY WHY.
Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place
we immediately bounce into read-only mode and log even more loudly that
the SA should be fired. I would think read-only mode is safer and more
polite than an PANIC crash.
I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to
assume that a user will pay attention to disk space. Really?
Open to people telling me I am off in left field. Sorry if it is noise.
Sincerely,
JD
On 06/05/2013 06:23 PM, Daniel Farina wrote:
On Wed, Jun 5, 2013 at 6:00 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.
It's not as insane as introducing an archiving gap, PANICing and
crashing, or running this hunk o junk I wrote
http://github.com/fdr/ratchet
Well certainly we shouldn't PANIC and crash but that is a simple fix.
You have a backup write location and start logging really loudly that
you are using it.
Sincerely,
JD
On Wed, Jun 5, 2013 at 8:23 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
It's not as insane as introducing an archiving gap, PANICing and
crashing, or running this hunk o junk I wrote
http://github.com/fdr/ratchet
Well certainly we shouldn't PANIC and crash but that is a simple fix. You
have a backup write location and start logging really loudly that you are
using it.
If I told you there were some of us who would prefer to attenuate the
rate that things get written rather than cancel or delay archiving for
a long period of time, would that explain the framing of the problem?
Or, is it that you understand that's what I want, but find the notion
of such an operation hard to relate to?
Or, am I misunderstanding your confusion?
Or, none of the above?
On 6/5/2013 10:07 PM, Daniel Farina wrote:
If I told you there were some of us who would prefer to attenuate the
rate that things get written rather than cancel or delay archiving for
a long period of time, would that explain the framing of the problem?
I understand that based on what you said above.
Or, is it that you understand that's what I want, but find the notion
of such an operation hard to relate to?
I think this is where I am at. To me, you don't attenuate the rate that
things get written, you fix the problem in needing to do so. The problem
is one of provisioning. Please note that I am not suggesting there
aren't improvements to be made, there absolutely are. I just wonder if
we are looking in the right place (outside of some obvious badness like
the PANIC running out of disk space).
Or, am I misunderstanding your confusion?
To be honest part of my confusion was just trying to parse all the bits
that people were talking about into a cohesive, "this is the actual
problem".
Sincerely,
JD
On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
I just wonder if we are looking in the right place (outside of some obvious
badness like the PANIC running out of disk space).
So you don't think we should PANIC on running out of disk space? If
you don't think we should do that, and you don't think that WAL
writing should be throttled, what's the alternative?
--
Peter Geoghegan
On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 6/5/2013 10:07 PM, Daniel Farina wrote:
If I told you there were some of us who would prefer to attenuate the
rate that things get written rather than cancel or delay archiving for
a long period of time, would that explain the framing of the problem?
I understand that based on what you said above.
Or, is it that you understand that's what I want, but find the notion
of such an operation hard to relate to?
I think this is where I am at. To me, you don't attenuate the rate that
things get written, you fix the problem in needing to do so. The problem is
one of provisioning. Please note that I am not suggesting there aren't
improvements to be made, there absolutely are. I just wonder if we are
looking in the right place (outside of some obvious badness like the PANIC
running out of disk space).
Okay, well, I don't see the fact that the block device is faster than
the archive command as a "problem," it's just an artifact of the
ratios of performance of stuff in the system. If one views archives
as a must-have, there's not much other choice than to attenuate.
An alternative is to buy a slower block device. That'd accomplish the
same effect, but it's a pretty bizarre and heavyhanded way to go about
it, and not easily adaptive to, say, if I made the archive command
faster (in my case, I well could, with some work).
So, I don't think it's all that unnatural to allow for the flexibility
of a neat attenuation technique, and it's pretty important too.
Methinks. Disagree?
Final thought: I can't really tell users to knock off what they're
doing on a large scale. It's better to not provide abrupt changes in
service (like crashing or turning off everything for extended periods
while the archive uploads). So, smoothness and predictability is
desirable.
On 6/5/2013 10:54 PM, Peter Geoghegan wrote:
On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
I just wonder if we are looking in the right place (outside of some obvious
badness like the PANIC running out of disk space).
So you don't think we should PANIC on running out of disk space? If
you don't think we should do that, and you don't think that WAL
writing should be throttled, what's the alternative?
As I mentioned in my previous email:
Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place
we immediately bounce into read-only mode and log even more loudly that
the SA should be fired. I would think read-only mode is safer and more
polite than an PANIC crash.
I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to
assume that a user will pay attention to disk space. Really?
JD
On Wed, Jun 5, 2013 at 11:05 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 6/5/2013 10:54 PM, Peter Geoghegan wrote:
On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com>
wrote:
I just wonder if we are looking in the right place (outside of some
obvious
badness like the PANIC running out of disk space).
So you don't think we should PANIC on running out of disk space? If
you don't think we should do that, and you don't think that WAL
writing should be throttled, what's the alternative?
As I mentioned in my previous email:
Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place we
immediately bounce into read-only mode and log even more loudly that the SA
should be fired. I would think read-only mode is safer and more polite than
an PANIC crash.
I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to assume
that a user will pay attention to disk space. Really?
Okay, then I will say it's user unfriendly, especially for a transient
use of space, and particularly if there's no knob for said SA to
attenuate what's going on. You appear to assume the SA can lean on
the application to knock off whatever is going on or provision more
disk in time, or that disk is reliable enough to meet one's goals. In
my case, none of these precepts are true or desirable.
Hi,
On Wed, Jun 5, 2013 at 11:05 PM, Joshua D. Drake <jd@commandprompt.com>wrote:
On 6/5/2013 10:54 PM, Peter Geoghegan wrote:
On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com>
wrote:
Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA
This merely buys you some time, but with aggressive and sustained write
throughput you are left on the same spot. Practically speaking it's the
same situation as increasing the pg_xlog disk space.
and log very loudly that the SA isn't paying attention. Perhaps if that
area starts to get to an unhappy place we immediately bounce into read-only
mode and log even more loudly that the SA should be fired. I would think
read-only mode is safer and more polite than an PANIC crash.
I agree it is better than PANIC, but read-only mode is definitely also a
form of throttling; a much more abrupt and unfriendly one if I may add.
Regards,
-Harold
On 6/5/2013 11:09 PM, Daniel Farina wrote:
Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place we
immediately bounce into read-only mode and log even more loudly that the SA
should be fired. I would think read-only mode is safer and more polite than
an PANIC crash.
I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to assume
that a user will pay attention to disk space. Really?
Okay, then I will say it's user unfriendly, especially for a transient
Okay, then I will say it's user unfriendly, especially for a transient
use of space, and particularly if there's no knob for said SA to
attenuate what's going on. You appear to assume the SA can lean on
the application to knock off whatever is going on or provision more
disk in time, or that disk is reliable enough to meet one's goals. In
my case, none of these precepts are true or desirable.
I have zero doubt that in your case it is true and desirable. I just
don't know that it is a positive solution to the problem as a whole.
Your case is rather limited to your environment, which is rather limited
to the type of user that your environment has. Which lends itself to the
idea that this should be a Heroku Postgres thing, not a .Org wide thing.
Sincerely,
JD
On Wed, Jun 5, 2013 at 11:28 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
I have zero doubt that in your case it is true and desirable. I just don't
know that it is a positive solution to the problem as a whole. Your case is
rather limited to your environment, which is rather limited to the type of
user that your environment has. Which lends itself to the idea that this
should be a Heroku Postgres thing, not a .Org wide thing.
If you look through the -general archives, or on stack overflow you'll
find ample evidence that it is a problem that lots of people have.
--
Peter Geoghegan
On 6/5/2013 11:25 PM, Harold Giménez wrote:
Instead of "running out of disk space PANIC" we should just write
to an emergency location within PGDATA
This merely buys you some time, but with aggressive and sustained
write throughput you are left on the same spot. Practically speaking
it's the same situation as increasing the pg_xlog disk space.
Except that you likely can't increase pg_xlog space (easily). The point
here is to have overflow, think swap space.
I agree it is better than PANIC, but read-only mode is definitely also
a form of throttling; a much more abrupt and unfriendly one if I may add.
I would think read only is less unfriendly than an all out failure.
Consider if done correctly, the database would move back into read-write
mode once the problem was resolved.
JD
On 6/5/2013 11:31 PM, Peter Geoghegan wrote:
On Wed, Jun 5, 2013 at 11:28 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
I have zero doubt that in your case it is true and desirable. I just don't
know that it is a positive solution to the problem as a whole. Your case is
rather limited to your environment, which is rather limited to the type of
user that your environment has. Which lends itself to the idea that this
should be a Heroku Postgres thing, not a .Org wide thing.
If you look through the -general archives, or on stack overflow you'll
find ample evidence that it is a problem that lots of people have.
Not to be unkind but the problems of the uninformed certainly are not the
problems of the informed. Or perhaps they are certainly the problems of
the informed :P. I do read -general and I don't see it much honestly. I
don't watch stackoverflow that much but I am sure it probably does come
up here, sometimes but I bet I can point once again to a lack of
provisioning on their part.
This reminds me of the time that someone from Heroku asked at PgEast,
with a show of hands, how many people there don't back up their database to
S3. Almost everyone in the audience raised their hands.
Again, I don't question your need but just because it is hot and now
doesn't mean it is healthy. I honestly do not see the requirement you are
trying to represent as a need for the wider, production community.
(in short, not a single one of my customers would benefit from it, and
90% of them are running databases Heroku can't.)
That is not a slight, honestly. I think your service is cool. I am just
being honest.
Sincerely,
JD
On 06.06.2013 06:20, Joshua D. Drake wrote:
3. The spread checkpoints have always confused me. If anything we want a
checkpoint to be fast and short because:
(I'm sure you know this, but:) If you perform a checkpoint as fast and
short as possible, the sudden burst of writes and fsyncs will overwhelm
the I/O subsystem, and slow down queries. That's what we saw before
spread checkpoints: when a checkpoint happens, the response times of
queries jumped up.
4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to be in the range of
never noticed.
Oh, I see where you're going. Yeah, that would be one way to do it.
However, spread checkpoints has pretty much the same effect. Imagine
that you tune your system like this: disable bgwriter altogether, and
set checkpoint_completion_target=0.9. With that, there will be a
checkpoint in progress most of the time, because by the time one
checkpoint completes, it's almost time to begin the next one already. In
that case, the checkpointer will be slowly performing the writes, all
the time, in the background, without affecting queries. The effect is
the same as what you described above, except that it's the checkpointer
doing the writing, not bgwriter.
As it happens, that's pretty much what you get with the default settings.
Now perhaps my customers workloads are different but for us:
1. Checkpoint timeout is set as high as reasonable, usually 30 minutes
to an hour. I wish I could set them even further out.
2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
based on an actual amount of IO bandwidth it may take per second based
on their IO constraints. (Note I know that wal_writer comes into play
here but I honestly don't remember where and am reading up on it to
refresh my memory).
I've heard people just turning off bgwriter because it doesn't have much
effect anyway. You might want to try that, and if checkpoints cause I/O
spikes, raise checkpoint_completion_target instead.
3. The biggest issue we see with checkpoint segments is not running out
of space because really.... 10GB is how many checkpoint segments? It is
with wal_keep_segments. If we don't want to fill up the pg_xlog
directory, put the wal logs that are for keep_segments elsewhere.
Yeah, wal_keep_segments is a hack. We should replace it with something
else, like having a registry of standbys in the master, and how far
they've streamed. That way the master could keep around the amount of
WAL actually needed by them, not more not less. But that's a different
story.
Other oddities:
Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone.
The point of having checkpoint_segments or max_wal_size is to put a
limit (albeit a soft one) on the amount of disk space used. If you don't
care about that, I guess we could allow max_wal_size=-1 to mean
infinite, and checkpoints would be driven off purely based on time, not
WAL consumption.
Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.
To clarify, here you're referring to controlling the number of WAL
segments preallocated/recycled, rather than how often checkpoints are
triggered. Currently, both are derived from checkpoint_segments, but I
proposed to separate them. The above is exactly what I proposed to do
for the preallocation/recycling, it would be tuned automatically, but
you still need something like max_wal_size for the other thing, to
trigger a checkpoint if too much WAL is being consumed.
Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter we should be able to rather indefinitely not worry
about checkpoints (with a few exceptions such as pg_start_backup()).
Perhaps a setting that causes a checkpoint to happen based on some
non-artificial threshold (timeout) such as amount of data currently in
need of a checkpoint?
Either I'm not understanding what you said, or you're confused. The
point of checkpoint_timeout is to put a limit on the time it will take to
recover in case of crash. The relation between the two,
checkpoint_timeout and how long it will take to recover after a crash,
is not straightforward, but that's the best we have.
Bgwriter does not worry about checkpoints. By "amount of data currently
in need of a checkpoint", do you mean the number of dirty buffers in
shared_buffers, or something else? I don't see how or why that should
affect when you perform a checkpoint.
Heikki said, "I propose that we do something similar, but not exactly
the same. Let's have a setting, max_wal_size, to control the max. disk
space reserved for WAL. Once that's reached (or you get close enough, so
that there are still some segments left to consume while the checkpoint
runs), a checkpoint is triggered.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles. "
This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC?
Because you want to limit the amount of disk space used for WAL. It's a
soft limit, but still.
Perhaps at any given point we save 10% of available
space (within a 16MB calculation) for pg_xlog, you hit it, we checkpoint
and LOG EXACTLY WHY.
Ah, but we don't know how much disk space is available. Even if we did,
there might be quotas or other constraints on the amount that we can
actually use. Or the DBA might not want PostgreSQL to use up all the
space, because there are other processes on the same system that need it.
- Heikki
On 05.06.2013 23:16, Josh Berkus wrote:
For limiting the time required to recover after crash,
checkpoint_segments is awkward because it's difficult to calculate how
long recovery will take, given checkpoint_segments=X. A bulk load can
use up segments really fast, and recovery will be fast, while segments
full of random deletions can need a lot of random I/O to replay, and
take a long time. IMO checkpoint_timeout is a much better way to control
that, although it's not perfect either.
This is true, but I don't see that your proposal changes this at all
(for the better or for the worse).
Right, it doesn't. I explained this to justify that it's OK to replace
checkpoint_segments with max_wal_size. If someone is trying to use
checkpoint_segments to limit the time required to recover after crash,
he might find the current checkpoint_segments setting more intuitive
than my proposed max_wal_size. checkpoint_segments means "perform a
checkpoint every X segments", so you know that after a crash, you will
have to replay at most X segments (except that
checkpoint_completion_target complicates that already). With
max_wal_size, the relationship is not as clear.
What I tried to argue is that I don't think that's a serious concern.
I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the max. disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are
still some segments left to consume while the checkpoint runs), a
checkpoint is triggered.
Refinement of the proposal:
1. max_wal_size is a hard limit
I'd like to punt on that until later. Making it a hard limit would be a
much bigger patch, and needs a lot of discussion how it should behave
(switch to read-only mode, progressively slow down WAL writes, or what?)
and how to implement it.
But I think there's a clear evolution path here; with current
checkpoint_segments, it's not sensible to treat that as a hard limit.
Once we have something like max_wal_size, defined in MB, it's much more
sensible. So turning it into a hard limit could be a follow-up patch, if
someone wants to step up to the plate.
2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )
to avoid lockup if checkpoint takes longer than expected.
Will also have to factor in checkpoint_completion_target.
Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments.
Well, the ideal unit from the user's point of view is *time*, not space.
That is, the user wants the master to keep, say, "8 hours of
transaction logs", not any amount of MB. I don't want to complicate
this proposal by trying to deliver that, though.
OTOH, if you specify it in terms of time, then you don't have any limit
on the amount of disk space required.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles.
"based on"; can you give me your algorithmic thinking here? I'm
thinking we should have some calculation of last cycle size and peak
cycle size so that bursty workloads aren't compromised.
Yeah, something like that :-). I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:
/*
 * Track a moving average of recent buffer allocations. Here, rather than
 * a true average we want a fast-attack, slow-decline behavior: we
 * immediately follow any increase.
 */
if (smoothed_alloc <= (float) recent_alloc)
	smoothed_alloc = recent_alloc;
else
	smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
		smoothing_samples;
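Transposed to WAL preallocation, the same shape might look something like
this (invented names; just one reading of the proposal, not committed code):

#define WAL_ESTIMATE_SMOOTHING_SAMPLES 16

static double	wal_prealloc_estimate = 3.0;	/* segments to keep preallocated */

/* called once per checkpoint cycle */
static void
update_wal_prealloc_estimate(int segments_used_this_cycle)
{
	if (wal_prealloc_estimate <= (double) segments_used_this_cycle)
		wal_prealloc_estimate = segments_used_this_cycle;	/* follow increases at once */
	else
		wal_prealloc_estimate +=
			((double) segments_used_this_cycle - wal_prealloc_estimate) /
			WAL_ESTIMATE_SMOOTHING_SAMPLES;					/* decay slowly */
}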
- Heikki
On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:
(I'm sure you know this, but:) If you perform a checkpoint as fast and
short as possible, the sudden burst of writes and fsyncs will
overwhelm the I/O subsystem, and slow down queries. That's what we saw
before spread checkpoints: when a checkpoint happens, the response
times of queries jumped up.
That isn't quite right. Previously we had lock issues as well and
checkpoints would take considerable time to complete. What I am talking
about is that the background writer (and wal writer where applicable)
have done all the work before a checkpoint is even called. Consider that
every one of my clients that I am active with sets the
checkpoint_completion_target to 0.9. With a proper bgwriter config this
works.
4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to be in the range of
never noticed.
Oh, I see where you're going.
O.k. good. I am not nuts :D
Yeah, that would be one way to do it. However, spread checkpoints has
pretty much the same effect. Imagine that you tune your system like
this: disable bgwriter altogether, and set
checkpoint_completion_target=0.9. With that, there will be a
checkpoint in progress most of the time, because by the time one
checkpoint completes, it's almost time to begin the next one already.
In that case, the checkpointer will be slowly performing the writes,
all the time, in the background, without affecting queries. The effect
is the same as what you described above, except that it's the
checkpointer doing the writing, not bgwriter.
O.k. if that is true, then we have redundant systems and we need to
remove one of them.
Yeah, wal_keep_segments is a hack. We should replace it with something
else, like having a registry of standbys in the master, and how far
they've streamed. That way the master could keep around the amount of
WAL actually needed by them, not more not less. But that's a different
story.
Other oddities:
Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone.
The point of having checkpoint_segments or max_wal_size is to put a
limit (albeit a soft one) on the amount of disk space used. If you
don't care about that, I guess we could allow max_wal_size=-1 to mean
infinite, and checkpoints would be driven off purely based on time,
not WAL consumption.
I would not only agree with that, I would argue that max_wal_size
doesn't need to be there at least as a default. Perhaps as an "advanced"
configuration option that only those in the know see.
Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.
To clarify, here you're referring to controlling the number of WAL
segments preallocated/recycled, rather than how often checkpoints are
triggered. Currently, both are derived from checkpoint_segments, but I
proposed to separate them. The above is exactly what I proposed to do
for the preallocation/recycling, it would be tuned automatically, but
you still need something like max_wal_size for the other thing, to
trigger a checkpoint if too much WAL is being consumed.
You think so? I agree with 90% of this paragraph but it seems to me that
we can find an algorithm that manages this without the idea of
max_wal_size (at least as a user-settable knob).
Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter we should be able to rather indefinitely not worry
about checkpoints (with a few exceptions such as pg_start_backup()).
Perhaps a setting that causes a checkpoint to happen based on some
non-artificial threshold (timeout) such as amount of data currently in
need of a checkpoint?
Either I'm not understanding what you said, or you're confused. The
point of checkpoint_timeout is to put a limit on the time it will take to
recover in case of crash. The relation between the two,
checkpoint_timeout and how long it will take to recover after a crash,
is not straightforward, but that's the best we have.
I may be confused but it is my understanding that bgwriter writes out
the data from the shared buffer cache that is dirty based on an interval
and a max pages written. If we are writing data continuously, we don't
need checkpoints except for special cases (like pg_start_backup())?
Bgwriter does not worry about checkpoints. By "amount of data
currently in need of a checkpoint", do you mean the number of dirty
buffers in shared_buffers, or something else? I don't see how or why
that should affect when you perform a checkpoint.
Heikki said, "I propose that we do something similar, but not exactly
the same. Let's have a setting, max_wal_size, to control the max. disk
space reserved for WAL. Once that's reached (or you get close enough, so
that there are still some segments left to consume while the checkpoint
runs), a checkpoint is triggered.
In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles. "This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC?
Because you want to limit the amount of disk space used for WAL. It's
a soft limit, but still.
Why? This is the point that confuses me. Why do we care? We don't care
how much disk space PGDATA takes... why do we all of a sudden care about
pg_xlog?
Perhaps at any given point we save 10% of available
space (within a 16MB calculation) for pg_xlog, you hit it, we checkpoint
and LOG EXACTLY WHY.
Ah, but we don't know how much disk space is available. Even if we
did, there might be quotas or other constraints on the amount that we
can actually use. Or the DBA might not want PostgreSQL to use up all
the space, because there are other processes on the same system that
need it.
We could however know how much disk space is available.
Sincerely,
JD
- Heikki
On 06.06.2013 11:42, Joshua D. Drake wrote:
On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:
Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone.
The point of having checkpoint_segments or max_wal_size is to put a
limit (albeit a soft one) on the amount of disk space used. If you
don't care about that, I guess we could allow max_wal_size=-1 to mean
infinite, and checkpoints would be driven off purely based on time,
not WAL consumption.
I would not only agree with that, I would argue that max_wal_size
doesn't need to be there at least as a default. Perhaps as an "advanced"
configuration option that only those in the know see.
Well, we have checkpoint_segments=3 as the default currently, which in
the proposed scheme would be about equal to max_wal_size=120MB. For
better or worse, our defaults are generally geared towards small
systems, and that sounds about right for that.
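(Roughly: 3 segments of 16 MB each, times about 2.5 to cover the WAL kept
since the previous checkpoint plus the spread checkpoint in progress, is
2.5 * 3 * 16 MB = 120 MB.)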
Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.
To clarify, here you're referring to controlling the number of WAL
segments preallocated/recycled, rather than how often checkpoints are
triggered. Currently, both are derived from checkpoint_segments, but I
proposed to separate them. The above is exactly what I proposed to do
for the preallocation/recycling, it would be tuned automatically, but
you still need something like max_wal_size for the other thing, to
trigger a checkpoint if too much WAL is being consumed.
You think so? I agree with 90% of this paragraph but it seems to me that
we can find an algorithm that manages this without the idea of
max_wal_size (at least as a user settable).
We are in a violent agreement :-). max_wal_size would not directly
affect the preallocation of segments. The preallocation would be driven
off the actual number of segments used in previous checkpoint cycles,
not on max_wal_size.
Now, max_wal_size would affect when checkpoints happen (ie. if you're
about to reach max_wal_size, a checkpoint would be triggered), which
would in turn affect the number of segments used between cycles. But
there would be no direct connection between the two; the code to
calculate how much to preallocate would not refer to max_wal_size.
Maybe max_wal_size should set an upper limit on how much to preallocate,
though. If you want to limit the WAL size, we probably shouldn't exceed
it on purpose by preallocating segments, even if the algorithm based on
previous cycles suggests we should. This situation would arise if
the checkpoints can't keep up, so that each checkpoint cycle is longer
than we'd want, and we'd exceed max_wal_size because of that.
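To make the separation concrete, here is a minimal standalone sketch of the
idea, with invented names (this is not the patch and not PostgreSQL code):
the checkpoint trigger looks only at WAL consumed since the last redo
pointer, and the preallocation target looks only at past usage, with
max_wal_size acting purely as a cap.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch only; none of these names exist in the patch. */
#define WAL_SEG_SIZE   (16 * 1024 * 1024)

static uint64_t max_wal_size = 512 * 1024 * 1024;  /* soft limit on pg_xlog, bytes */
static double   cycle_usage_estimate = 0;          /* estimated WAL consumed per cycle */

/*
 * Checkpoint trigger: look only at WAL consumed since the last redo pointer,
 * firing early enough that the WAL written while the checkpoint runs should
 * still fit under the soft limit.
 */
static bool
wal_size_checkpoint_needed(uint64_t wal_since_redo, double completion_target)
{
    uint64_t trigger_at = (uint64_t) (max_wal_size / (2.0 + completion_target));

    return wal_since_redo >= trigger_at;
}

/*
 * Preallocation target: based only on observed usage in previous cycles,
 * never derived from max_wal_size, which acts purely as an upper bound.
 */
static uint64_t
wal_prealloc_target(void)
{
    uint64_t target = (uint64_t) cycle_usage_estimate + WAL_SEG_SIZE; /* round up */

    if (target > max_wal_size)
        target = max_wal_size;
    return target;
}

int
main(void)
{
    cycle_usage_estimate = 200.0 * 1024 * 1024;     /* pretend 200 MB per cycle */
    printf("checkpoint now? %d\n",
           wal_size_checkpoint_needed((uint64_t) 300 * 1024 * 1024, 0.5));
    printf("preallocate up to %llu MB of WAL\n",
           (unsigned long long) (wal_prealloc_target() / (1024 * 1024)));
    return 0;
}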
This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC?
Because you want to limit the amount of disk space used for WAL. It's
a soft limit, but still.
Why? This is the point that confuses me. Why do we care? We don't care
how much disk space PGDATA takes... why do we all of a sudden care about
pg_xlog?
Hmm, dunno. We always have had checkpoint_segments setting to limit
that, I was just thinking of retaining that functionality.
A few reasons spring to mind: First, running out of WAL space leads to a
PANIC, which is not nice (I know, we talked about fixing that).
Secondly, because we can. If a user inserts 10 GB of data into a table,
we'll have to just store it, but with WAL, we can always issue a
checkpoint to shrink it. People have asked for quotas for user data too,
so some people do want to limit disk usage.
Mind you, it's possible to have a tiny database with a high TPS rate,
such that the WAL grows really big compared to the size of the user
data. Something with a small hot table that's updated a lot. In such a
scenario, limiting the WAL size makes sense, and it won't affect
performance much either because checkpointing a small database is very
cheap.
- Heikki
On 05.06.2013 22:18, Kevin Grittner wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> wrote:
I was not thinking of making it a hard limit. It would be just
like checkpoint_segments from that point of view - if a
checkpoint takes a long time, max_wal_size might still be
exceeded.
Then I suggest we not use exactly that name. I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.
You're probably right. Any suggestions for a better name?
wal_size_soft_limit?
- Heikki
On 05.06.2013 22:24, Fujii Masao wrote:
On Thu, Jun 6, 2013 at 3:35 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
The checkpoint spreading code already tracks if the checkpoint is "on
schedule", and it takes into account both checkpoint_timeout and
checkpoint_segments. Ie. if you consume segments faster than expected, the
checkpoint will speed up as well. Once checkpoint_segments is reached, the
checkpoint will complete ASAP, with no delays to spread it out.
Yep, right. One problem is that this mechanism doesn't work in the standby.
Sure it does:
commit 71815306e9e1ba7e95752779d2ad51d0c2b9c747
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed Jun 9 15:04:07 2010 +0000
In standby mode, respect checkpoint_segments in addition to
checkpoint_timeout to trigger restartpoints. We used to deliberately only
do time-based restartpoints, because if checkpoint_segments is small we
would spend time doing restartpoints more often than really necessary.
But now that restartpoints are done in bgwriter, they're not as
disruptive as they used to be. Secondly, because streaming replication
stores the streamed WAL files in pg_xlog, we want to clean it up more
often to avoid running out of disk space when checkpoint_timeout is large
and checkpoint_segments small.
Patch by Fujii Masao, with some minor changes by me.
One problem with that is that if you set checkpoint_segments (or
max_wal_size, under the proposal) lower in the standby than in the
master, we can't do restartpoints any more frequently than checkpoints
have happened in the master. I wasn't planning to do anything about that.
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 05.06.2013 22:18, Kevin Grittner wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> wrote:
I was not thinking of making it a hard limit. It would be just
like checkpoint_segments from that point of view - if a
checkpoint takes a long time, max_wal_size might still be
exceeded.
Then I suggest we not use exactly that name. I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.
You're probably right. Any suggestions for a better name?
wal_size_soft_limit?
After reading later posts on the thread, I would be inclined to
support making it a hard limit and adapting the behavior to match.
I'm pretty sure I've seen at least one case where a separate
filesystem has been allocated for WAL which has been unexpectedly
filled. People would like some way to deal with that.
I'm also concerned about the "spin up" from idle to high activity.
Perhaps a "min" should also be present, to mitigate repeated short
checkpoint cycles for "bursty" environments?
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 06.06.2013 15:31, Kevin Grittner wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> wrote:
On 05.06.2013 22:18, Kevin Grittner wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> wrote:
I was not thinking of making it a hard limit. It would be just
like checkpoint_segments from that point of view - if a
checkpoint takes a long time, max_wal_size might still be
exceeded.
Then I suggest we not use exactly that name. I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.
You're probably right. Any suggestions for a better name?
wal_size_soft_limit?
After reading later posts on the thread, I would be inclined to
support making it a hard limit and adapting the behavior to match.
Well, that's a lot more difficult to implement. And even if we have a
hard limit, I think many people would still want to have a soft limit
that would trigger a checkpoint, but would not stop WAL writes from
happening. So what would we call that?
I'd love to see a hard limit too, but I see that as an orthogonal feature.
How about calling the (soft) limit "checkpoint_wal_size"? That goes well
together with checkpoint_timeout, meaning that a checkpoint will be
triggered if you're about to exceed the given size.
I'm also concerned about the "spin up" from idle to high activity.
Perhaps a "min" should also be present, to mitigate repeated short
checkpoint cycles for "bursty" environments?
With my proposal, you wouldn't get repeated short checkpoint cycles with
bursts. The checkpoint interval would be controlled by
checkpoint_timeout, and checkpoint_wal_size. If there is a lot of
activity, then checkpoints will happen more frequently, as
checkpoint_wal_size is reached sooner. But it would not depend on the
activity in previous checkpoint cycles, only the current one, so it
would not make a difference if you have a continuously high load, or a
bursty one.
The history would matter for the calculation of how many segments to
preallocate/recycle, however. Under the proposal, that would be
calculated separately from checkpoint_wal_size, and for that we'd use
some kind of a moving average of how many segments were used in previous
cycles. A min setting might be useful for that. We could also try to
make WAL file creation cheaper, ie. by using posix_fallocate(), as was
proposed in another thread, and doing it in bgwriter or walwriter. That
would make it less important to get the estimate right, from a
performance point of view, although you'd still want to get it right to
avoid running out of disk space (having the segments preallocated
ensures that they are available when needed).
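For the file creation part, here is a minimal standalone sketch of the
posix_fallocate() idea (invented file name; this is not the patch from that
other thread): reserving the 16 MB up front avoids both the zero-filling
writes and a later out-of-space failure on that particular file.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)

/*
 * Sketch only: reserve space for one WAL-sized file with posix_fallocate().
 * On success the blocks are reserved, so later writes into the segment
 * shouldn't fail with ENOSPC, and we skip writing 16 MB of zeros.
 */
int
main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "dummy_wal_segment";
    int         fd;
    int         rc;

    fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    rc = posix_fallocate(fd, 0, WAL_SEG_SIZE);
    if (rc != 0)
    {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
        close(fd);
        unlink(path);
        return 1;
    }

    close(fd);
    printf("reserved %d bytes in %s\n", WAL_SEG_SIZE, path);
    return 0;
}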
- Heikki
Daniel,
So your suggestion is that if archiving is falling behind, we should
introduce delays on COMMIT in order to slow down the rate of WAL writing?
Just so I'm clear.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Then I suggest we not use exactly that name. I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.
You're probably right. Any suggestions for a better name?
wal_size_soft_limit?
"checkpoint_size_limit", or something similar. That is, what you're
defining is:
"this is the size at which we trigger a checkpoint even if
checkpoint_timeout has not been exceeded".
However, I think it's worth considering: if we're doing this "sizing
checkpoints based on prior cycles" thing, do we really need a size_limit
*at all* for most users? I can see how a hard limit is useful, but not
how a soft limit is.
Most of our users most of the time don't care how large WAL is as long
as it doesn't exceed disk space. And on most databases, hitting
checkpoint_timeout is more frequent than hitting checkpoint_segments --
at least in my substantial performance-tuning experience. So I think
most users would prefer a setting which essentially says "make WAL as
big as it has to be in order to maximize throughput", and wouldn't worry
about the disk space.
Yeah, something like that :-). I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:
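Roughly, the idea is an estimate that jumps up immediately but declines only
slowly. A minimal standalone sketch, with invented names and weights (this is
not the actual bgwriter code; the patch later in this thread applies the same
idea to the distance between checkpoints):

#include <stdio.h>

/*
 * Sketch only: an estimate that reacts to increases at once but declines
 * like a slow moving average, so quiet periods don't dial it down too fast.
 */
static double estimate = 0;

static void
update_estimate(double observed)
{
    if (observed > estimate)
        estimate = observed;                            /* jump up immediately */
    else
        estimate = 0.90 * estimate + 0.10 * observed;   /* decay slowly */
}

int
main(void)
{
    double used[] = {10, 80, 20, 20, 20, 20};           /* e.g. segments per cycle */

    for (int i = 0; i < 6; i++)
    {
        update_estimate(used[i]);
        printf("cycle %d: used %.0f, estimate %.1f\n", i, used[i], estimate);
    }
    return 0;
}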
Seems reasonable. Given the behavior of xlog, I'd want to adjust the
algo so that peak usage on a 24-hour basis would affect current
preallocation. That is, if a site regularly has a peak from 2-3pm where
they're using 180 segments/cycle, then they should still be somewhat
higher at 2am than a database which doesn't have that peak. I'm pretty
sure that the bgwriter's moving average cycles much shorter time scales
than that.
Well, the ideal unit from the user's point of view is *time*, not space.
That is, the user wants the master to keep, say, "8 hours of
transaction logs", not any amount of MB. I don't want to complicate
this proposal by trying to deliver that, though.
OTOH, if you specify it in terms of time, then you don't have any limit
on the amount of disk space required.
Well, the best setup from my perspective as a remote DBA for a lot of
clients would be two-factor:
wal_keep_time: ##hr
wal_keep_size_limit: ##GB
That is, we would try to keep ##hr of WAL around for the standbys,
unless that amount exceeded ##GB (at which point we'd write a warning to
the logs). If max_wal_size was a hard limit, we wouldn't need
wal_keep_size_limit, of course.
However, to some degree Andres' work will render all this
wal_keep_segments stuff obsolete by letting the master track what
segment was last consumed by each replica, so I don't think it's worth
pursuing this line of thinking a lot further.
In any case, I'm just pointing out that we need to think of
wal_keep_segments as part of the total WAL size, and not as something
separate, because that's confusing our users.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Jun 5, 2013 at 8:20 PM, Joshua D. Drake <jd@commandprompt.com>wrote:
On 06/05/2013 05:37 PM, Robert Haas wrote:
- If it looks like we're going to exceed limit #3 before the
checkpoint completes, we start exerting back-pressure on writers by
making them wait every time they write WAL, probably in proportion to
the number of bytes written. We keep ratcheting up the wait until
we've slowed down writers enough that it will finish within limit #3. As
we reach limit #3, the wait goes to infinity; only read-only
operations can proceed until the checkpoint finishes.
Alright, perhaps I am dense. I have read both this thread and the other
one on better handling of archive command
(http://www.postgresql.org/message-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com).
I recognize there are brighter minds than mine on this thread but I just
honestly don't get it.
1. WAL writes are already fast. They are the fastest write we have because
it is sequential.
2. We don't want them to be slow. We want data written to disk as quickly
as possible without adversely affecting production. That's the point.
If speed of archiving is the fundamental bottleneck on the system, how does
that bottleneck get communicated forward to the user? PANICs are a
horrible way of doing it, throttling the writing of WAL (and hence the
acceptance of COMMITs) seems like a reasonable alternative. Maybe speed
of archiving is not the fundamental bottleneck on your systems, but...
3. The spread checkpoints have always confused me. If anything we want a
checkpoint to be fast and short because:
4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to be in the range of
never noticed.
They do different things. One writes buffers out to make room for incoming
ones. One writes them out (and fsyncs the underlying files) to allow redo
pointer to advance (limiting soft recovery time) and xlogs to be recycled
(limiting disk space).
Now perhaps my customers workloads are different but for us:
1. Checkpoint timeout is set as high as reasonable, usually 30 minutes to
an hour. I wish I could set them even further out.
Yeah, I think the limit of 1 hr is rather nanny-ish. I know what I'm
doing, and I want the freedom to go longer if that is what I want to do.
2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
based on an actual amount of IO bandwidth it may take per second based on
their IO constraints. (Note I know that wal_writer comes into play here but
I honestly don't remember where and am reading up on it to refresh my
memory).
I find bgwriter to be almost worthless, at least since the fsync queue
compaction code went in. When io is free-flowing the kernel accepts writes
almost instantaneously, and so the backends can write out dirty buffers
themselves very quickly and it is not worth off-loading to a background
process. When IO is constipated, it would be worth off-loading except in
those circumstances the bgwriter cannot possibly keep up.
3. The biggest issue we see with checkpoint segments is not running out of
space because really.... 10GB is how many checkpoint segments? It is with
wal_keep_segments. If we don't want to fill up the pg_xlog directory, put
the wal logs that are for keep_segments elsewhere.
Which is what archiving does. But then you have to put a lot of thought
into how to clean up the archive, assuming your policy is not to keep it
forever. keep_segments can be a nice compromise.
Other oddities:
Yes checkpoint_segments is awkward. We shouldn't have to set it at all. It
should be gone. Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of data
being written. In order to not suffer from recycling and creation penalties
we always keep X+N where N is enough to keep up with new data.
Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter we should be able to rather indefinitely not worry
about checkpoints (with a few exceptions such as pg_start_backup()).
Perhaps a setting that causes a checkpoint to happen based on some
non-artificial threshold (timeout) such as amount of data currently in need
of a checkpoint?
Without checkpoints, how would the redo pointer ever advance?
If the system is io limited during recovery, then checkpoint_segments is a
fairly natural way to put a limit on how long recovery from a soft crash
will take. If the system is CPU limited during recovery, then
checkpoint_timeout is a fairly natural way to put a limit on how long
recovery will take. It is probably possible to come up with a single merged
setting that is better than both of those in almost all circumstances, but
how much work would that take to get right?
...
Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention.
If the SA isn't paying attention, who is it that we are loudly saying these
things to?
If whatever caused archiving to break also caused the archiving failure
emails to not be delivered, about the only way you can get louder is by
refusing new requests from the end user.
Perhaps if that area starts to get to an unhappy place we immediately
bounce into read-only mode and log even more loudly that the SA should be
fired. I would think read-only mode is safer and more polite than a PANIC
crash.
Isn't that effectively what throttling WAL writing is?
Cheers,
Jeff
On Thu, Jun 6, 2013 at 1:42 AM, Joshua D. Drake <jd@commandprompt.com>wrote:
I may be confused but it is my understanding that bgwriter writes out the
data from the shared buffer cache that is dirty based on an interval and a
max pages written.
It primarily writes out based on how many buffers have recently needed to
be evicted in order to make room to read in new ones. There are secondary
clamp limits based on an interval (it does enough work to circle the buffer
pool once every 2 minutes) and another on max pages written but the main
one is based on recent usage. I've never really understood the point of
those secondary clamps.
This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC?
Because you want to limit the amount of disk space used for WAL. It's a
soft limit, but still.
Why? This is the point that confuses me. Why do we care? We don't care how
much disk space PGDATA takes... why do we all of a sudden care about
pg_xlog?
Presumably someone cares about disk space of PGDATA, but it is probably a
different person, at a different time, on a different time scale. PGDATA
is a long term planning issue, pg_xlog is an operational issue. If the
checkpoint had completed 30 seconds earlier or the archive_command had
completed 30 seconds earlier (or the commit rate had been throttled for 30
seconds), then pg_xlog would not have run out of space in the first place.
Having averted the crisis, maybe it will never arise again, or maybe it
will but we will be able to avoid it again. If we delay running out of
room on PGDATA for 30 seconds, well, we still ran out of room.
Cheers,
Jeff
On 06.06.2013 20:24, Josh Berkus wrote:
Yeah, something like that :-). I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:
Seems reasonable.
Here's a patch implementing that. Docs not updated yet. I did not change
the way checkpoint_segments triggers checkpoints - that can be a
separate patch. This only decouples the segment preallocation behavior
from checkpoint_segments. With the patch, you can set
checkpoint_segments really high, without consuming that much disk space
all the time.
Given the behavior of xlog, I'd want to adjust the
algo so that peak usage on a 24-hour basis would affect current
preallocation. That is, if a site regularly has a peak from 2-3pm where
they're using 180 segments/cycle, then they should still be somewhat
higher at 2am than a database which doesn't have that peak. I'm pretty
sure that the bgwriter's moving average cycles much shorter time scales
than that.
Makes sense. I didn't implement that in the attached, though.
Having a separate option to specify a minimum number of segments (or
rather minimum size in MB) to keep preallocated would at least allow a
DBA to set that manually, based on the observed peak. I didn't implement
such a manual option in the attached, but that would be easy.
- Heikki
Attachment: dynamic-xlogfileslop-1.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40b780c..5244ce1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -89,18 +89,11 @@ bool XLOG_DEBUG = false;
#endif
/*
- * XLOGfileslop is the maximum number of preallocated future XLOG segments.
- * When we are done with an old XLOG segment file, we will recycle it as a
- * future XLOG segment as long as there aren't already XLOGfileslop future
- * segments; else we'll delete it. This could be made a separate GUC
- * variable, but at present I think it's sufficient to hardwire it as
- * 2*CheckPointSegments+1. Under normal conditions, a checkpoint will free
- * no more than 2*CheckPointSegments log segments, and we want to recycle all
- * of them; the +1 allows boundary cases to happen without wasting a
- * delete/create-segment cycle.
+ * Estimated distance between checkpoints, in bytes, and measured distance of
+ * previous checkpoint cycle.
*/
-#define XLOGfileslop (2*CheckPointSegments + 1)
-
+static double CheckPointDistanceEstimate = 0;
+static double PrevCheckPointDistance = 0;
/*
* GUC support
@@ -668,7 +661,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
+static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
@@ -1458,11 +1451,85 @@ AdvanceXLInsertBuffer(bool new_segment)
}
/*
+ * XLOGfileslop is the maximum number of preallocated future XLOG segments.
+ * When we are done with an old XLOG segment file, we will recycle it as a
+ * future XLOG segment as long as there aren't already XLOGfileslop future
+ * segments; else we'll delete it.
+ */
+static int
+XLOGfileslop(XLogRecPtr PriorRedoPtr, XLogRecPtr CurrPtr)
+{
+ double nsegments;
+ double targetPtr;
+ double distance;
+
+ /*
+ * The number segments to preallocate/recycle is based on two things:
+ * an estimate of how much WAL is consumed between checkpoints, and the
+ * current distance from the prior checkpoint (ie. the point at which
+ * we're about to truncate the WAL) to the current WAL insert location.
+ *
+ * First, calculate how much WAL space the system would need, if it ran
+ * steady, using the estimated amount of WAL generated between every
+ * checkpoint cycle. Then see how much WAL is actually in use at the moment
+ * (= the distance between Prior redo pointer and current WAL insert
+ * location). The difference between the two is how much WAL we should keep
+ * preallocated, so that backends won't have to create new WAL segments.
+ *
+ * The reason we do these calculations from the prior checkpoint, not the
+ * one that just finished, is that this behaves better if some checkpoint
+ * cycles are abnormally short, like if you perform a manual checkpoint
+ * right after a timed one. The manual checkpoint will make almost
+ * a full cycle's worth of WAL segments available for recycling, because
+ * the segments from the prior's prior, fully-sized checkpoint cycle are
+ * no longer needed. However, the next checkpoint will make only few
+ * segments available for recycling, the ones generated between the timed
+ * checkpoint and the manual one right after that. If at the manual
+ * checkpoint we only retained enough segments to get us to the next timed
+ * one, and removed the rest, then at the next checkpoint we would not have
+ * enough segments around for recycling, to get us to the checkpoint after
+ * that. Basing the calculations on the distance from the prior redo
+ * pointer largely fixes that problem.
+ */
+
+ /*
+ * First calculate the expected distance from the redo pointer of a prior
+ * checkpoint to the point where the next one finishes, assuming that
+ * the system runs steady all the time.
+ */
+ distance = (2 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
+
+ /* add 10% for good measure */
+ distance *= 1.10;
+
+ /*
+ * Based on that, calculate the expected point where the next checkpoint
+ * finishes.
+ */
+ targetPtr = (double) PriorRedoPtr + distance;
+
+ /*
+ * How many segments do we need to get from the current insert location
+ * to the end of next checkpoint? That's how many segments we should keep
+ * preallocated.
+ */
+ if (targetPtr > CurrPtr)
+ nsegments = (targetPtr - CurrPtr) / XLOG_SEG_SIZE;
+ else
+ nsegments = 0;
+
+ /* add one segment to round up. */
+ nsegments += 1.0;
+
+ return (int) nsegments;
+}
+
+/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
* new_segno indicates a log file that has just been filled up (or read
- * during recovery). We measure the distance from RedoRecPtr to new_segno
- * and see if that exceeds CheckPointSegments.
+ * during recovery). We measure the distance from RedoRecPtr to new_segno,
+ * and estimate based on that if we're about to exceed checkpoint_segments.
*
* Note: it is caller's responsibility that RedoRecPtr is up-to-date.
*/
@@ -2357,9 +2424,14 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
* pre-existing file. Otherwise, cope with possibility that someone else
* has created the file while we were filling ours: if so, use ours to
* pre-create a future log segment.
+ *
+ * XXX: We don't have a good estimate of how many WAL files we should keep
+ * preallocated here. Quite arbitrarily, use max_advance=5. That's good
+ * enough for current use of this function; this only gets called when
+ * there are no more preallocated WAL segments available.
*/
installed_segno = logsegno;
- max_advance = XLOGfileslop;
+ max_advance = CheckPointSegments;
if (!InstallXLogFileSegment(&installed_segno, tmppath,
*use_existent, &max_advance,
use_lock))
@@ -2888,7 +2960,7 @@ UpdateLastRemovedPtr(char *filename)
* whether we want to recycle rather than delete no-longer-wanted log files.
*/
static void
-RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
+RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
{
XLogSegNo endlogSegNo;
int max_advance;
@@ -2907,7 +2979,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
* segments up to XLOGfileslop segments beyond the current XLOG location.
*/
XLByteToPrevSeg(endptr, endlogSegNo);
- max_advance = XLOGfileslop;
+ max_advance = XLOGfileslop(PriorRedoPtr, endptr);
xldir = AllocateDir(XLOGDIR);
if (xldir == NULL)
@@ -6708,7 +6780,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
+ "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "distance=%d KB, estimate=%d KB",
CheckpointStats.ckpt_bufs_written,
(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
CheckpointStats.ckpt_segs_added,
@@ -6719,12 +6792,14 @@ LogCheckpointEnd(bool restartpoint)
total_secs, total_usecs / 1000,
CheckpointStats.ckpt_sync_rels,
longest_secs, longest_usecs / 1000,
- average_secs, average_usecs / 1000);
+ average_secs, average_usecs / 1000,
+ (int) (PrevCheckPointDistance / 1024.0), (int) (CheckPointDistanceEstimate / 1024.0));
else
elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
+ "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "distance=%d KB, estimate=%d KB",
CheckpointStats.ckpt_bufs_written,
(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
CheckpointStats.ckpt_segs_added,
@@ -6735,7 +6810,45 @@ LogCheckpointEnd(bool restartpoint)
total_secs, total_usecs / 1000,
CheckpointStats.ckpt_sync_rels,
longest_secs, longest_usecs / 1000,
- average_secs, average_usecs / 1000);
+ average_secs, average_usecs / 1000,
+ (int) (PrevCheckPointDistance / 1024.0), (int) (CheckPointDistanceEstimate / 1024.0));
+}
+
+/*
+ * Update the estimate of distance between checkpoints.
+ *
+ * The estimate is maintained for calculating the number of WAL segments to
+ * keep preallocated, see XLOGFileSlop().
+ */
+static void
+UpdateCheckPointDistanceEstimate(uint64 nbytes)
+{
+ /*
+ * To estimate the number of segments consumed between checkpoints, keep
+ * a moving average of the actual number of segments consumed in previous
+ * checkpoint cycles. However, if the load is bursty, with quiet periods and
+ * busy periods, we want to cater for the peak load. So instead of a plain
+ * moving average, we let the average decline slowly if the previous cycle
+ * used less segments than estimated, but increase it immediately if it
+ * used more.
+ *
+ * When checkpoints are triggered by checkpoint_segments, this should
+ * converge to (1.0 + checkpoint_completion_target) * CheckpointSegments,
+ *
+ * XXX should we differentiate between explicitly triggered checkpoints,
+ * and others? The slow-decline will largely mask them out, if they only
+ * happen every now and then. If they are frequent, maybe the estimate
+ * really should count them in as any others; if you issue a manual
+ * checkpoint every 5 minutes and never let a timed checkpoint happen, it
+ * makes sense to base the preallocation on that 5 minute interval rather
+ * than whatever checkpoint_timeout is set to.
+ */
+ PrevCheckPointDistance = nbytes;
+ if (CheckPointDistanceEstimate < nbytes)
+ CheckPointDistanceEstimate = nbytes;
+ else
+ CheckPointDistanceEstimate =
+ (0.90 * CheckPointDistanceEstimate + 0.10 * (double) nbytes);
}
/*
@@ -6775,7 +6888,7 @@ CreateCheckPoint(int flags)
XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogRecData rdata;
uint32 freespace;
- XLogSegNo _logSegNo;
+ XLogRecPtr PriorRedoPtr;
VirtualTransactionId *vxids;
int nvxids;
@@ -7084,10 +7197,10 @@ CreateCheckPoint(int flags)
(errmsg("concurrent transaction log activity while database system is shutting down")));
/*
- * Select point at which we can truncate the log, which we base on the
- * prior checkpoint's earliest info.
+ * Remember the prior checkpoint's redo pointer, used later to determine
+ * the point at which we can truncate the log.
*/
- XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
+ PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update the control file.
@@ -7141,11 +7254,17 @@ CreateCheckPoint(int flags)
* Delete old log files (those no longer needed even for previous
* checkpoint or the standbys in XLOG streaming).
*/
- if (_logSegNo)
+ if (PriorRedoPtr != InvalidXLogRecPtr)
{
+ XLogSegNo _logSegNo;
+
+ /* Update the average distance between checkpoints. */
+ UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
+
+ XLByteToSeg(PriorRedoPtr, _logSegNo);
KeepLogSeg(recptr, &_logSegNo);
_logSegNo--;
- RemoveOldXlogFiles(_logSegNo, recptr);
+ RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, recptr);
}
/*
@@ -7333,7 +7452,7 @@ CreateRestartPoint(int flags)
{
XLogRecPtr lastCheckPointRecPtr;
CheckPoint lastCheckPoint;
- XLogSegNo _logSegNo;
+ XLogRecPtr PriorRedoPtr;
TimestampTz xtime;
/* use volatile pointer to prevent code rearrangement */
@@ -7429,10 +7548,10 @@ CreateRestartPoint(int flags)
CheckPointGuts(lastCheckPoint.redo, flags);
/*
- * Select point at which we can truncate the xlog, which we base on the
- * prior checkpoint's earliest info.
+ * Remember the prior checkpoint's redo pointer, used later to determine
+ * the point at which we can truncate the log.
*/
- XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
+ PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update pg_control, using current time. Check that it still shows
@@ -7459,12 +7578,15 @@ CreateRestartPoint(int flags)
* checkpoint/restartpoint) to prevent the disk holding the xlog from
* growing full.
*/
- if (_logSegNo)
+ if (PriorRedoPtr != InvalidXLogRecPtr)
{
XLogRecPtr receivePtr;
XLogRecPtr replayPtr;
TimeLineID replayTLI;
XLogRecPtr endptr;
+ XLogSegNo _logSegNo;
+
+ XLByteToSeg(PriorRedoPtr, _logSegNo);
/*
* Get the current end of xlog replayed or received, whichever is
@@ -7493,7 +7615,7 @@ CreateRestartPoint(int flags)
if (RecoveryInProgress())
ThisTimeLineID = replayTLI;
- RemoveOldXlogFiles(_logSegNo, endptr);
+ RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, endptr);
/*
* Make more log segments if needed. (Do this after recycling old log
Given the behavior of xlog, I'd want to adjust the
algo so that peak usage on a 24-hour basis would affect current
preallocation. That is, if a site regularly has a peak from 2-3pm where
they're using 180 segments/cycle, then they should still be somewhat
higher at 2am than a database which doesn't have that peak. I'm pretty
sure that the bgwriter's moving average cycles much shorter time scales
than that.
Makes sense. I didn't implement that in the attached, though.
It's possible that it won't matter. Performance testing will tell us.
Having a separate option to specify a minimum number of segments (or
rather minimum size in MB) to keep preallocated would at least allow a
DBA to set that manually, based on the observed peak. I didn't implement
such a manual option in the attached, but that would be easy.
Yeah, I'd really like to get away from adding manual options which need
to be used in non-specialty cases. I think we'll need one at some point
-- there are DB applications which are VERY bursty -- but let's not
start there and see if we can make reasonable autotuning work.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 6/6/13 4:42 AM, Joshua D. Drake wrote:
On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:
(I'm sure you know this, but:) If you perform a checkpoint as fast and
short as possible, the sudden burst of writes and fsyncs will
overwhelm the I/O subsystem, and slow down queries. That's what we saw
before spread checkpoints: when a checkpoint happened, the response
times of queries jumped up.
That isn't quite right. Previously we had lock issues as well and
checkpoints would take considerable time to complete. What I am talking
about is that the background writer (and wal writer where applicable)
have done all the work before a checkpoint is even called.
That is not possible, and if you look deeper at a lot of workloads
you'll eventually see why. I'd recommend grabbing snapshots of
pg_buffercache output from a lot of different types of servers and see
what the usage count distribution looks like. That's what I did in order
to create all of the behaviors the current background writer code caters
to. Attached is a small spreadsheet that shows the main two extremes
here, from one of my old talks. "Effective buffer cache system" is full
of usage count 5 pages, while the "Minimally effective buffer cache" one
is all usage count 1 or 0. We don't have redundant systems here; we
have two that aim at distinctly different workloads. That's one reason
why splitting them apart ended up being necessary to move forward, they
really don't overlap very much on some servers.
Sampling a few servers that way was where the controversial idea of
scanning the whole buffer pool every few minutes even without activity
came from too. I found a bursty real world workload where that was
necessary to keep buffers clean usefully, and that heuristic helped them
a lot. I too would like to visit the exact logic used, but I could cook
up a test case where it's useful again if people really doubt it has any
value. There's one in the 2007 archives somewhere.
The reason the checkpointer code has to do this work, and it has to
spread the writes out, is that on some systems the hot data set hits a
high usage count. If shared_buffers is 8GB and at any moment 6GB of it
has a usage count of 5, which absolutely happens on many busy servers,
the background writer will do almost nothing useful. It won't and
shouldn't touch buffers unless their usage count is low. Those heavily
referenced blocks will only be written to disk once per checkpoint cycle.
Without the spreading, in this example you will drop 6GB into "Dirty
Memory" on a Linux server, call fdatasync, and the server might stop
doing any work at all for *minutes* of time. Easiest way to see it
happen is to set checkpoint_completion_target to 0, put the filesystem
on ext3, and have a server with lots of RAM. I have a monitoring tool
that graphs Dirty Memory over time because this problem is so nasty even
with the spreading code in place.
There is this idea that pops up sometimes that a background writer write
is better than a checkpoint one. This is backwards. A dirty block must
be written at least once per checkpoint. If you only write it once per
checkpoint, inside of the checkpoint process, that is the ideal. It's
what you want for best performance when it's possible.
At the same time, some workloads churn through a lot of low usage count
data, rather than building up a large block of high usage count stuff.
On those your best hope for low latency is to crank up the background
writer and let it try to stay ahead of backends with the writes. The
checkpointer won't have nearly as much work to do in that situation.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 6/6/13 4:41 AM, Heikki Linnakangas wrote:
I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:
Combine what your submitted patch does and this idea, and you'll have
something I prototyped a few years ago. I took the logic and tested it
out in user space by parsing the output from log_checkpoints to see how
many segments were being used. That approach coughed out a value for
checkpoint_segments about as good as the one I picked by hand.
The main problem was it liked to over-tune the segments based on a small
bursts of activity, leaving a value higher than you might want to use
the rest of the time. The background writer didn't worry about this
very much because the cost of making a mistake for one 200ms cycle was
pretty low. Setting checkpoint_segments high is a more expensive issue.
When I set these by hand, I'll aim more to cover a 99th percentile of
the maximum segments number rather than every worst case seen.
I don't think that improvement is worth spending very much effort on
though. The moving average approach is more than good enough in most
cases. I've wanted checkpoint_segments to go away in exactly this
fashion for a while.
The general complaint the last time I suggested a change in this area,
to make checkpoint_segments larger for the average user, was that some
people had seen workloads where that was counterproductive. Pretty sure
Kevin Grittner said he'd seen that happen. That's how I remember this
general idea dying the last time, and I still don't have enough data to
refute that doesn't happen.
As far as the UI, if it's a soft limit I'd suggest wal_size_target for
the name. What I would like to see is a single number here in memory
units that replaces both checkpoint_segments and wal_keep_segments. If
you're willing to use a large chunk of disk space to handle either one
of activity spikes or the class of replication issues wal_keep_segments
targets, I don't see why you'd want to ban using that space for the
other one too.
To put some perspective on how far we've been able to push this in the
field with minimal gripes, the repmgr tool requires wal_keep_segments be
>= 5000, which works out to 78GB. I still see some people use 73GB SAS
drives in production servers for their WAL files, but that's the only
time I've seen that number become scary when deploying repmgr.
Meanwhile, the highest value for checkpoint_segments I've set based on
real activity levels was 1024, on a server where checkpoint_timeout is
15 minutes (and can be no shorter without checkpoint spikes). At no
point during that fairly difficult bit of tuning work did
checkpoint_segments do anything but get in the way.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, Jun 6, 2013 at 10:43 PM, Greg Smith <greg@2ndquadrant.com> wrote:
The general complaint the last time I suggested a change in this area, to
make checkpoint_segments larger for the average user, was that some people
had seen workloads where that was counterproductive. Pretty sure Kevin
Grittner said he'd seen that happen. That's how I remember this general
idea dying the last time, and I still don't have enough data to refute that
doesn't happen.
My guess is that, with Heikki's patch, a lot of the value of keeping
checkpoint_segments low should go away - because if there wasn't much
activity, checkpoint_segments will in effect remain low, even if the
configured value is not so low. And if activity is high, well then
larger checkpoint_segments will be better anyway.
(As to why smaller checkpoint_segments can help, here's my guess: if
checkpoint_segments is relatively small, then when we recycle a
segment we're likely to find its data already in cache. That's a lot
better than reading it back in from disk just to overwrite the data.)
As far as the UI, if it's a soft limit I'd suggest wal_size_target for the
name. What I would like to see is a single number here in memory units that
replaces both checkpoint_segments and wal_keep_segments. If you're willing
to use a large chunk of disk space to handle either one of activity spikes
or the class of replication issues wal_keep_segments targets, I don't see
why you'd want to ban using that space for the other one too.
This isn't really making sense to me. I don't think we should assume
that someone who wants to keep WAL around for replication also wants
to wait longer between checkpoints. Those are two quite different
things.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
(As to why smaller checkpoint_segments can help, here's my guess:
if checkpoint_segments is relatively small, then when we recycle
a segment we're likely to find its data already in cache. That's
a lot better than reading it back in from disk just to overwrite
the data.)
My recollection on this topic is that before pg_upgrade Wisconsin
Courts had to upgrade all of the geographically distributed
databases to a new PostgreSQL version, and that was being done with
pg_dump piped to psql in conjunction with the rollout of new
hardware (according to the four-year replacement policy). The
upgrade process involved a DBA staying late centrally while the
conversion ran, a field tech staying late on the client site to
haul off the old box once successful conversion was confirmed, a
business analyst staying late to confirm proper operation after the
conversion, and a web programmer staying late to confirm that all
web interfaces showed proper data flow post-conversion. Every
minute shaved off of the upgrade process saved a lot of staff time,
so the DBA team tested the conversion process very carefully.
Some findings were unsurprising, like that a direct connection
between the servers using a cross-wired network patch cable was
faster than plugging both machines into the same switch. But we
tested all of our assumptions, and re-tested the surprising ones.
One such surprise was that the conversion ran faster, even on a
"largish" database of around 200GB, with 3 checkpoint_segments than
with larger settings. The difference was significant and
repeatable. My personal theory was that segments were being
recycled and overwritten while still in the battery-backed
controller cache, so writes from multiple cycles evaporated in the
cache, reducing total physical disk writes. Greg Smith blew that
theory out of the water by finding the same behavior on his laptop,
which did not have a write-back cache. AFAIK, this mystery remains
unsolved, although Robert's idea above sounds plausible.
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 7, 2013 at 3:14 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Some findings were unsurprising, like that a direct connection
between the servers using a cross-wired network patch cable was
faster than plugging both machines into the same switch. But we
tested all of our assumptions, and re-tested the surprising ones.
One such surprise was that the conversion ran faster, even on a
"largish" database of around 200GB, with 3 checkpoint_segments than
with larger settings.
!
I can't account for that finding, because my experience is that small
checkpoint_segments settings lead to *terrible* bulk restore
performance.
*scratches head*
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
Kevin Grittner <kgrittn@ymail.com> wrote:
One such surprise was that the conversion ran faster, even on a
"largish" database of around 200GB, with 3 checkpoint_segments
than with larger settings.
!
I can't account for that finding, because my experience is that
small checkpoint_segments settings lead to *terrible* bulk
restore performance.
*scratches head*
Perhaps it was due to some of the "running with scissors" settings
we used for the upgrade process that we don't normally use, like
fsync = off and full_page_writes = off. We also used a larger than
usual maintenance_work_mem which reduced disk sorts, possibly
helping the WAL files to remain cached on the controller.
Maybe it also helped keep data flowing to the actual disks, so that
it didn't alternate between "idle" and "glutted" states, although I
don't have any evidence to support that theory.
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/7/13 2:43 PM, Robert Haas wrote:
name. What I would like to see is a single number here in memory units that
replaces both checkpoint_segments and wal_keep_segments.
This isn't really making sense to me. I don't think we should assume
that someone who wants to keep WAL around for replication also wants
to wait longer between checkpoints. Those are two quite different
things.
It's been years since I saw anyone actually using checkpoint_segments as
that sort of limit. I see a lot of sites pushing the segments limit up
and then using checkpoint_timeout carefully. It's pretty natural to say
"I don't want to go more than X minutes between checkpoints". The case
for wanting to say "I don't want to go more than X MB between
checkpoints" instead, motivated by not wanting too much activity to
queue between them, I'm just not seeing demand for that now.
The main reason I do see people paying attention to checkpoint_segments
still is to try and put a useful bound on WAL disk space usage. That's
the use case I think overlaps with wal_keep_segments such that you might
replace both of them. I think we really only need one control that
limits how much WAL space is expected inside of pg_xlog, and it should
be easy and obvious how to set it.
The more I look at this checkpoint_segments patch, the more I wonder why
it's worth even bothering with anything but a disk space control here.
checkpoint_segments is turning into an internal implementation detail
most sites I see wouldn't miss at all. Rather than put work into
autotuning it, I'd be happy to eliminate checkpoint_segments altogether,
in favor of a WAL disk space limit.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 06/07/2013 01:00 AM, Josh Berkus wrote:
Daniel,
So your suggestion is that if archiving is falling behind, we should
introduce delays on COMMIT in order to slow down the rate of WAL writing?
Delaying commit wouldn't be enough; consider a huge COPY, which can
produce a lot of WAL at a high rate without a convenient point to delay at.
I expect a delay after writing an xlog record would make a more suitable
write-rate throttle, though I'd want to be sure the extra branch didn't
hurt performance significantly. Branch prediction hints would help;
since we don't *care* if the delay branch is slow and causes pipeline
stalls, tagging the no-delay branch as likely would probably deal with
that concern for supported compilers/platforms.
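A minimal standalone sketch of what that check might look like (invented
names, not actual xlog code), with the no-throttle path hinted as the likely
one:

#include <stdint.h>
#include <unistd.h>

/*
 * Sketch only: none of these names exist in PostgreSQL. The point is just
 * that the common no-throttle branch can be hinted to the compiler, so the
 * extra check after each WAL insert costs almost nothing when throttling
 * is off.
 */
#define likely(x)   __builtin_expect(!!(x), 1)

static volatile uint32_t wal_throttle_usec = 0;    /* set by whatever monitors archiving */

static void
maybe_throttle_after_wal_insert(uint32_t bytes_written)
{
    uint32_t delay = wal_throttle_usec;

    if (likely(delay == 0))
        return;                                    /* fast path: no throttling */

    /* Sleep roughly in proportion to the amount of WAL just written. */
    usleep(delay * (bytes_written / 8192 + 1));
}

int
main(void)
{
    maybe_throttle_after_wal_insert(8192);         /* no-op while the knob is zero */
    return 0;
}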
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06/06/2013 03:21 PM, Joshua D. Drake wrote:
Not to be unkind but the problems of the uninformed certainly are not
the problems of the informed. Or perhaps they are certainly the
problems of the informed :P.
I'm not convinced that's a particularly good argument not to improve
something. Sure, it might be a usability issue not a purely technical
issue, but that IMO doesn't make it much less worth fixing.
Bad usability puts people off early, before they can become productive
and helpful community members. It also puts others off trying the
software at all by reputation alone.
In any case, I don't think this is an issue of the informed vs
uninformed. It's also a matter of operational sanity at scale. "The
sysadmin" can't watch 100,000 individual servers and jump in to make
minute tweaks - nor should they have to when some auto-tuning could
obviate the need.
The same issue exists with vacuum - it's hard for basic users to
understand, so they misconfigure it and often achieve the opposite
results to what they need. It's been getting better, but some
feedback-based control would make a world of difference when running Pg.
In this I really have to agree with Heikki and Daniel - more usable and
preferably feedback-tuned defaults would be really, really nice to have,
though I'd want good visibility (logging, SHOW commands, etc) into what
they were doing and options to override for special cases.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6/6/13 4:09 PM, Heikki Linnakangas wrote:
On 06.06.2013 20:24, Josh Berkus wrote:
Yeah, something like that :-). I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:
Seems reasonable.
Here's a patch implementing that. Docs not updated yet. I did not change
the way checkpoint_segments triggers checkpoints - that can be a
separate patch. This only decouples the segment preallocation behavior
from checkpoint_segments. With the patch, you can set
checkpoint_segments really high, without consuming that much disk space
all the time.
I don't understand what this patch, by itself, will accomplish in terms
of the originally stated goals of making checkpoint_segments easier to
tune, and controlling disk space used. To some degree, it makes both of
these things worse, because you can no longer use checkpoint_segments to
control the disk space. Instead, it is replaced by magic.
What sort of behavior are you expecting to come out of this? In testing,
I didn't see much of a difference. Although I'd expect that this would
actually preallocate fewer segments than the old formula.
Two small issues in the code:
Code change doesn't match comment:
+ *
+ * XXX: We don't have a good estimate of how many WAL files we should keep
+ * preallocated here. Quite arbitrarily, use max_advance=5. That's good
+ * enough for current use of this function; this only gets called when
+ * there are no more preallocated WAL segments available.
*/
installed_segno = logsegno;
- max_advance = XLOGfileslop;
+ max_advance = CheckPointSegments;
KB should be kB.
On 07/03/2013 11:28 AM, Peter Eisentraut wrote:
On 6/6/13 4:09 PM, Heikki Linnakangas wrote:
I don't understand what this patch, by itself, will accomplish in terms
of the originally stated goals of making checkpoint_segments easier to
tune, and controlling disk space used. To some degree, it makes both of
these things worse, because you can no longer use checkpoint_segments to
control the disk space. Instead, it is replaced by magic.
What sort of behavior are you expecting to come out of this? In testing,
I didn't see much of a difference. Although I'd expect that this would
actually preallocate fewer segments than the old formula.
Since I haven't seen a reply to Peter's comments from Heikki, I'm
marking this patch "returned with feedback". I know, it's a very busy
CF, and I'm sure that you just couldn't get back to this one. We'll
address it in September?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 03.07.2013 21:28, Peter Eisentraut wrote:
On 6/6/13 4:09 PM, Heikki Linnakangas wrote:
Here's a patch implementing that. Docs not updated yet. I did not change
the way checkpoint_segments triggers checkpoints - that can be a
separate patch. This only decouples the segment preallocation behavior
from checkpoint_segments. With the patch, you can set
checkpoint_segments really high, without consuming that much disk space
all the time.
I don't understand what this patch, by itself, will accomplish in terms
of the originally stated goals of making checkpoint_segments easier to
tune, and controlling disk space used. To some degree, it makes both of
these things worse, because you can no longer use checkpoint_segments to
control the disk space. Instead, it is replaced by magic.
The patch addressed the third point in my first post:
A third point is that even if you have 10 GB of disk space reserved
for WAL, you don't want to actually consume all that 10 GB, if it's
not required to run the database smoothly. There are several reasons
for that: backups based on a filesystem-level snapshot are larger
than necessary, if there are a lot of preallocated WAL segments and
in a virtualized or shared system, there might be other VMs or
applications that could make use of the disk space. On the other
hand, you don't want to run out of disk space while writing WAL -
that can lead to a PANIC in the worst case.
What sort of behavior are you expecting to come out of this? In testing,
I didn't see much of a difference. Although I'd expect that this would
actually preallocate fewer segments than the old formula.
For example, suppose you set checkpoint_segments to 200, and you temporarily
generate 100 segments of WAL during an initial data load, but the normal
workload generates only 20 segments between checkpoints. Without the
patch, you will permanently have about 120 segments in pg_xlog, created
by the spike. With the patch, the extra segments will be gradually
removed after the data load, down to the level needed by the constant
workload. That would be about 50 segments, assuming the default
checkpoint_completion_target=0.5.
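(Roughly: the steady-state pg_xlog size is about
(2 + checkpoint_completion_target) * WAL-per-cycle, so 2.5 * 20 = 50 segments here.)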
Here's a bigger patch, which does more. It is based on the ideas in the
post I started this thread with, with feedback incorporated from the
long discussion. With this patch, WAL disk space usage is controlled by
two GUCs:
min_recycle_wal_size
checkpoint_wal_size
These GUCs act as soft minimum and maximum on overall WAL size. At each
checkpoint, the checkpointer removes enough old WAL files to keep
pg_xlog usage below checkpoint_wal_size, and recycles enough new WAL
files to reach min_recycle_wal_size. Between those limits, there is a
self-tuning mechanism to recycle just enough WAL files to get to end of
the next checkpoint without running out of preallocated WAL files. To
estimate how many files are needed for that, a moving average of how
much WAL is generated between checkpoints is calculated. The moving
average is updated with "fast-rise slow-decline" behavior, to cater for
peak rather than true average use to some extent.
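Concretely, the update rule in the attached patch (see
UpdateCheckPointDistanceEstimate() in the diff below) boils down to:

    /* nbytes = WAL consumed during the checkpoint cycle that just ended */
    if (CheckPointDistanceEstimate < nbytes)
        CheckPointDistanceEstimate = nbytes;    /* spike: jump up immediately */
    else
        CheckPointDistanceEstimate =
            0.90 * CheckPointDistanceEstimate + 0.10 * (double) nbytes;  /* decay slowly */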
As today, checkpoints are triggered based on time or WAL usage,
whichever comes first. WAL-based checkpoints are triggered based on the
good old formula: CheckPointSegments = (checkpoint_wal_size / (2.0 +
checkpoint_completion_target)) / 16MB. CheckPointSegments controls that
like before, but it is now an internal variable derived from
checkpoint_wal_size, not visible to users.
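For example, with the default checkpoint_wal_size of 256 MB and
checkpoint_completion_target = 0.5, that works out to 256 MB / 2.5 = ~102 MB,
i.e. CheckPointSegments = 6 with 16 MB segments (the calculation rounds down,
with a floor of one segment).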
These settings are fairly intuitive for a DBA to tune. You begin by
figuring out how much disk space you can afford to spend on WAL, and set
checkpoint_wal_size to that (with some safety margin, of course). Then
you set checkpoint_timeout based on how long you're willing to wait for
recovery to finish. Finally, if you have infrequent batch jobs that need
a lot more WAL than the system otherwise needs, you can set
min_recycle_wal_size to keep enough WAL preallocated for the spikes.
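For instance (illustrative numbers only, not a recommendation):

    checkpoint_wal_size = 8GB       # disk budget for pg_xlog, minus a safety margin
    checkpoint_timeout = 10min      # acceptable crash-recovery window
    min_recycle_wal_size = 1GB      # keep this much preallocated for batch-job spikes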
You can also set min_recycle_wal_size = checkpoint_wal_size, which gets
you the same behavior as without the patch, except that it's more
intuitive to set it in terms of "MB of WAL space required", instead of
"# of segments between checkpoints".
Does that make sense? I'd love to hear feedback on how people setting up
production databases would like to tune these things. The reason for the
auto-tuning between the min and max is to be able to set reasonable
defaults e.g for embedded systems that don't have a DBA to do tuning.
Currently, it's very difficult to come up with a reasonable default
value for checkpoint_segments which would work well for a wide range of
systems. The PostgreSQL default of 3 is way way too low for most
systems. On the other hand, if you set it to, say, 20, that's a lot of
wasted space for a small database that's not updated much. With this
patch, you can set "checkpoint_wal_size=1GB" and if the database ends up
actually only needing 100 MB of WAL, it will only use that much and not
waste 900 MB for useless preallocated WAL files.
These GUCs are still soft limits. If the system is busy enough that the
checkpointer can't reach its target, it can exceed checkpoint_wal_size.
Making it a hard limit is a much bigger task than I'm willing to tackle
right now.
- Heikki
Attachments:
redesign-checkpoint-segments-1.patch (text/x-diff)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1036,1042 **** include 'filename'
usually require a corresponding increase in
<varname>checkpoint_segments</varname>, in order to spread out the
process of writing large quantities of new or changed data over a
! longer period of time.
</para>
<para>
--- 1036,1042 ----
usually require a corresponding increase in
<varname>checkpoint_segments</varname>, in order to spread out the
process of writing large quantities of new or changed data over a
! longer period of time. FIXME: What should we suggest here now?
</para>
<para>
***************
*** 1958,1974 **** include 'filename'
<title>Checkpoints</title>
<variablelist>
! <varlistentry id="guc-checkpoint-segments" xreflabel="checkpoint_segments">
! <term><varname>checkpoint_segments</varname> (<type>integer</type>)</term>
<indexterm>
! <primary><varname>checkpoint_segments</> configuration parameter</primary>
</indexterm>
<listitem>
<para>
! Maximum number of log file segments between automatic WAL
! checkpoints (each segment is normally 16 megabytes). The default
! is three segments. Increasing this parameter can increase the
! amount of time needed for crash recovery.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
--- 1958,1977 ----
<title>Checkpoints</title>
<variablelist>
! <varlistentry id="guc-checkpoint-wal-size" xreflabel="checkpoint_wal_size">
! <term><varname>checkpoint_wal_size</varname> (<type>integer</type>)</term>
<indexterm>
! <primary><varname>checkpoint_wal_size</> configuration parameter</primary>
</indexterm>
<listitem>
<para>
! Maximum size to let the WAL grow to between automatic WAL
! checkpoints. This is a soft limit; WAL size can exceed
! <varname>checkpoint_wal_size</> under special circumstances, like
! under heavy load, a failing <varname>archive_command</>, or a high
! <varname>wal_keep_segments</> setting. The default is 256 MB.
! Increasing this parameter can increase the amount of time needed for
! crash recovery.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
***************
*** 2028,2033 **** include 'filename'
--- 2031,2054 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-min-recycle-wal-size" xreflabel="min_recycle_wal_size">
+ <term><varname>min_recycle_wal_size</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>min_recycle_wal_size</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ As long as WAL disk usage stays below this setting, old WAL files are
+ always recycled for future use at a checkpoint, rather than removed.
+ This can be used to ensure that enough WAL space is reserved to
+ handle spikes in WAL usage, for example when running large batch
+ jobs. The default is 80 MB.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
*** a/doc/src/sgml/perform.sgml
--- b/doc/src/sgml/perform.sgml
***************
*** 1302,1320 **** SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
</para>
</sect2>
! <sect2 id="populate-checkpoint-segments">
! <title>Increase <varname>checkpoint_segments</varname></title>
<para>
! Temporarily increasing the <xref
! linkend="guc-checkpoint-segments"> configuration variable can also
make large data loads faster. This is because loading a large
amount of data into <productname>PostgreSQL</productname> will
cause checkpoints to occur more often than the normal checkpoint
frequency (specified by the <varname>checkpoint_timeout</varname>
configuration variable). Whenever a checkpoint occurs, all dirty
pages must be flushed to disk. By increasing
! <varname>checkpoint_segments</varname> temporarily during bulk
data loads, the number of checkpoints that are required can be
reduced.
</para>
--- 1302,1320 ----
</para>
</sect2>
! <sect2 id="populate-checkpoint-wal-size">
! <title>Increase <varname>checkpoint_wal_size</varname></title>
<para>
! Increasing the <xref
! linkend="guc-checkpoint-wal-size"> configuration variable can also
make large data loads faster. This is because loading a large
amount of data into <productname>PostgreSQL</productname> will
cause checkpoints to occur more often than the normal checkpoint
frequency (specified by the <varname>checkpoint_timeout</varname>
configuration variable). Whenever a checkpoint occurs, all dirty
pages must be flushed to disk. By increasing
! <varname>checkpoint_wal_size</varname> temporarily during bulk
data loads, the number of checkpoints that are required can be
reduced.
</para>
***************
*** 1419,1425 **** SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
<para>
Set appropriate (i.e., larger than normal) values for
<varname>maintenance_work_mem</varname> and
! <varname>checkpoint_segments</varname>.
</para>
</listitem>
<listitem>
--- 1419,1425 ----
<para>
Set appropriate (i.e., larger than normal) values for
<varname>maintenance_work_mem</varname> and
! <varname>checkpoint_wal_size</varname>.
</para>
</listitem>
<listitem>
***************
*** 1486,1492 **** SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
So when loading a data-only dump, it is up to you to drop and recreate
indexes and foreign keys if you wish to use those techniques.
! It's still useful to increase <varname>checkpoint_segments</varname>
while loading the data, but don't bother increasing
<varname>maintenance_work_mem</varname>; rather, you'd do that while
manually recreating indexes and foreign keys afterwards.
--- 1486,1492 ----
So when loading a data-only dump, it is up to you to drop and recreate
indexes and foreign keys if you wish to use those techniques.
! It's still useful to increase <varname>checkpoint_wal_size</varname>
while loading the data, but don't bother increasing
<varname>maintenance_work_mem</varname>; rather, you'd do that while
manually recreating indexes and foreign keys afterwards.
***************
*** 1542,1548 **** SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
<listitem>
<para>
! Increase <xref linkend="guc-checkpoint-segments"> and <xref
linkend="guc-checkpoint-timeout"> ; this reduces the frequency
of checkpoints, but increases the storage requirements of
<filename>/pg_xlog</>.
--- 1542,1548 ----
<listitem>
<para>
! Increase <xref linkend="guc-checkpoint-wal-size"> and <xref
linkend="guc-checkpoint-timeout"> ; this reduces the frequency
of checkpoints, but increases the storage requirements of
<filename>/pg_xlog</>.
*** a/doc/src/sgml/wal.sgml
--- b/doc/src/sgml/wal.sgml
***************
*** 471,479 ****
<para>
The server's checkpointer process automatically performs
a checkpoint every so often. A checkpoint is begun every <xref
! linkend="guc-checkpoint-segments"> log segments, or every <xref
! linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
! The default settings are 3 segments and 300 seconds (5 minutes), respectively.
If no WAL has been written since the previous checkpoint, new checkpoints
will be skipped even if <varname>checkpoint_timeout</> has passed.
(If WAL archiving is being used and you want to put a lower limit on how
--- 471,480 ----
<para>
The server's checkpointer process automatically performs
a checkpoint every so often. A checkpoint is begun every <xref
! linkend="guc-checkpoint-timeout"> seconds, or if
! <xref linkend="guc-checkpoint-wal-size"> is about to be exceeded, whichever
! comes first.
! The default settings are 5 minutes and 256 MB, respectively.
If no WAL has been written since the previous checkpoint, new checkpoints
will be skipped even if <varname>checkpoint_timeout</> has passed.
(If WAL archiving is being used and you want to put a lower limit on how
***************
*** 485,492 ****
</para>
<para>
! Reducing <varname>checkpoint_segments</varname> and/or
! <varname>checkpoint_timeout</varname> causes checkpoints to occur
more often. This allows faster after-crash recovery, since less work
will need to be redone. However, one must balance this against the
increased cost of flushing dirty data pages more often. If
--- 486,493 ----
</para>
<para>
! Reducing <varname>checkpoint_timeout</varname> and/or
! <varname>checkpoint_wal_size</varname> causes checkpoints to occur
more often. This allows faster after-crash recovery, since less work
will need to be redone. However, one must balance this against the
increased cost of flushing dirty data pages more often. If
***************
*** 509,519 ****
parameter. If checkpoints happen closer together than
<varname>checkpoint_warning</> seconds,
a message will be output to the server log recommending increasing
! <varname>checkpoint_segments</varname>. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such
as large <command>COPY</> transfers might cause a number of such warnings
! to appear if you have not set <varname>checkpoint_segments</> high
enough.
</para>
--- 510,520 ----
parameter. If checkpoints happen closer together than
<varname>checkpoint_warning</> seconds,
a message will be output to the server log recommending increasing
! <varname>checkpoint_wal_size</varname>. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such
as large <command>COPY</> transfers might cause a number of such warnings
! to appear if you have not set <varname>checkpoint_wal_size</> high
enough.
</para>
***************
*** 524,533 ****
<xref linkend="guc-checkpoint-completion-target">, which is
given as a fraction of the checkpoint interval.
The I/O rate is adjusted so that the checkpoint finishes when the
! given fraction of <varname>checkpoint_segments</varname> WAL segments
! have been consumed since checkpoint start, or the given fraction of
! <varname>checkpoint_timeout</varname> seconds have elapsed,
! whichever is sooner. With the default value of 0.5,
<productname>PostgreSQL</> can be expected to complete each checkpoint
in about half the time before the next checkpoint starts. On a system
that's very close to maximum I/O throughput during normal operation,
--- 525,534 ----
<xref linkend="guc-checkpoint-completion-target">, which is
given as a fraction of the checkpoint interval.
The I/O rate is adjusted so that the checkpoint finishes when the
! given fraction of
! <varname>checkpoint_timeout</varname> seconds have elapsed, or before
! <varname>checkpoint_wal_size</varname> is exceeded, whichever is sooner.
! With the default value of 0.5,
<productname>PostgreSQL</> can be expected to complete each checkpoint
in about half the time before the next checkpoint starts. On a system
that's very close to maximum I/O throughput during normal operation,
***************
*** 544,561 ****
</para>
<para>
! There will always be at least one WAL segment file, and will normally
! not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
! or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
! files. Each segment file is normally 16 MB (though this size can be
! altered when building the server). You can use this to estimate space
! requirements for <acronym>WAL</acronym>.
! Ordinarily, when old log segment files are no longer needed, they
! are recycled (that is, renamed to become future segments in the numbered
! sequence). If, due to a short-term peak of log output rate, there
! are more than 3 * <varname>checkpoint_segments</varname> + 1
! segment files, the unneeded segment files will be deleted instead
! of recycled until the system gets back under this limit.
</para>
<para>
--- 545,577 ----
</para>
<para>
! The number of WAL segment files in <filename>pg_xlog</> directory depends on
! <varname>checkpoint_wal_size</>, <varname>min_recycle_wal_size</> and the
! amount of WAL generated in previous checkpoint cycles. When old log
! segment files are no longer needed, they are removed or recycled (that is,
! renamed to become future segments in the numbered sequence). If, due to a
! short-term peak of log output rate, <varname>checkpoint_wal_size</> is
! exceeded, the unneeded segment files will be removed until the system
! gets back under this limit. Below that limit, the system recycles enough
! WAL files to cover the estimated need until the next checkpoint, and
! removes the rest. The estimate is based on a moving average of the number
! of WAL files used in previous checkpoint cycles. The moving average
! is increased immediately if the actual usage exceeds the estimate, so it
! accommodates peak usage rather than average usage to some extent.
! <varname>min_recycle_wal_size</> puts a minimum on the number of WAL files
! recycled for future usage; that much WAL is always recycled for future use,
! even if the system is idle and the WAL usage estimate suggests that little
! WAL is needed.
! </para>
!
! <para>
! Independently of <varname>checkpoint_wal_size</varname>,
! <xref linkend="guc-wal-keep-segments"> + 1 most recent WAL files are
! kept at all times. Also, if WAL archiving is used, old segments can not be
! removed or recycled until they are archived. If WAL archiving cannot keep up
! with the pace that WAL is generated, or if <varname>archive_command</varname>
! fails repeatedly, old WAL files will accumulate in <filename>pg_xlog</>
! until the situation is resolved.
</para>
<para>
***************
*** 570,578 ****
master because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at
least <varname>checkpoint_timeout</> seconds have passed since the last
! restartpoint. In standby mode, a restartpoint is also triggered if at
! least <varname>checkpoint_segments</> log segments have been replayed
! since the last restartpoint.
</para>
<para>
--- 586,593 ----
master because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at
least <varname>checkpoint_timeout</> seconds have passed since the last
! restartpoint, or if WAL size is about to exceed
! <varname>checkpoint_wal_size</>.
</para>
<para>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 71,77 **** extern uint32 bootstrap_data_checksum_version;
/* User-settable parameters */
! int CheckPointSegments = 3;
int wal_keep_segments = 0;
int XLOGbuffers = -1;
int XLogArchiveTimeout = 0;
--- 71,78 ----
/* User-settable parameters */
! int checkpoint_wal_size = 262144; /* 256 MB */
! int min_recycle_wal_size = 81920; /* 80 MB */
int wal_keep_segments = 0;
int XLOGbuffers = -1;
int XLogArchiveTimeout = 0;
***************
*** 86,108 **** int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int num_xloginsert_slots = 8;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
#endif
! /*
! * XLOGfileslop is the maximum number of preallocated future XLOG segments.
! * When we are done with an old XLOG segment file, we will recycle it as a
! * future XLOG segment as long as there aren't already XLOGfileslop future
! * segments; else we'll delete it. This could be made a separate GUC
! * variable, but at present I think it's sufficient to hardwire it as
! * 2*CheckPointSegments+1. Under normal conditions, a checkpoint will free
! * no more than 2*CheckPointSegments log segments, and we want to recycle all
! * of them; the +1 allows boundary cases to happen without wasting a
! * delete/create-segment cycle.
! */
! #define XLOGfileslop (2*CheckPointSegments + 1)
!
/*
* GUC support
--- 87,105 ----
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int num_xloginsert_slots = 8;
+ /*
+ * Max distance from last checkpoint, before triggering a new xlog-based
+ * checkpoint.
+ */
+ int CheckPointSegments;
+
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
#endif
! /* Estimated distance between checkpoints, in bytes */
! static double CheckPointDistanceEstimate = 0;
! static double PrevCheckPointDistance = 0;
/*
* GUC support
***************
*** 740,746 **** static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
static bool XLogCheckpointNeeded(XLogSegNo new_segno);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
! bool find_free, int *max_advance,
bool use_lock);
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
int source, bool notexistOk);
--- 737,743 ----
static bool XLogCheckpointNeeded(XLogSegNo new_segno);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
! bool find_free, XLogSegNo max_segno,
bool use_lock);
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
int source, bool notexistOk);
***************
*** 753,759 **** static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
! static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
--- 750,756 ----
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
! static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
***************
*** 2548,2553 **** AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
--- 2545,2653 ----
}
/*
+ * Calculate CheckPointSegments based on checkpoint_wal_size and
+ * checkpoint_completion_target.
+ */
+ static void
+ CalculateCheckpointSegments(void)
+ {
+ double target;
+
+ /*-------
+ * Calculate the distance at which to trigger a checkpoint, to avoid
+ * exceeding checkpoint_wal_size. This is based on two assumptions:
+ *
+ * a) we keep WAL for two checkpoint cycles, back to the "prev" checkpoint.
+ * b) during checkpoint, we consume checkpoint_completion_target *
+ * number of segments consumed between checkpoints.
+ *-------
+ */
+ target = (double ) checkpoint_wal_size / (double) (XLOG_SEG_SIZE / 1024);
+ target = target / (2.0 + CheckPointCompletionTarget);
+
+ /* round down */
+ CheckPointSegments = (int) target;
+
+ if (CheckPointSegments < 1)
+ CheckPointSegments = 1;
+ }
+
+ void
+ assign_checkpoint_wal_size(int newval, void *extra)
+ {
+ checkpoint_wal_size = newval;
+ CalculateCheckpointSegments();
+ }
+
+ void
+ assign_checkpoint_completion_target(double newval, void *extra)
+ {
+ CheckPointCompletionTarget = newval;
+ CalculateCheckpointSegments();
+ }
+
+ /*
+ * At a checkpoint, how many WAL segments to recycle as preallocated future
+ * XLOG segments? Returns the highest segment that should be preallocated.
+ */
+ static XLogSegNo
+ XLOGfileslop(XLogRecPtr PriorRedoPtr)
+ {
+ double nsegments;
+ XLogSegNo minSegNo;
+ XLogSegNo maxSegNo;
+ double distance;
+ XLogSegNo recycleSegNo;
+
+ /*
+ * Calculate the segment numbers that min_recycle_wal_size and
+ * checkpoint_wal_size correspond to. Always recycle enough segments
+ * to meet the minimum, and remove enough segments to stay below the
+ * maximum.
+ */
+ nsegments = (double) min_recycle_wal_size / (double) (XLOG_SEG_SIZE / 1024);
+ minSegNo = PriorRedoPtr / XLOG_SEG_SIZE + (int) nsegments;
+ nsegments = (double) checkpoint_wal_size / (double) (XLOG_SEG_SIZE / 1024);
+ maxSegNo = PriorRedoPtr / XLOG_SEG_SIZE + (int) nsegments;
+
+ /*
+ * Between those limits, recycle enough segments to get us through to the
+ * estimated end of next checkpoint.
+ *
+ * To estimate where the next checkpoint will finish, assume that the
+ * system runs steadily consuming CheckPointDistanceEstimate
+ * bytes between every checkpoint.
+ *
+ * The reason this calculation is done from the prior checkpoint, not the
+ * one that just finished, is that this behaves better if some checkpoint
+ * cycles are abnormally short, like if you perform a manual checkpoint
+ * right after a timed one. The manual checkpoint will make almost a full
+ * cycle's worth of WAL segments available for recycling, because the
+ * segments from the prior's prior, fully-sized checkpoint cycle are no
+ * longer needed. However, the next checkpoint will make only few segments
+ * available for recycling, the ones generated between the timed
+ * checkpoint and the manual one right after that. If at the manual
+ * checkpoint we only retained enough segments to get us to the next timed
+ * one, and removed the rest, then at the next checkpoint we would not
+ * have enough segments around for recycling, to get us to the checkpoint
+ * after that. Basing the calculations on the distance from the prior redo
+ * pointer largely fixes that problem.
+ */
+ distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
+ /* add 10% for good measure. */
+ distance *= 1.10;
+
+ recycleSegNo = (XLogSegNo) ceil(((double) PriorRedoPtr + distance) / XLOG_SEG_SIZE);
+
+ if (recycleSegNo < minSegNo)
+ recycleSegNo = minSegNo;
+ if (recycleSegNo > maxSegNo)
+ recycleSegNo = maxSegNo;
+
+ return recycleSegNo;
+ }
+
+ /*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
* new_segno indicates a log file that has just been filled up (or read
***************
*** 3345,3351 **** XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
char path[MAXPGPATH];
char tmppath[MAXPGPATH];
XLogSegNo installed_segno;
! int max_advance;
int fd;
bool zero_fill = true;
--- 3445,3451 ----
char path[MAXPGPATH];
char tmppath[MAXPGPATH];
XLogSegNo installed_segno;
! XLogSegNo max_segno;
int fd;
bool zero_fill = true;
***************
*** 3472,3480 **** XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
* pre-create a future log segment.
*/
installed_segno = logsegno;
! max_advance = XLOGfileslop;
if (!InstallXLogFileSegment(&installed_segno, tmppath,
! *use_existent, &max_advance,
use_lock))
{
/*
--- 3572,3590 ----
* pre-create a future log segment.
*/
installed_segno = logsegno;
!
! /*
! * XXX: What should we use as max_segno? We used to use XLOGfileslop when
! * that was a constant, but that was always a bit dubious: normally, at a
! * checkpoint, XLOGfileslop was the offset from the checkpoint record,
! * but here, it was the offset from the insert location. We can't do the
! * normal XLOGfileslop calculation here because we don't have access to
! * the prior checkpoint's redo location. So somewhat arbitrarily, just
! * use CheckPointSegments.
! */
! max_segno = logsegno + CheckPointSegments;
if (!InstallXLogFileSegment(&installed_segno, tmppath,
! *use_existent, max_segno,
use_lock))
{
/*
***************
*** 3597,3603 **** XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
/*
* Now move the segment into place with its final name.
*/
! if (!InstallXLogFileSegment(&destsegno, tmppath, false, NULL, false))
elog(ERROR, "InstallXLogFileSegment should not have failed");
}
--- 3707,3713 ----
/*
* Now move the segment into place with its final name.
*/
! if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
elog(ERROR, "InstallXLogFileSegment should not have failed");
}
***************
*** 3617,3638 **** XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
* number at or after the passed numbers. If FALSE, install the new segment
* exactly where specified, deleting any existing segment file there.
*
! * *max_advance: maximum number of segno slots to advance past the starting
! * point. Fail if no free slot is found in this range. On return, reduced
! * by the number of slots skipped over. (Irrelevant, and may be NULL,
! * when find_free is FALSE.)
*
* use_lock: if TRUE, acquire ControlFileLock while moving file into
* place. This should be TRUE except during bootstrap log creation. The
* caller must *not* hold the lock at call.
*
* Returns TRUE if the file was installed successfully. FALSE indicates that
! * max_advance limit was exceeded, or an error occurred while renaming the
* file into place.
*/
static bool
InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
! bool find_free, int *max_advance,
bool use_lock)
{
char path[MAXPGPATH];
--- 3727,3747 ----
* number at or after the passed numbers. If FALSE, install the new segment
* exactly where specified, deleting any existing segment file there.
*
! * max_segno: maximum segment number to install the new file as. Fail if no
! * free slot is found between *segno and max_segno. (Ignored when find_free
! * is FALSE.)
*
* use_lock: if TRUE, acquire ControlFileLock while moving file into
* place. This should be TRUE except during bootstrap log creation. The
* caller must *not* hold the lock at call.
*
* Returns TRUE if the file was installed successfully. FALSE indicates that
! * max_segno limit was exceeded, or an error occurred while renaming the
* file into place.
*/
static bool
InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
! bool find_free, XLogSegNo max_segno,
bool use_lock)
{
char path[MAXPGPATH];
***************
*** 3656,3662 **** InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
/* Find a free slot to put it in */
while (stat(path, &stat_buf) == 0)
{
! if (*max_advance <= 0)
{
/* Failed to find a free slot within specified range */
if (use_lock)
--- 3765,3771 ----
/* Find a free slot to put it in */
while (stat(path, &stat_buf) == 0)
{
! if ((*segno) >= max_segno)
{
/* Failed to find a free slot within specified range */
if (use_lock)
***************
*** 3664,3670 **** InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
return false;
}
(*segno)++;
- (*max_advance)--;
XLogFilePath(path, ThisTimeLineID, *segno);
}
}
--- 3773,3778 ----
***************
*** 3997,4010 **** UpdateLastRemovedPtr(char *filename)
/*
* Recycle or remove all log files older or equal to passed segno
*
! * endptr is current (or recent) end of xlog; this is used to determine
* whether we want to recycle rather than delete no-longer-wanted log files.
*/
static void
! RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
{
XLogSegNo endlogSegNo;
! int max_advance;
DIR *xldir;
struct dirent *xlde;
char lastoff[MAXFNAMELEN];
--- 4105,4119 ----
/*
* Recycle or remove all log files older or equal to passed segno
*
! * endptr is current (or recent) end of xlog, and PriorRedoRecPtr is the
! * redo pointer of the previous checkpoint. These are used to determine
* whether we want to recycle rather than delete no-longer-wanted log files.
*/
static void
! RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
{
XLogSegNo endlogSegNo;
! XLogSegNo recycleSegNo;
DIR *xldir;
struct dirent *xlde;
char lastoff[MAXFNAMELEN];
***************
*** 4016,4026 **** RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
struct stat statbuf;
/*
! * Initialize info about where to try to recycle to. We allow recycling
! * segments up to XLOGfileslop segments beyond the current XLOG location.
*/
XLByteToPrevSeg(endptr, endlogSegNo);
! max_advance = XLOGfileslop;
xldir = AllocateDir(XLOGDIR);
if (xldir == NULL)
--- 4125,4134 ----
struct stat statbuf;
/*
! * Initialize info about where to try to recycle to.
*/
XLByteToPrevSeg(endptr, endlogSegNo);
! recycleSegNo = XLOGfileslop(PriorRedoPtr);
xldir = AllocateDir(XLOGDIR);
if (xldir == NULL)
***************
*** 4069,4088 **** RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
* for example can create symbolic links pointing to a
* separate archive directory.
*/
! if (lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
InstallXLogFileSegment(&endlogSegNo, path,
! true, &max_advance, true))
{
ereport(DEBUG2,
(errmsg("recycled transaction log file \"%s\"",
xlde->d_name)));
CheckpointStats.ckpt_segs_recycled++;
/* Needn't recheck that slot on future iterations */
! if (max_advance > 0)
! {
! endlogSegNo++;
! max_advance--;
! }
}
else
{
--- 4177,4193 ----
* for example can create symbolic links pointing to a
* separate archive directory.
*/
! if (endlogSegNo <= recycleSegNo &&
! lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
InstallXLogFileSegment(&endlogSegNo, path,
! true, recycleSegNo, true))
{
ereport(DEBUG2,
(errmsg("recycled transaction log file \"%s\"",
xlde->d_name)));
CheckpointStats.ckpt_segs_recycled++;
/* Needn't recheck that slot on future iterations */
! endlogSegNo++;
}
else
{
***************
*** 7863,7869 **** LogCheckpointEnd(bool restartpoint)
elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
CheckpointStats.ckpt_bufs_written,
(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
CheckpointStats.ckpt_segs_added,
--- 7968,7975 ----
elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
! "distance=%d KB, estimate=%d KB",
CheckpointStats.ckpt_bufs_written,
(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
CheckpointStats.ckpt_segs_added,
***************
*** 7874,7885 **** LogCheckpointEnd(bool restartpoint)
total_secs, total_usecs / 1000,
CheckpointStats.ckpt_sync_rels,
longest_secs, longest_usecs / 1000,
! average_secs, average_usecs / 1000);
else
elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
CheckpointStats.ckpt_bufs_written,
(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
CheckpointStats.ckpt_segs_added,
--- 7980,7994 ----
total_secs, total_usecs / 1000,
CheckpointStats.ckpt_sync_rels,
longest_secs, longest_usecs / 1000,
! average_secs, average_usecs / 1000,
! (int) (PrevCheckPointDistance / 1024.0),
! (int) (CheckPointDistanceEstimate / 1024.0));
else
elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
! "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
! "distance=%d KB, estimate=%d KB",
CheckpointStats.ckpt_bufs_written,
(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
CheckpointStats.ckpt_segs_added,
***************
*** 7890,7896 **** LogCheckpointEnd(bool restartpoint)
total_secs, total_usecs / 1000,
CheckpointStats.ckpt_sync_rels,
longest_secs, longest_usecs / 1000,
! average_secs, average_usecs / 1000);
}
/*
--- 7999,8046 ----
total_secs, total_usecs / 1000,
CheckpointStats.ckpt_sync_rels,
longest_secs, longest_usecs / 1000,
! average_secs, average_usecs / 1000,
! (int) (PrevCheckPointDistance / 1024.0),
! (int) (CheckPointDistanceEstimate / 1024.0));
! }
!
! /*
! * Update the estimate of distance between checkpoints.
! *
! * The estimate is used to calculate the number of WAL segments to keep
! * preallocated, see XLOGfileslop().
! */
! static void
! UpdateCheckPointDistanceEstimate(uint64 nbytes)
! {
! /*
! * To estimate the number of segments consumed between checkpoints, keep
! * a moving average of the actual number of segments consumed in previous
! * checkpoint cycles. However, if the load is bursty, with quiet periods
! * and busy periods, we want to cater for the peak load. So instead of a
! * plain moving average, let the average decline slowly if the previous
! * cycle used less WAL than estimated, but bump it up immediately if it
! * used more.
! *
! * When checkpoints are triggered by checkpoint_wal_size, this should
! * converge to CheckPointSegments * XLOG_SEG_SIZE.
! *
! * Note: This doesn't pay any attention to what caused the checkpoint.
! * Checkpoints triggered manually with CHECKPOINT command, or by e.g
! * starting a base backup, are counted the same as those created
! * automatically. The slow-decline will largely mask them out, if they are
! * not frequent. If they are frequent, it seems reasonable to count them
! * in as any others; if you issue a manual checkpoint every 5 minutes and
! * never let a timed checkpoint happen, it makes sense to base the
! * preallocation on that 5 minute interval rather than whatever
! * checkpoint_timeout is set to.
! */
! PrevCheckPointDistance = nbytes;
! if (CheckPointDistanceEstimate < nbytes)
! CheckPointDistanceEstimate = nbytes;
! else
! CheckPointDistanceEstimate =
! (0.90 * CheckPointDistanceEstimate + 0.10 * (double) nbytes);
}
/*
***************
*** 7932,7938 **** CreateCheckPoint(int flags)
XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogRecData rdata;
uint32 freespace;
! XLogSegNo _logSegNo;
XLogRecPtr curInsert;
VirtualTransactionId *vxids;
int nvxids;
--- 8082,8088 ----
XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogRecData rdata;
uint32 freespace;
! XLogRecPtr PriorRedoPtr;
XLogRecPtr curInsert;
VirtualTransactionId *vxids;
int nvxids;
***************
*** 8237,8246 **** CreateCheckPoint(int flags)
(errmsg("concurrent transaction log activity while database system is shutting down")));
/*
! * Select point at which we can truncate the log, which we base on the
! * prior checkpoint's earliest info.
*/
! XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
/*
* Update the control file.
--- 8387,8396 ----
(errmsg("concurrent transaction log activity while database system is shutting down")));
/*
! * Remember the prior checkpoint's redo pointer, used later to determine
! * the point where the log can be truncated.
*/
! PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update the control file.
***************
*** 8294,8304 **** CreateCheckPoint(int flags)
* Delete old log files (those no longer needed even for previous
* checkpoint or the standbys in XLOG streaming).
*/
! if (_logSegNo)
{
KeepLogSeg(recptr, &_logSegNo);
_logSegNo--;
! RemoveOldXlogFiles(_logSegNo, recptr);
}
/*
--- 8444,8460 ----
* Delete old log files (those no longer needed even for previous
* checkpoint or the standbys in XLOG streaming).
*/
! if (PriorRedoPtr != InvalidXLogRecPtr)
{
+ XLogSegNo _logSegNo;
+
+ /* Update the average distance between checkpoints. */
+ UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
+
+ XLByteToSeg(PriorRedoPtr, _logSegNo);
KeepLogSeg(recptr, &_logSegNo);
_logSegNo--;
! RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, recptr);
}
/*
***************
*** 8486,8492 **** CreateRestartPoint(int flags)
{
XLogRecPtr lastCheckPointRecPtr;
CheckPoint lastCheckPoint;
! XLogSegNo _logSegNo;
TimestampTz xtime;
/* use volatile pointer to prevent code rearrangement */
--- 8642,8648 ----
{
XLogRecPtr lastCheckPointRecPtr;
CheckPoint lastCheckPoint;
! XLogRecPtr PriorRedoPtr;
TimestampTz xtime;
/* use volatile pointer to prevent code rearrangement */
***************
*** 8554,8560 **** CreateRestartPoint(int flags)
/*
* Update the shared RedoRecPtr so that the startup process can calculate
* the number of segments replayed since last restartpoint, and request a
! * restartpoint if it exceeds checkpoint_segments.
*
* Like in CreateCheckPoint(), hold off insertions to update it, although
* during recovery this is just pro forma, because no WAL insertions are
--- 8710,8716 ----
/*
* Update the shared RedoRecPtr so that the startup process can calculate
* the number of segments replayed since last restartpoint, and request a
! * restartpoint if it exceeds CheckPointSegments.
*
* Like in CreateCheckPoint(), hold off insertions to update it, although
* during recovery this is just pro forma, because no WAL insertions are
***************
*** 8585,8594 **** CreateRestartPoint(int flags)
CheckPointGuts(lastCheckPoint.redo, flags);
/*
! * Select point at which we can truncate the xlog, which we base on the
! * prior checkpoint's earliest info.
*/
! XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
/*
* Update pg_control, using current time. Check that it still shows
--- 8741,8750 ----
CheckPointGuts(lastCheckPoint.redo, flags);
/*
! * Remember the prior checkpoint's redo pointer, used later to determine
! * the point at which we can truncate the log.
*/
! PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update pg_control, using current time. Check that it still shows
***************
*** 8615,8626 **** CreateRestartPoint(int flags)
* checkpoint/restartpoint) to prevent the disk holding the xlog from
* growing full.
*/
! if (_logSegNo)
{
XLogRecPtr receivePtr;
XLogRecPtr replayPtr;
TimeLineID replayTLI;
XLogRecPtr endptr;
/*
* Get the current end of xlog replayed or received, whichever is
--- 8771,8785 ----
* checkpoint/restartpoint) to prevent the disk holding the xlog from
* growing full.
*/
! if (PriorRedoPtr != InvalidXLogRecPtr)
{
XLogRecPtr receivePtr;
XLogRecPtr replayPtr;
TimeLineID replayTLI;
XLogRecPtr endptr;
+ XLogSegNo _logSegNo;
+
+ XLByteToSeg(PriorRedoPtr, _logSegNo);
/*
* Get the current end of xlog replayed or received, whichever is
***************
*** 8649,8655 **** CreateRestartPoint(int flags)
if (RecoveryInProgress())
ThisTimeLineID = replayTLI;
! RemoveOldXlogFiles(_logSegNo, endptr);
/*
* Make more log segments if needed. (Do this after recycling old log
--- 8808,8814 ----
if (RecoveryInProgress())
ThisTimeLineID = replayTLI;
! RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, endptr);
/*
* Make more log segments if needed. (Do this after recycling old log
*** a/src/backend/postmaster/checkpointer.c
--- b/src/backend/postmaster/checkpointer.c
***************
*** 482,488 **** CheckpointerMain(void)
"checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs,
elapsed_secs),
! errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
/*
* Initialize checkpointer-private variables used during
--- 482,488 ----
"checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs,
elapsed_secs),
! errhint("Consider increasing the configuration parameter \"checkpoint_wal_size\".")));
/*
* Initialize checkpointer-private variables used during
***************
*** 760,770 **** IsCheckpointOnSchedule(double progress)
return false;
/*
! * Check progress against WAL segments written and checkpoint_segments.
*
* We compare the current WAL insert location against the location
* computed before calling CreateCheckPoint. The code in XLogInsert that
! * actually triggers a checkpoint when checkpoint_segments is exceeded
* compares against RedoRecptr, so this is not completely accurate.
* However, it's good enough for our purposes, we're only calculating an
* estimate anyway.
--- 760,770 ----
return false;
/*
! * Check progress against WAL segments written and CheckPointSegments.
*
* We compare the current WAL insert location against the location
* computed before calling CreateCheckPoint. The code in XLogInsert that
! * actually triggers a checkpoint when CheckPointSegments is exceeded
* compares against RedoRecptr, so this is not completely accurate.
* However, it's good enough for our purposes, we're only calculating an
* estimate anyway.
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 1981,1996 **** static struct config_int ConfigureNamesInt[] =
},
{
! {"checkpoint_segments", PGC_SIGHUP, WAL_CHECKPOINTS,
! gettext_noop("Sets the maximum distance in log segments between automatic WAL checkpoints."),
! NULL
},
! &CheckPointSegments,
! 3, 1, INT_MAX,
NULL, NULL, NULL
},
{
{"checkpoint_timeout", PGC_SIGHUP, WAL_CHECKPOINTS,
gettext_noop("Sets the maximum time between automatic WAL checkpoints."),
NULL,
--- 1981,2008 ----
},
{
! {"min_recycle_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
! gettext_noop("Sets the minimum size to shrink the WAL to."),
! NULL,
! GUC_UNIT_KB
},
! &min_recycle_wal_size,
! 81920, 32768, INT_MAX,
NULL, NULL, NULL
},
{
+ {"checkpoint_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Sets the maximum WAL size that triggers a checkpoint."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &checkpoint_wal_size,
+ 262144, 32768, INT_MAX,
+ NULL, assign_checkpoint_wal_size, NULL
+ },
+
+ {
{"checkpoint_timeout", PGC_SIGHUP, WAL_CHECKPOINTS,
gettext_noop("Sets the maximum time between automatic WAL checkpoints."),
NULL,
***************
*** 2573,2579 **** static struct config_real ConfigureNamesReal[] =
},
&CheckPointCompletionTarget,
0.5, 0.0, 1.0,
! NULL, NULL, NULL
},
/* End-of-list marker */
--- 2585,2591 ----
},
&CheckPointCompletionTarget,
0.5, 0.0, 1.0,
! NULL, assign_checkpoint_completion_target, NULL
},
/* End-of-list marker */
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 181,187 **** extern XLogRecPtr XactLastRecEnd;
extern bool reachedConsistency;
/* these variables are GUC parameters related to XLOG */
! extern int CheckPointSegments;
extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
--- 181,188 ----
extern bool reachedConsistency;
/* these variables are GUC parameters related to XLOG */
! extern int min_recycle_wal_size;
! extern int checkpoint_wal_size;
extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
***************
*** 192,197 **** extern bool fullPageWrites;
--- 193,200 ----
extern bool log_checkpoints;
extern int num_xloginsert_slots;
+ extern int CheckPointSegments;
+
/* WAL levels */
typedef enum WalLevel
{
***************
*** 319,324 **** extern bool CheckPromoteSignal(void);
--- 322,330 ----
extern void WakeupRecovery(void);
extern void SetWalWriterSleeping(bool sleeping);
+ extern void assign_checkpoint_wal_size(int newval, void *extra);
+ extern void assign_checkpoint_completion_target(double newval, void *extra);
+
/*
* Starting/stopping a base backup
*/
On 08/23/2013 02:08 PM, Heikki Linnakangas wrote:
Here's a bigger patch, which does more. It is based on the ideas in the
post I started this thread with, with feedback incorporated from the
long discussion. With this patch, WAL disk space usage is controlled by
two GUCs:
min_recycle_wal_size
checkpoint_wal_size
<snip>
These settings are fairly intuitive for a DBA to tune. You begin by
figuring out how much disk space you can afford to spend on WAL, and set
checkpoint_wal_size to that (with some safety margin, of course). Then
you set checkpoint_timeout based on how long you're willing to wait for
recovery to finish. Finally, if you have infrequent batch jobs that need
a lot more WAL than the system otherwise needs, you can set
min_recycle_wal_size to keep enough WAL preallocated for the spikes.
We'll want to rename them to make it even *more* intuitive.
But ... do I understand things correctly that checkpoint wouldn't "kick
in" until you hit checkpoint_wal_size? If that's the case, isn't real
disk space usage around 2X checkpoint_wal_size if spread checkpoint is
set to 0.9? Or does checkpoint kick in sometime earlier?
except that it's more
intuitive to set it in terms of "MB of WAL space required", instead of
"# of segments between checkpoints".
Yes, it certainly is. We'll need to caution people that fractions of
16MB will be ignored.
Does that make sense? I'd love to hear feedback on how people setting up
production databases would like to tune these things. The reason for the
auto-tuning between the min and max is to be able to set reasonable
defaults e.g for embedded systems that don't have a DBA to do tuning.
Currently, it's very difficult to come up with a reasonable default
value for checkpoint_segments which would work well for a wide range of
systems. The PostgreSQL default of 3 is way way too low for most
systems. On the other hand, if you set it to, say, 20, that's a lot of
wasted space for a small database that's not updated much. With this
patch, you can set "checkpoint_wal_size=1GB" and if the database ends up
actually only needing 100 MB of WAL, it will only use that much and not
waste 900 MB for useless preallocated WAL files.
This sounds good, aside from the potential 2X issue I mention above.
Mind you, what admins really want is a hard limit on WAL size, so that
they can create a partition and not worry about PG running out of WAL
space. But ...
Making it a hard limit is a much bigger task than I'm willing to tackle
right now.
... agreed. And this approach could be built on for a hard limit later on.
As a note, pgBench would be a terrible test for this patch; we really
need something which creates uneven traffic. I'll see if I can devise
something.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Sat, Aug 24, 2013 at 2:38 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 03.07.2013 21:28, Peter Eisentraut wrote:
On 6/6/13 4:09 PM, Heikki Linnakangas wrote:
Here's a patch implementing that. Docs not updated yet. I did not change
the way checkpoint_segments triggers checkpoints - that can be a
separate patch. This only decouples the segment preallocation behavior
from checkpoint_segments. With the patch, you can set
checkpoint_segments really high, without consuming that much disk space
all the time.
I don't understand what this patch, by itself, will accomplish in terms
of the originally stated goals of making checkpoint_segments easier to
tune, and controlling disk space used. To some degree, it makes both of
these things worse, because you can no longer use checkpoint_segments to
control the disk space. Instead, it is replaced by magic.
The patch addressed the third point in my first post:
A third point is that even if you have 10 GB of disk space reserved
for WAL, you don't want to actually consume all that 10 GB, if it's
not required to run the database smoothly. There are several reasons
for that: backups based on a filesystem-level snapshot are larger
than necessary, if there are a lot of preallocated WAL segments and
in a virtualized or shared system, there might be other VMs or
applications that could make use of the disk space. On the other
hand, you don't want to run out of disk space while writing WAL -
that can lead to a PANIC in the worst case.
What sort of behavior are you expecting to come out of this? In testing,
I didn't see much of a difference. Although I'd expect that this would
actually preallocate fewer segments than the old formula.
For example, suppose you set checkpoint_segments to 200, and you temporarily
generate 100 segments of WAL during an initial data load, but the normal
workload generates only 20 segments between checkpoints. Without the patch,
you will permanently have about 120 segments in pg_xlog, created by the
spike. With the patch, the extra segments will be gradually removed after
the data load, down to the level needed by the constant workload. That would
be about 50 segments (20 segments per cycle x (2 + 0.5)), assuming the default
checkpoint_completion_target=0.5.
Here's a bigger patch, which does more. It is based on the ideas in the post
I started this thread with, with feedback incorporated from the long
discussion. With this patch, WAL disk space usage is controlled by two GUCs:
min_recycle_wal_size
checkpoint_wal_size
I think it will be helpful for users to configure this in terms of WAL size
rather than number of segments, and your idea of keeping the WAL size under
control will be helpful to users.
These GUCs act as soft minimum and maximum on overall WAL size. At each
checkpoint, the checkpointer removes enough old WAL files to keep pg_xlog
usage below checkpoint_wal_size, and recycles enough new WAL files to reach
min_recycle_wal_size. Between those limits, there is a self-tuning mechanism
to recycle just enough WAL files to get to end of the next checkpoint
without running out of preallocated WAL files. To estimate how many files
are needed for that, a moving average of how much WAL is generated between
checkpoints is calculated. The moving average is updated with "fast-rise
slow-decline" behavior, to cater for peak rather than true average use to
some extent.
As today, checkpoints are triggered based on time or WAL usage, whichever
comes first. WAL-based checkpoints are triggered based on the good old
formula: CheckPointSegments = (checkpoint_wal_size / (2.0 +
checkpoint_completion_target)) / 16MB. CheckPointSegments controls that like
before, but it is now an internal variable derived from checkpoint_wal_size,
not visible to users.
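For concreteness, plugging the patch's defaults (checkpoint_wal_size = 128 MB,
checkpoint_completion_target = 0.5) into that formula:
    CheckPointSegments = (128 MB / (2.0 + 0.5)) / 16 MB
                       = 51.2 MB / 16 MB
                       = 3.2, rounded down to 3
which works out to the same value as the old checkpoint_segments default of 3.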
a.
In XLogFileInit(),
/*
! * XXX: What should we use as max_segno? We used to use XLOGfileslop when
! * that was a constant, but that was always a bit dubious: normally, at a
! * checkpoint, XLOGfileslop was the offset from the checkpoint record,
! * but here, it was the offset from the insert location. We can't do the
! * normal XLOGfileslop calculation here because we don't have access to
! * the prior checkpoint's redo location. So somewhat arbitrarily, just
! * use CheckPointSegments.
! */
! max_segno = logsegno + CheckPointSegments;
if (!InstallXLogFileSegment(&installed_segno, tmppath,
! *use_existent, max_segno,
use_lock))
Earlier, max_advance was the same whether InstallXLogFileSegment was called
from RemoveOldXlogFiles() or from XLogFileInit(), but now the two values will
be different (and there seems to be no direct relation between them). Will
that be okay in the scenario where someone else has created the file while
this function was filling it, since the file then needs to be installed as a
future segment, which is decided based on max_segno?
b. Does CreateRestartPoint need to update CheckPointDistanceEstimate? When it
tries to remove old xlog files, it needs recycleSegNo, which is calculated
using CheckPointDistanceEstimate.
c. New variables are not present in postgresql.conf after initdb.
These settings are fairly intuitive for a DBA to tune. You begin by figuring
out how much disk space you can afford to spend on WAL, and set
checkpoint_wal_size to that (with some safety margin, of course). Then you
set checkpoint_timeout based on how long you're willing to wait for recovery
to finish. Finally, if you have infrequent batch jobs that need a lot more
WAL than the system otherwise needs, you can set min_recycle_wal_size to
keep enough WAL preallocated for the spikes.
You can also set min_recycle_wal_size = checkpoint_wal_size, which gets you
the same behavior as without the patch, except that it's more intuitive to
set it in terms of "MB of WAL space required", instead of "# of segments
between checkpoints".Does that make sense? I'd love to hear feedback on how people setting up
production databases would like to tune these things. The reason for the
auto-tuning between the min and max is to be able to set reasonable defaults
e.g for embedded systems that don't have a DBA to do tuning. Currently, it's
very difficult to come up with a reasonable default value for
checkpoint_segments which would work well for a wide range of systems. The
PostgreSQL default of 3 is way way too low for most systems. On the other
hand, if you set it to, say, 20, that's a lot of wasted space for a small
database that's not updated much. With this patch, you can set
"max_wal_size=1GB" and if the database ends up actually only needing 100 MB
of WAL, it will only use that much and not waste 900 MB for useless
preallocated WAL files.
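As a rough illustration of that recipe (invented numbers, not recommendations),
a server with a few gigabytes to spare for pg_xlog and a nightly batch load
might end up with something like this in postgresql.conf:
    checkpoint_wal_size = 4GB       # pg_xlog budget, with headroom left on the partition
    checkpoint_timeout = 10min      # bounds how much WAL crash recovery may have to replay
    min_recycle_wal_size = 512MB    # WAL kept preallocated for the nightly spikes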
As a developer, I would love to have a configuration knob such as
min_recycle_wal_size, but I'm not sure how many users will be comfortable
setting this value. Actually, the few users I have talked to about this
earlier are interested in setting a max WAL size, which allows them to put an
upper limit on the space required by WAL.
Can't we do the calculation of files to recycle based only on
CheckPointDistanceEstimate?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 24, 2013 at 12:08:30AM +0300, Heikki Linnakangas wrote:
You can also set min_recycle_wal_size = checkpoint_wal_size, which
gets you the same behavior as without the patch, except that it's
more intuitive to set it in terms of "MB of WAL space required",
instead of "# of segments between checkpoints".Does that make sense? I'd love to hear feedback on how people
setting up production databases would like to tune these things. The
reason for the auto-tuning between the min and max is to be able to
set reasonable defaults e.g for embedded systems that don't have a
DBA to do tuning. Currently, it's very difficult to come up with a
reasonable default value for checkpoint_segments which would work
well for a wide range of systems. The PostgreSQL default of 3 is way
way too low for most systems. On the other hand, if you set it to,
say, 20, that's a lot of wasted space for a small database that's
not updated much. With this patch, you can set "max_wal_size=1GB"
and if the database ends up actually only needing 100 MB of WAL, it
will only use that much and not waste 900 MB for useless
preallocated WAL files.
Where are we on this?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
(reviving an old thread)
On 08/24/2013 12:53 AM, Josh Berkus wrote:
On 08/23/2013 02:08 PM, Heikki Linnakangas wrote:
Here's a bigger patch, which does more. It is based on the ideas in the
post I started this thread with, with feedback incorporated from the
long discussion. With this patch, WAL disk space usage is controlled by
two GUCs:
min_recycle_wal_size
checkpoint_wal_size
<snip>
These settings are fairly intuitive for a DBA to tune. You begin by
figuring out how much disk space you can afford to spend on WAL, and set
checkpoint_wal_size to that (with some safety margin, of course). Then
you set checkpoint_timeout based on how long you're willing to wait for
recovery to finish. Finally, if you have infrequent batch jobs that need
a lot more WAL than the system otherwise needs, you can set
min_recycle_wal_size to keep enough WAL preallocated for the spikes.
We'll want to rename them to make it even *more* intuitive.
Sure, I'm all ears.
But ... do I understand things correctly that checkpoint wouldn't "kick
in" until you hit checkpoint_wal_size? If that's the case, isn't real
disk space usage around 2X checkpoint_wal_size if spread checkpoint is
set to 0.9? Or does checkpoint kick in sometime earlier?
It kicks in earlier, so that the checkpoint *completes* just when
checkpoint_wal_size of WAL is used up. So the real disk usage is
checkpoint_wal_size.
There is still an internal variable called CheckPointSegments that
triggers the checkpoint, but it is now derived from checkpoint_wal_size
(see CalculateCheckpointSegments function):
CheckPointSegments = (checkpoint_wal_size / 16 MB) / (2 +
checkpoint_completion_target)
This is the same formula we've always had in the manual for calculating
the amount of WAL space used, but in reverse. I.e. we calculate
CheckPointSegments based on the desired disk space usage, not the other
way round.
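For example, with checkpoint_wal_size = 1 GB and the 0.9 completion target
from the question above (illustrative numbers only):
    CheckPointSegments = (1024 MB / 16 MB) / (2.0 + 0.9) = 64 / 2.9 ~ 22
    peak pg_xlog       ~ (2.0 + 0.9) * 22 * 16 MB         ~ 1 GB
so the spread checkpoint starts after roughly 22 new segments and finishes
with total WAL usage around checkpoint_wal_size, not 2X it.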
As a note, pgBench would be a terrible test for this patch; we really
need something which creates uneven traffic. I'll see if I can devise
something.
Attached is a rebased version of this patch. Everyone, please try this
out on whatever workloads you have, and let me know:
a) How does the auto-tuning of the number of recycled segments work?
Does pg_xlog reach a steady-state size, or does it fluctuate a lot?
b) Are the two GUCs, checkpoint_wal_size, and min_recycle_wal_size,
intuitive to set?
- Heikki
Attachments:
redesign-checkpoint-segments-2.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bcb106..34f9466 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1325,7 +1325,7 @@ include_dir 'conf.d'
40% of RAM to <varname>shared_buffers</varname> will work better than a
smaller amount. Larger settings for <varname>shared_buffers</varname>
usually require a corresponding increase in
- <varname>checkpoint_segments</varname>, in order to spread out the
+ <varname>checkpoint_wal_size</varname>, in order to spread out the
process of writing large quantities of new or changed data over a
longer period of time.
</para>
@@ -2394,18 +2394,21 @@ include_dir 'conf.d'
<title>Checkpoints</title>
<variablelist>
- <varlistentry id="guc-checkpoint-segments" xreflabel="checkpoint_segments">
- <term><varname>checkpoint_segments</varname> (<type>integer</type>)
+ <varlistentry id="guc-checkpoint-wal-size" xreflabel="checkpoint_wal_size">
+ <term><varname>checkpoint_wal_size</varname> (<type>integer</type>)</term>
<indexterm>
- <primary><varname>checkpoint_segments</> configuration parameter</primary>
+ <primary><varname>checkpoint_wal_size</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
- Maximum number of log file segments between automatic WAL
- checkpoints (each segment is normally 16 megabytes). The default
- is three segments. Increasing this parameter can increase the
- amount of time needed for crash recovery.
+ Maximum size to let the WAL grow to between automatic WAL
+ checkpoints. This is a soft limit; WAL size can exceed
+ <varname>checkpoint_wal_size</> under special circumstances, like
+ under heavy load, a failing <varname>archive_command</>, or a high
+ <varname>wal_keep_segments</> setting. The default is 128 MB.
+ Increasing this parameter can increase the amount of time needed for
+ crash recovery.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
@@ -2458,7 +2461,7 @@ include_dir 'conf.d'
Write a message to the server log if checkpoints caused by
the filling of checkpoint segment files happen closer together
than this many seconds (which suggests that
- <varname>checkpoint_segments</> ought to be raised). The default is
+ <varname>checkpoint_wal_size</> ought to be raised). The default is
30 seconds (<literal>30s</>). Zero disables the warning.
No warnings will be generated if <varname>checkpoint_timeout</varname>
is less than <varname>checkpoint_warning</varname>.
@@ -2468,6 +2471,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-min-recycle-wal-size" xreflabel="min_recycle_wal_size">
+ <term><varname>min_recycle_wal_size</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>min_recycle_wal_size</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ As long as WAL disk usage stays below this setting, old WAL files are
+ always recycled for future use at a checkpoint, rather than removed.
+ This can be used to ensure that enough WAL space is reserved to
+ handle spikes in WAL usage, for example when running large batch
+ jobs. The default is 80 MB.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 5a087fb..24022d9 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1328,19 +1328,19 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
</para>
</sect2>
- <sect2 id="populate-checkpoint-segments">
- <title>Increase <varname>checkpoint_segments</varname></title>
+ <sect2 id="populate-checkpoint-wal-size">
+ <title>Increase <varname>checkpoint_wal_size</varname></title>
<para>
Temporarily increasing the <xref
- linkend="guc-checkpoint-segments"> configuration variable can also
+ linkend="guc-checkpoint-wal-size"> configuration variable can also
make large data loads faster. This is because loading a large
amount of data into <productname>PostgreSQL</productname> will
cause checkpoints to occur more often than the normal checkpoint
frequency (specified by the <varname>checkpoint_timeout</varname>
configuration variable). Whenever a checkpoint occurs, all dirty
pages must be flushed to disk. By increasing
- <varname>checkpoint_segments</varname> temporarily during bulk
+ <varname>checkpoint_wal_size</varname> temporarily during bulk
data loads, the number of checkpoints that are required can be
reduced.
</para>
@@ -1445,7 +1445,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
<para>
Set appropriate (i.e., larger than normal) values for
<varname>maintenance_work_mem</varname> and
- <varname>checkpoint_segments</varname>.
+ <varname>checkpoint_wal_size</varname>.
</para>
</listitem>
<listitem>
@@ -1512,7 +1512,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
So when loading a data-only dump, it is up to you to drop and recreate
indexes and foreign keys if you wish to use those techniques.
- It's still useful to increase <varname>checkpoint_segments</varname>
+ It's still useful to increase <varname>checkpoint_wal_size</varname>
while loading the data, but don't bother increasing
<varname>maintenance_work_mem</varname>; rather, you'd do that while
manually recreating indexes and foreign keys afterwards.
@@ -1577,7 +1577,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
<listitem>
<para>
- Increase <xref linkend="guc-checkpoint-segments"> and <xref
+ Increase <xref linkend="guc-checkpoint-wal-size"> and <xref
linkend="guc-checkpoint-timeout"> ; this reduces the frequency
of checkpoints, but increases the storage requirements of
<filename>/pg_xlog</>.
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 1254c03..6cf7772 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -472,9 +472,10 @@
<para>
The server's checkpointer process automatically performs
a checkpoint every so often. A checkpoint is begun every <xref
- linkend="guc-checkpoint-segments"> log segments, or every <xref
- linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
- The default settings are 3 segments and 300 seconds (5 minutes), respectively.
+ linkend="guc-checkpoint-timeout"> seconds, or if
+ <xref linkend="guc-checkpoint-wal-size"> is about to be exceeded, whichever
+ comes first.
+ The default settings are 5 minutes and 128 MB, respectively.
If no WAL has been written since the previous checkpoint, new checkpoints
will be skipped even if <varname>checkpoint_timeout</> has passed.
(If WAL archiving is being used and you want to put a lower limit on how
@@ -486,8 +487,8 @@
</para>
<para>
- Reducing <varname>checkpoint_segments</varname> and/or
- <varname>checkpoint_timeout</varname> causes checkpoints to occur
+ Reducing <varname>checkpoint_timeout</varname> and/or
+ <varname>checkpoint_wal_size</varname> causes checkpoints to occur
more often. This allows faster after-crash recovery, since less work
will need to be redone. However, one must balance this against the
increased cost of flushing dirty data pages more often. If
@@ -510,11 +511,11 @@
parameter. If checkpoints happen closer together than
<varname>checkpoint_warning</> seconds,
a message will be output to the server log recommending increasing
- <varname>checkpoint_segments</varname>. Occasional appearance of such
+ <varname>checkpoint_wal_size</varname>. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such
as large <command>COPY</> transfers might cause a number of such warnings
- to appear if you have not set <varname>checkpoint_segments</> high
+ to appear if you have not set <varname>checkpoint_wal_size</> high
enough.
</para>
@@ -525,10 +526,10 @@
<xref linkend="guc-checkpoint-completion-target">, which is
given as a fraction of the checkpoint interval.
The I/O rate is adjusted so that the checkpoint finishes when the
- given fraction of <varname>checkpoint_segments</varname> WAL segments
- have been consumed since checkpoint start, or the given fraction of
- <varname>checkpoint_timeout</varname> seconds have elapsed,
- whichever is sooner. With the default value of 0.5,
+ given fraction of
+ <varname>checkpoint_timeout</varname> seconds have elapsed, or before
+ <varname>checkpoint_wal_size</varname> is exceeded, whichever is sooner.
+ With the default value of 0.5,
<productname>PostgreSQL</> can be expected to complete each checkpoint
in about half the time before the next checkpoint starts. On a system
that's very close to maximum I/O throughput during normal operation,
@@ -545,18 +546,33 @@
</para>
<para>
- There will always be at least one WAL segment file, and will normally
- not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
- or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
- files. Each segment file is normally 16 MB (though this size can be
- altered when building the server). You can use this to estimate space
- requirements for <acronym>WAL</acronym>.
- Ordinarily, when old log segment files are no longer needed, they
- are recycled (that is, renamed to become future segments in the numbered
- sequence). If, due to a short-term peak of log output rate, there
- are more than 3 * <varname>checkpoint_segments</varname> + 1
- segment files, the unneeded segment files will be deleted instead
- of recycled until the system gets back under this limit.
+ The number of WAL segment files in <filename>pg_xlog</> directory depends on
+ <varname>checkpoint_wal_size</>, <varname>min_recycle_wal_size</> and the
+ amount of WAL generated in previous checkpoint cycles. When old log
+ segment files are no longer needed, they are removed or recycled (that is,
+ renamed to become future segments in the numbered sequence). If, due to a
+ short-term peak of log output rate, <varname>checkpoint_wal_size</> is
+ exceeded, the unneeded segment files will be removed until the system
+ gets back under this limit. Below that limit, the system recycles enough
+ WAL files to cover the estimated need until the next checkpoint, and
+ removes the rest. The estimate is based on a moving average of the number
+ of WAL files used in previous checkpoint cycles. The moving average
+ is increased immediately if the actual usage exceeds the estimate, so it
+ accommodates peak usage rather than average usage to some extent.
+ <varname>min_recycle_wal_size</> puts a minimum on the amount of WAL files
+ recycled for future usage; that much WAL is always recycled for future use,
+ even if the system is idle and the WAL usage estimate suggests that little
+ WAL is needed.
+ </para>
+
+ <para>
+ Independently of <varname>checkpoint_wal_size</varname>,
+ <xref linkend="guc-wal-keep-segments"> + 1 most recent WAL files are
+ kept at all times. Also, if WAL archiving is used, old segments can not be
+ removed or recycled until they are archived. If WAL archiving cannot keep up
+ with the pace that WAL is generated, or if <varname>archive_command</varname>
+ fails repeatedly, old WAL files will accumulate in <filename>pg_xlog</>
+ until the situation is resolved.
</para>
<para>
@@ -571,9 +587,8 @@
master because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at
least <varname>checkpoint_timeout</> seconds have passed since the last
- restartpoint. In standby mode, a restartpoint is also triggered if at
- least <varname>checkpoint_segments</> log segments have been replayed
- since the last restartpoint.
+ restartpoint, or if WAL size is about to exceed
+ <varname>checkpoint_wal_size</>.
</para>
<para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e5dddd4..07aa92b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,7 +79,8 @@ extern uint32 bootstrap_data_checksum_version;
/* User-settable parameters */
-int CheckPointSegments = 3;
+int checkpoint_wal_size = 131072; /* 128 MB */
+int min_recycle_wal_size = 81920; /* 80 MB */
int wal_keep_segments = 0;
int XLOGbuffers = -1;
int XLogArchiveTimeout = 0;
@@ -106,18 +107,14 @@ bool XLOG_DEBUG = false;
#define NUM_XLOGINSERT_LOCKS 8
/*
- * XLOGfileslop is the maximum number of preallocated future XLOG segments.
- * When we are done with an old XLOG segment file, we will recycle it as a
- * future XLOG segment as long as there aren't already XLOGfileslop future
- * segments; else we'll delete it. This could be made a separate GUC
- * variable, but at present I think it's sufficient to hardwire it as
- * 2*CheckPointSegments+1. Under normal conditions, a checkpoint will free
- * no more than 2*CheckPointSegments log segments, and we want to recycle all
- * of them; the +1 allows boundary cases to happen without wasting a
- * delete/create-segment cycle.
+ * Max distance from last checkpoint, before triggering a new xlog-based
+ * checkpoint.
*/
-#define XLOGfileslop (2*CheckPointSegments + 1)
+int CheckPointSegments;
+/* Estimated distance between checkpoints, in bytes */
+static double CheckPointDistanceEstimate = 0;
+static double PrevCheckPointDistance = 0;
/*
* GUC support
@@ -778,7 +775,7 @@ static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
static bool XLogCheckpointNeeded(XLogSegNo new_segno);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
- bool find_free, int *max_advance,
+ bool find_free, XLogSegNo max_segno,
bool use_lock);
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
int source, bool notexistOk);
@@ -791,7 +788,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
+static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
@@ -1958,6 +1955,109 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
}
/*
+ * Calculate CheckPointSegments based on checkpoint_wal_size and
+ * checkpoint_completion_target.
+ */
+static void
+CalculateCheckpointSegments(void)
+{
+ double target;
+
+ /*-------
+ * Calculate the distance at which to trigger a checkpoint, to avoid
+ * exceeding checkpoint_wal_size. This is based on two assumptions:
+ *
+ * a) we keep WAL for two checkpoint cycles, back to the "prev" checkpoint.
+ * b) during checkpoint, we consume checkpoint_completion_target *
+ * number of segments consumed between checkpoints.
+ *-------
+ */
+ target = (double ) checkpoint_wal_size / (double) (XLOG_SEG_SIZE / 1024);
+ target = target / (2.0 + CheckPointCompletionTarget);
+
+ /* round down */
+ CheckPointSegments = (int) target;
+
+ if (CheckPointSegments < 1)
+ CheckPointSegments = 1;
+}
+
+void
+assign_checkpoint_wal_size(int newval, void *extra)
+{
+ checkpoint_wal_size = newval;
+ CalculateCheckpointSegments();
+}
+
+void
+assign_checkpoint_completion_target(double newval, void *extra)
+{
+ CheckPointCompletionTarget = newval;
+ CalculateCheckpointSegments();
+}
+
+/*
+ * At a checkpoint, how many WAL segments to recycle as preallocated future
+ * XLOG segments? Returns the highest segment that should be preallocated.
+ */
+static XLogSegNo
+XLOGfileslop(XLogRecPtr PriorRedoPtr)
+{
+ double nsegments;
+ XLogSegNo minSegNo;
+ XLogSegNo maxSegNo;
+ double distance;
+ XLogSegNo recycleSegNo;
+
+ /*
+ * Calculate the segment numbers that min_recycle_wal_size and
+ * checkpoint_wal_size correspond to. Always recycle enough segments
+ * to meet the minimum, and remove enough segments to stay below the
+ * maximum.
+ */
+ nsegments = (double) min_recycle_wal_size / (double) (XLOG_SEG_SIZE / 1024);
+ minSegNo = PriorRedoPtr / XLOG_SEG_SIZE + (int) nsegments - 1;
+ nsegments = (double) checkpoint_wal_size / (double) (XLOG_SEG_SIZE / 1024);
+ maxSegNo = PriorRedoPtr / XLOG_SEG_SIZE + (int) nsegments - 1;
+
+ /*
+ * Between those limits, recycle enough segments to get us through to the
+ * estimated end of next checkpoint.
+ *
+ * To estimate where the next checkpoint will finish, assume that the
+ * system runs steadily consuming CheckPointDistanceEstimate
+ * bytes between every checkpoint.
+ *
+ * The reason this calculation is done from the prior checkpoint, not the
+ * one that just finished, is that this behaves better if some checkpoint
+ * cycles are abnormally short, like if you perform a manual checkpoint
+ * right after a timed one. The manual checkpoint will make almost a full
+ * cycle's worth of WAL segments available for recycling, because the
+ * segments from the prior's prior, fully-sized checkpoint cycle are no
+ * longer needed. However, the next checkpoint will make only few segments
+ * available for recycling, the ones generated between the timed
+ * checkpoint and the manual one right after that. If at the manual
+ * checkpoint we only retained enough segments to get us to the next timed
+ * one, and removed the rest, then at the next checkpoint we would not
+ * have enough segments around for recycling, to get us to the checkpoint
+ * after that. Basing the calculations on the distance from the prior redo
+ * pointer largely fixes that problem.
+ */
+ distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
+ /* add 10% for good measure. */
+ distance *= 1.10;
+
+ recycleSegNo = (XLogSegNo) ceil(((double) PriorRedoPtr + distance) / XLOG_SEG_SIZE);
+
+ if (recycleSegNo < minSegNo)
+ recycleSegNo = minSegNo;
+ if (recycleSegNo > maxSegNo)
+ recycleSegNo = maxSegNo;
+
+ return recycleSegNo;
+}
+
+/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
* new_segno indicates a log file that has just been filled up (or read
@@ -2764,7 +2864,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
char zbuffer_raw[XLOG_BLCKSZ + MAXIMUM_ALIGNOF];
char *zbuffer;
XLogSegNo installed_segno;
- int max_advance;
+ XLogSegNo max_segno;
int fd;
int nbytes;
@@ -2867,9 +2967,19 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
* pre-create a future log segment.
*/
installed_segno = logsegno;
- max_advance = XLOGfileslop;
+
+ /*
+ * XXX: What should we use as max_segno? We used to use XLOGfileslop when
+ * that was a constant, but that was always a bit dubious: normally, at a
+ * checkpoint, XLOGfileslop was the offset from the checkpoint record,
+ * but here, it was the offset from the insert location. We can't do the
+ * normal XLOGfileslop calculation here because we don't have access to
+ * the prior checkpoint's redo location. So somewhat arbitrarily, just
+ * use CheckPointSegments.
+ */
+ max_segno = logsegno + CheckPointSegments;
if (!InstallXLogFileSegment(&installed_segno, tmppath,
- *use_existent, &max_advance,
+ *use_existent, max_segno,
use_lock))
{
/*
@@ -3010,7 +3120,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
/*
* Now move the segment into place with its final name.
*/
- if (!InstallXLogFileSegment(&destsegno, tmppath, false, NULL, false))
+ if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
elog(ERROR, "InstallXLogFileSegment should not have failed");
}
@@ -3030,22 +3140,21 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
* number at or after the passed numbers. If FALSE, install the new segment
* exactly where specified, deleting any existing segment file there.
*
- * *max_advance: maximum number of segno slots to advance past the starting
- * point. Fail if no free slot is found in this range. On return, reduced
- * by the number of slots skipped over. (Irrelevant, and may be NULL,
- * when find_free is FALSE.)
+ * max_segno: maximum segment number to install the new file as. Fail if no
+ * free slot is found between *segno and max_segno. (Ignored when find_free
+ * is FALSE.)
*
* use_lock: if TRUE, acquire ControlFileLock while moving file into
* place. This should be TRUE except during bootstrap log creation. The
* caller must *not* hold the lock at call.
*
* Returns TRUE if the file was installed successfully. FALSE indicates that
- * max_advance limit was exceeded, or an error occurred while renaming the
+ * max_segno limit was exceeded, or an error occurred while renaming the
* file into place.
*/
static bool
InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
- bool find_free, int *max_advance,
+ bool find_free, XLogSegNo max_segno,
bool use_lock)
{
char path[MAXPGPATH];
@@ -3069,7 +3178,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
/* Find a free slot to put it in */
while (stat(path, &stat_buf) == 0)
{
- if (*max_advance <= 0)
+ if ((*segno) >= max_segno)
{
/* Failed to find a free slot within specified range */
if (use_lock)
@@ -3077,7 +3186,6 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
return false;
}
(*segno)++;
- (*max_advance)--;
XLogFilePath(path, ThisTimeLineID, *segno);
}
}
@@ -3425,14 +3533,15 @@ UpdateLastRemovedPtr(char *filename)
/*
* Recycle or remove all log files older or equal to passed segno
*
- * endptr is current (or recent) end of xlog; this is used to determine
+ * endptr is current (or recent) end of xlog, and PriorRedoRecPtr is the
+ * redo pointer of the previous checkpoint. These are used to determine
* whether we want to recycle rather than delete no-longer-wanted log files.
*/
static void
-RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
+RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
{
XLogSegNo endlogSegNo;
- int max_advance;
+ XLogSegNo recycleSegNo;
DIR *xldir;
struct dirent *xlde;
char lastoff[MAXFNAMELEN];
@@ -3444,11 +3553,10 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
struct stat statbuf;
/*
- * Initialize info about where to try to recycle to. We allow recycling
- * segments up to XLOGfileslop segments beyond the current XLOG location.
+ * Initialize info about where to try to recycle to.
*/
XLByteToPrevSeg(endptr, endlogSegNo);
- max_advance = XLOGfileslop;
+ recycleSegNo = XLOGfileslop(PriorRedoPtr);
xldir = AllocateDir(XLOGDIR);
if (xldir == NULL)
@@ -3497,20 +3605,17 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
* for example can create symbolic links pointing to a
* separate archive directory.
*/
- if (lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
+ if (endlogSegNo <= recycleSegNo &&
+ lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
InstallXLogFileSegment(&endlogSegNo, path,
- true, &max_advance, true))
+ true, recycleSegNo, true))
{
ereport(DEBUG2,
(errmsg("recycled transaction log file \"%s\"",
xlde->d_name)));
CheckpointStats.ckpt_segs_recycled++;
/* Needn't recheck that slot on future iterations */
- if (max_advance > 0)
- {
- endlogSegNo++;
- max_advance--;
- }
+ endlogSegNo++;
}
else
{
@@ -7598,7 +7703,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
+ "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "distance=%d KB, estimate=%d KB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
@@ -7610,7 +7716,48 @@ LogCheckpointEnd(bool restartpoint)
total_secs, total_usecs / 1000,
CheckpointStats.ckpt_sync_rels,
longest_secs, longest_usecs / 1000,
- average_secs, average_usecs / 1000);
+ average_secs, average_usecs / 1000,
+ (int) (PrevCheckPointDistance / 1024.0),
+ (int) (CheckPointDistanceEstimate / 1024.0));
+}
+
+/*
+ * Update the estimate of distance between checkpoints.
+ *
+ * The estimate is used to calculate the number of WAL segments to keep
+ * preallocated, see XLOGFileSlop().
+ */
+static void
+UpdateCheckPointDistanceEstimate(uint64 nbytes)
+{
+ /*
+ * To estimate the number of segments consumed between checkpoints, keep
+ * a moving average of the actual number of segments consumed in previous
+ * checkpoint cycles. However, if the load is bursty, with quiet periods
+ * and busy periods, we want to cater for the peak load. So instead of a
+ * plain moving average, let the average decline slowly if the previous
+ * cycle used less WAL than estimated, but bump it up immediately if it
+ * used more.
+ *
+ * When checkpoints are triggered by checkpoint_wal_size, this should
+ * converge to CheckpointSegments * XLOG_SEG_SIZE,
+ *
+ * Note: This doesn't pay any attention to what caused the checkpoint.
+ * Checkpoints triggered manually with CHECKPOINT command, or by e.g
+ * starting a base backup, are counted the same as those created
+ * automatically. The slow-decline will largely mask them out, if they are
+ * not frequent. If they are frequent, it seems reasonable to count them
+ * in as any others; if you issue a manual checkpoint every 5 minutes and
+ * never let a timed checkpoint happen, it makes sense to base the
+ * preallocation on that 5 minute interval rather than whatever
+ * checkpoint_timeout is set to.
+ */
+ PrevCheckPointDistance = nbytes;
+ if (CheckPointDistanceEstimate < nbytes)
+ CheckPointDistanceEstimate = nbytes;
+ else
+ CheckPointDistanceEstimate =
+ (0.90 * CheckPointDistanceEstimate + 0.10 * (double) nbytes);
}
/*
@@ -7650,7 +7797,7 @@ CreateCheckPoint(int flags)
XLogRecPtr recptr;
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint32 freespace;
- XLogSegNo _logSegNo;
+ XLogRecPtr PriorRedoPtr;
XLogRecPtr curInsert;
VirtualTransactionId *vxids;
int nvxids;
@@ -7965,10 +8112,10 @@ CreateCheckPoint(int flags)
(errmsg("concurrent transaction log activity while database system is shutting down")));
/*
- * Select point at which we can truncate the log, which we base on the
- * prior checkpoint's earliest info.
+ * Remember the prior checkpoint's redo pointer, used later to determine
+ * the point where the log can be truncated.
*/
- XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
+ PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update the control file.
@@ -8023,11 +8170,17 @@ CreateCheckPoint(int flags)
* Delete old log files (those no longer needed even for previous
* checkpoint or the standbys in XLOG streaming).
*/
- if (_logSegNo)
+ if (PriorRedoPtr != InvalidXLogRecPtr)
{
+ XLogSegNo _logSegNo;
+
+ /* Update the average distance between checkpoints. */
+ UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
+
+ XLByteToSeg(PriorRedoPtr, _logSegNo);
KeepLogSeg(recptr, &_logSegNo);
_logSegNo--;
- RemoveOldXlogFiles(_logSegNo, recptr);
+ RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, recptr);
}
/*
@@ -8195,7 +8348,7 @@ CreateRestartPoint(int flags)
{
XLogRecPtr lastCheckPointRecPtr;
CheckPoint lastCheckPoint;
- XLogSegNo _logSegNo;
+ XLogRecPtr PriorRedoPtr;
TimestampTz xtime;
/*
@@ -8260,7 +8413,7 @@ CreateRestartPoint(int flags)
/*
* Update the shared RedoRecPtr so that the startup process can calculate
* the number of segments replayed since last restartpoint, and request a
- * restartpoint if it exceeds checkpoint_segments.
+ * restartpoint if it exceeds CheckPointSegments.
*
* Like in CreateCheckPoint(), hold off insertions to update it, although
* during recovery this is just pro forma, because no WAL insertions are
@@ -8291,10 +8444,10 @@ CreateRestartPoint(int flags)
CheckPointGuts(lastCheckPoint.redo, flags);
/*
- * Select point at which we can truncate the xlog, which we base on the
- * prior checkpoint's earliest info.
+ * Remember the prior checkpoint's redo pointer, used later to determine
+ * the point at which we can truncate the log.
*/
- XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
+ PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update pg_control, using current time. Check that it still shows
@@ -8321,12 +8474,15 @@ CreateRestartPoint(int flags)
* checkpoint/restartpoint) to prevent the disk holding the xlog from
* growing full.
*/
- if (_logSegNo)
+ if (PriorRedoPtr != InvalidXLogRecPtr)
{
XLogRecPtr receivePtr;
XLogRecPtr replayPtr;
TimeLineID replayTLI;
XLogRecPtr endptr;
+ XLogSegNo _logSegNo;
+
+ XLByteToSeg(PriorRedoPtr, _logSegNo);
/*
* Get the current end of xlog replayed or received, whichever is
@@ -8355,7 +8511,7 @@ CreateRestartPoint(int flags)
if (RecoveryInProgress())
ThisTimeLineID = replayTLI;
- RemoveOldXlogFiles(_logSegNo, endptr);
+ RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, endptr);
/*
* Make more log segments if needed. (Do this after recycling old log
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 8a79d9b..1183793 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -482,7 +482,7 @@ CheckpointerMain(void)
"checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs,
elapsed_secs),
- errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
+ errhint("Consider increasing the configuration parameter \"checkpoint_wal_size\".")));
/*
* Initialize checkpointer-private variables used during
@@ -760,11 +760,11 @@ IsCheckpointOnSchedule(double progress)
return false;
/*
- * Check progress against WAL segments written and checkpoint_segments.
+ * Check progress against WAL segments written and CheckPointSegments.
*
* We compare the current WAL insert location against the location
* computed before calling CreateCheckPoint. The code in XLogInsert that
- * actually triggers a checkpoint when checkpoint_segments is exceeded
+ * actually triggers a checkpoint when CheckPointSegments is exceeded
* compares against RedoRecptr, so this is not completely accurate.
* However, it's good enough for our purposes, we're only calculating an
* estimate anyway.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 77c3494..a2aad83 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2086,16 +2086,28 @@ static struct config_int ConfigureNamesInt[] =
},
{
- {"checkpoint_segments", PGC_SIGHUP, WAL_CHECKPOINTS,
- gettext_noop("Sets the maximum distance in log segments between automatic WAL checkpoints."),
- NULL
+ {"min_recycle_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Sets the minimum size to shrink the WAL to."),
+ NULL,
+ GUC_UNIT_KB
},
- &CheckPointSegments,
- 3, 1, INT_MAX,
+ &min_recycle_wal_size,
+ 81920, 32768, INT_MAX,
NULL, NULL, NULL
},
{
+ {"checkpoint_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Sets the maximum WAL size that triggers a checkpoint."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &checkpoint_wal_size,
+ 131072, 32768, INT_MAX,
+ NULL, assign_checkpoint_wal_size, NULL
+ },
+
+ {
{"checkpoint_timeout", PGC_SIGHUP, WAL_CHECKPOINTS,
gettext_noop("Sets the maximum time between automatic WAL checkpoints."),
NULL,
@@ -2711,7 +2723,7 @@ static struct config_real ConfigureNamesReal[] =
},
&CheckPointCompletionTarget,
0.5, 0.0, 1.0,
- NULL, NULL, NULL
+ NULL, assign_checkpoint_completion_target, NULL
},
/* End-of-list marker */
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..fc276a8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -197,8 +197,9 @@
# - Checkpoints -
-#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
+#checkpoint_wal_size = 128MB
+#min_recycle_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d06fbc0..d15b8f1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -89,7 +89,8 @@ extern XLogRecPtr XactLastRecEnd;
extern bool reachedConsistency;
/* these variables are GUC parameters related to XLOG */
-extern int CheckPointSegments;
+extern int min_recycle_wal_size;
+extern int checkpoint_wal_size;
extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
@@ -100,6 +101,8 @@ extern bool fullPageWrites;
extern bool wal_log_hints;
extern bool log_checkpoints;
+extern int CheckPointSegments;
+
/* WAL levels */
typedef enum WalLevel
{
@@ -245,6 +248,9 @@ extern bool CheckPromoteSignal(void);
extern void WakeupRecovery(void);
extern void SetWalWriterSleeping(bool sleeping);
+extern void assign_checkpoint_wal_size(int newval, void *extra);
+extern void assign_checkpoint_completion_target(double newval, void *extra);
+
/*
* Starting/stopping a base backup
*/
On 09/01/2013 10:37 AM, Amit Kapila wrote:
On Sat, Aug 24, 2013 at 2:38 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
a.
In XLogFileInit(),
/*
! * XXX: What should we use as max_segno? We used to use XLOGfileslop when
! * that was a constant, but that was always a bit dubious: normally, at a
! * checkpoint, XLOGfileslop was the offset from the checkpoint record,
! * but here, it was the offset from the insert location. We can't do the
! * normal XLOGfileslop calculation here because we don't have access to
! * the prior checkpoint's redo location. So somewhat arbitrarily, just
! * use CheckPointSegments.
! */
! max_segno = logsegno + CheckPointSegments;
if (!InstallXLogFileSegment(&installed_segno, tmppath,
! *use_existent, max_segno,
use_lock))
Earlier, max_advance was the same whether InstallXLogFileSegment was called
from RemoveOldXlogFiles() or from XLogFileInit(), but now the two values will
be different (and there seems to be no direct relation between them). Will
that be okay in the scenario where someone else has created the file while
this function was filling it, since the file then needs to be installed as a
future segment, which is decided based on max_segno?
I haven't really thought hard about the above. As the comment says,
passing the same max_advance value here and in RemoveOldXlogFiles() was
a bit dubious too, because the reference point was different.
I believe it's quite rare that two processes create a new WAL segment
concurrently, so it isn't terribly important what we do here.
b. Does CreateRestartPoint need to update CheckPointDistanceEstimate? When it
tries to remove old xlog files, it needs recycleSegNo, which is calculated
using CheckPointDistanceEstimate.
Yeah, you're right, it should. I haven't tested this with archive
recovery or replication at all yet.
As a developer, I would love to have a configuration knob such as
min_recycle_wal_size, but I'm not sure how many users will be comfortable
setting this value. Actually, the few users I have talked to about this
earlier are interested in setting a max WAL size, which allows them to put an
upper limit on the space required by WAL.
Can't we do the calculation of files to recycle based only on
CheckPointDistanceEstimate?
You can always just leave min_recycle_wal_size to the default. It sets a
minimum for the number of preallocated segments, which can help if you
have spikes that consume a lot of WAL, like nightly batch jobs. But if
you don't have such spikes, or the overhead of creating new segments
when such a spike happens isn't too large, you don't need to set it.
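For instance (invented numbers): if a nightly batch load writes about 2 GB of
WAL while the daytime workload only needs a couple of hundred MB per
checkpoint cycle, setting min_recycle_wal_size = 2GB keeps those segments
around for reuse every night, instead of deleting them each morning and
creating new ones each evening.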
One idea is to try to make the creation of new WAL segments faster. Then
it wouldn't hurt so much if you run out of preallocated/recycled
segments and need to suddenly create a lot of new ones. Then we might
not need a minimum setting at all.
- Heikki
Heikki,
Thanks for getting back to this! I really look forward to simplifying
WAL tuning for users.
min_recycle_wal_size
checkpoint_wal_size
<snip>
These settings are fairly intuitive for a DBA to tune. You begin by
figuring out how much disk space you can afford to spend on WAL, and set
checkpoint_wal_size to that (with some safety margin, of course). Then
you set checkpoint_timeout based on how long you're willing to wait for
recovery to finish. Finally, if you have infrequent batch jobs that need
a lot more WAL than the system otherwise needs, you can set
min_recycle_wal_size to keep enough WAL preallocated for the spikes.
We'll want to rename them to make it even *more* intuitive.
Sure, I'm all ears.
My suggestion:
max_wal_size
min_wal_size
... these would be very easy to read & understand for users: "Set
max_wal_size based on the amount of space you have available for the
transaction log, or about 10% of the space available for your database
if you don't have a specific allocation for the log. If your database
involves large batch imports, you may want to increase min_wal_size to
be at least the size of your largest batch."
Suggested defaults:
max_wal_size: 256MB
min_wal_size: 64MB
Please remind me because I'm having trouble finding this in the
archives: how does wal_keep_segments interact with the new settings?
But ... do I understand things correctly that checkpoint wouldn't "kick
in" until you hit checkpoint_wal_size? If that's the case, isn't real
disk space usage around 2X checkpoint_wal_size if spread checkpoint is
set to 0.9? Or does checkpoint kick in sometime earlier?
It kicks in earlier, so that the checkpoint *completes* just when
checkpoint_wal_size of WAL is used up. So the real disk usage is
checkpoint_wal_size.
Awesome. This makes me very happy.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 01/01/2015 03:24 AM, Josh Berkus wrote:
Please remind me because I'm having trouble finding this in the
archives: how does wal_keep_segments interact with the new settings?
It's not very straightforward. First of all, min_recycle_wal_size has a
different effect than wal_keep_segments. Raising min_recycle_wal_size
causes more segments to be recycled rather than deleted, while
wal_keep_segments causes old segments to be retained as old segments, so
that they can be used for streaming replication. If you raise
min_recycle_wal_size, it will not do any good for streaming replication.
wal_keep_segments does not affect the calculation of CheckPointSegments.
If you set wal_keep_segments high enough, checkpoint_wal_size will be
exceeded. The other alternative would be to force a checkpoint earlier,
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
honored. However, if you set wal_keep_segments high enough, higher than
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
matter how frequently you checkpoint.
It's not totally straightforward to calculate what maximum size of WAL a
given wal_keep_segments settings will force. wal_keep_segments is taken
into account at a checkpoint, when we recycle old WAL segments. For
example, imagine that prior checkpoint started at segment 95, a new
checkpoint finishes at segment 100, and wal_keep_segments=10. Because of
wal_keep_segments, we have to retain segments 90-95, which could
otherwise be recycled. So that forces a WAL size of 10 segments, while
otherwise 5 would be enough. However, before we reach the next
checkpoint, let's assume it will complete at segment 105, we will
consume five more segments, so the actual max WAL size is 15 segments.
However, we could start recycling the segments 90-95 before we reach the
next checkpoint, because wal_keep_segments stipulates how many segments
from the current *insert* location need to be retained, with no regard
to checkpoints. But we only attempt to recycle segments at checkpoints.
So that could be made more straightforward if we recycled old segments
in the background, between checkpoints. That might allow merging
wal_keep_segments and min_recycle_wal_size settings, too: instead of
renaming all old recycleable segments at a checkpoint, you could keep
them around as old segments until they're actually needed for reuse, so
they could be used for streaming replication up to that point.
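In other words, very roughly (just restating the example above, not an exact
formula from the patch):
    peak pg_xlog ~ (wal_keep_segments + segments used per checkpoint cycle) * 16 MB
                 ~ (10 + 5) * 16 MB = 240 MB in the example.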
- Heikki
On 01/02/2015 01:57 AM, Heikki Linnakangas wrote:
wal_keep_segments does not affect the calculation of CheckPointSegments.
If you set wal_keep_segments high enough, checkpoint_wal_size will be
exceeded. The other alternative would be to force a checkpoint earlier,
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
honored. However, if you set wal_keep_segments high enough, higher than
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
matter how frequently you checkpoint.
So you're saying that wal_keep_segments is part of the max_wal_size
total, NOT in addition to it?
Just asking for clarification, here. I think that's a fine idea, I just
want to make sure I understood you. The importance of wal_keep_segments
will be fading as more people use replication slots.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 01/03/2015 12:28 AM, Josh Berkus wrote:
On 01/02/2015 01:57 AM, Heikki Linnakangas wrote:
wal_keep_segments does not affect the calculation of CheckPointSegments.
If you set wal_keep_segments high enough, checkpoint_wal_size will be
exceeded. The other alternative would be to force a checkpoint earlier,
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
honored. However, if you set wal_keep_segments high enough, higher than
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
matter how frequently you checkpoint.
So you're saying that wal_keep_segments is part of the max_wal_size
total, NOT in addition to it?
Not sure what you mean. wal_keep_segments is an extra control that can
prevent WAL segments from being recycled. It has the same effect as
archive_command failing for N most recent segments, if that helps.
Just asking for clarification, here. I think that's a fine idea, I just
want to make sure I understood you. The importance of wal_keep_segments
will be fading as more people use replication slots.
Yeah.
- Heikki
On 01/03/2015 12:56 AM, Heikki Linnakangas wrote:
On 01/03/2015 12:28 AM, Josh Berkus wrote:
On 01/02/2015 01:57 AM, Heikki Linnakangas wrote:
wal_keep_segments does not affect the calculation of CheckPointSegments.
If you set wal_keep_segments high enough, checkpoint_wal_size will be
exceeded. The other alternative would be to force a checkpoint earlier,
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
honored. However, if you set wal_keep_segments high enough, higher than
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
matter how frequently you checkpoint.
So you're saying that wal_keep_segments is part of the max_wal_size
total, NOT in addition to it?
Not sure what you mean. wal_keep_segments is an extra control that can
prevent WAL segments from being recycled. It has the same effect as
archive_command failing for N most recent segments, if that helps.
I mean, if I have these settings:
max_wal_size* = 256MB
wal_keep_segments = 8
... then my max wal size is *still* 256MB, NOT 384MB?
If that's the case (and I think it's a good plan), then as a follow-on,
we should prevent users from setting wal_keep_segments to more than 50%
of max_wal_size, no?
(* max_wal_size == checkpoint_wal_size, per prior email)
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 01/04/2015 11:44 PM, Josh Berkus wrote:
On 01/03/2015 12:56 AM, Heikki Linnakangas wrote:
On 01/03/2015 12:28 AM, Josh Berkus wrote:
On 01/02/2015 01:57 AM, Heikki Linnakangas wrote:
wal_keep_segments does not affect the calculation of CheckPointSegments.
If you set wal_keep_segments high enough, checkpoint_wal_size will be
exceeded. The other alternative would be to force a checkpoint earlier,
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
honored. However, if you set wal_keep_segments high enough, higher than
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
matter how frequently you checkpoint.
So you're saying that wal_keep_segments is part of the max_wal_size
total, NOT in addition to it?
Not sure what you mean. wal_keep_segments is an extra control that can
prevent WAL segments from being recycled. It has the same effect as
archive_command failing for N most recent segments, if that helps.
I mean, if I have these settings:
max_wal_size* = 256MB
wal_keep_segments = 8
... then my max wal size is *still* 256MB, NOT 384MB?
Right.
If that's the case (and I think it's a good plan), then as a follow-on,
we should prevent users from setting wal_keep_segments to more than 50%
of max_wal_size, no?
Not sure if the 50% figure is correct, but I see what you mean: don't
allow setting wal_keep_segments so high that we would exceed
max_wal_size because of it.
- Heikki
On 2015-01-05 11:34:54 +0200, Heikki Linnakangas wrote:
On 01/04/2015 11:44 PM, Josh Berkus wrote:
On 01/03/2015 12:56 AM, Heikki Linnakangas wrote:
On 01/03/2015 12:28 AM, Josh Berkus wrote:
On 01/02/2015 01:57 AM, Heikki Linnakangas wrote:
wal_keep_segments does not affect the calculation of CheckPointSegments.
If you set wal_keep_segments high enough, checkpoint_wal_size will be
exceeded. The other alternative would be to force a checkpoint earlier,
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
honored. However, if you set wal_keep_segments high enough, higher than
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
matter how frequently you checkpoint.
So you're saying that wal_keep_segments is part of the max_wal_size
total, NOT in addition to it?
Not sure what you mean. wal_keep_segments is an extra control that can
prevent WAL segments from being recycled. It has the same effect as
archive_command failing for N most recent segments, if that helps.
I mean, if I have these settings:
max_wal_size* = 256MB
wal_keep_segments = 8
... then my max wal size is *still* 256MB, NOT 384MB?
Right.
Do you mean that wal_keep_segments has *no* influence over
checkpoint pacing, or the contrary? Upthread you imply that it
doesn't, but your later comments may suggest the contrary.
I think that influencing the pacing would be pretty insane - the user
certainly doesn't expect drastic performance changes when changing
wal_keep_segments. It's confusing enough that it can cause slight
performance variations due to recycling, but we shouldn't make it have a
larger influence.
If that's the case (and I think it's a good plan), then as a follow-on,
we should prevent users from setting wal_keep_segments to more than 50%
of max_wal_size, no?
Not sure if the 50% figure is correct, but I see what you mean: don't allow
setting wal_keep_segments so high that we would exceed max_wal_size because
of it.
That seems an unrealistic goal. I've seen setups that have set
checkpoint_segments intentionally, and with good reasoning, north of
50k.
Neither wal_keep_segments, nor a failing archive_command, nor replication
slots should have an influence on checkpoint pacing.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 01/05/2015 12:06 PM, Andres Freund wrote:
On 2015-01-05 11:34:54 +0200, Heikki Linnakangas wrote:
On 01/04/2015 11:44 PM, Josh Berkus wrote:
On 01/03/2015 12:56 AM, Heikki Linnakangas wrote:
On 01/03/2015 12:28 AM, Josh Berkus wrote:
On 01/02/2015 01:57 AM, Heikki Linnakangas wrote:
wal_keep_segments does not affect the calculation of CheckPointSegments.
If you set wal_keep_segments high enough, checkpoint_wal_size will be
exceeded. The other alternative would be to force a checkpoint earlier,
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
honored. However, if you set wal_keep_segments high enough, higher than
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
matter how frequently you checkpoint.
So you're saying that wal_keep_segments is part of the max_wal_size
total, NOT in addition to it?
Not sure what you mean. wal_keep_segments is an extra control that can
prevent WAL segments from being recycled. It has the same effect as
archive_command failing for N most recent segments, if that helps.
I mean, if I have these settings:
max_wal_size* = 256MB
wal_keep_segments = 8
... then my max wal size is *still* 256MB, NOT 384MB?
Right.
With that you mean that wal_keep_segments has *no* influence over
checkpoint pacing or the contrary? Because upthread you imply that it
doesn't, but later comments may mean the contrary.
wal_keep_segments does not influence checkpoint pacing.
If that's the case (and I think it's a good plan), then as a follow-on,
we should prevent users from setting wal_keep_segments to more than 50%
of max_wal_size, no?
Not sure if the 50% figure is correct, but I see what you mean: don't allow
setting wal_keep_segments so high that we would exceed max_wal_size because
of it.
I wasn't clear on my opinion here. I think I understood what Josh meant,
but I don't think we should do it. Seems like unnecessary nannying of
the DBA. Let's just mention in the manual that if you set
wal_keep_segments higher than [insert formula here], you will routinely
have more WAL in pg_xlog than what checkpoint_wal_size is set to.
That seems a unrealistic goal. I've seen setups that have set
checkpoint_segments intentionally, and with good reasoning, north of
50k.
So? I don't see how that's relevant.
Neither wal_keep_segments, nor failing archive_commands nor replication
slot should have an influence on checkpoint pacing.
Agreed.
- Heikki
On 01/05/2015 09:06 AM, Heikki Linnakangas wrote:
I wasn't clear on my opinion here. I think I understood what Josh meant,
but I don't think we should do it. Seems like unnecessary nannying of
the DBA. Let's just mention in the manual that if you set
wal_keep_segments higher than [insert formula here], you will routinely
have more WAL in pg_xlog than what checkpoint_wal_size is set to.That seems a unrealistic goal. I've seen setups that have set
checkpoint_segments intentionally, and with good reasoning, north of
50k.So? I don't see how that's relevant.
Neither wal_keep_segments, nor failing archive_commands nor replication
slot should have an influence on checkpoint pacing.Agreed.
Oh, right, slots can also cause the log to increase in size. And we've
already had the discussion about hard limits, which is maybe a future
feature and not part of this patch.
Can we figure out a reasonable formula? My thinking is 50% for
wal_keep_segments, because we need at least 50% of the wals to do a
reasonable spread checkpoint. If max_wal_size is 1GB, and
wal_keep_segments is 1.5GB, what would happen? What if
wal_keep_segments is 0.9GB?
I need to create a fake benchmark for this ...
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Jan 2, 2015 at 3:27 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:
On 01/01/2015 03:24 AM, Josh Berkus wrote:
Please remind me because I'm having trouble finding this in the
archives: how does wal_keep_segments interact with the new settings?
It's not very straightforward. First of all, min_recycle_wal_size has a
different effect than wal_keep_segments. Raising min_recycle_wal_size
causes more segments to be recycled rather than deleted, while
wal_keep_segments causes old segments to be retained as old segments, so
that they can be used for streaming replication. If you raise
min_recycle_wal_size, it will not do any good for streaming replication.
wal_keep_segments does not affect the calculation of CheckPointSegments.
If you set wal_keep_segments high enough, checkpoint_wal_size will be
exceeded. The other alternative would be to force a checkpoint earlier,
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
honored. However, if you set wal_keep_segments high enough, higher than
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no matter
how frequently you checkpoint.
Doesn't this indicate that we should have some correlation
between checkpoint_wal_size and wal_keep_segments?
It's not totally straightforward to calculate what maximum size of WAL a
given wal_keep_segments settings will force. wal_keep_segments is taken
into account at a checkpoint, when we recycle old WAL segments. For
example, imagine that prior checkpoint started at segment 95, a new
checkpoint finishes at segment 100, and wal_keep_segments=10. Because of
wal_keep_segments, we have to retain segments 90-95, which could otherwise
be recycled. So that forces a WAL size of 10 segments, while otherwise 5
would be enough. However, before we reach the next checkpoint, let's assume
it will complete at segment 105, we will consume five more segments, so the
actual max WAL size is 15 segments. However, we could start recycling the
segments 90-95 before we reach the next checkpoint, because
wal_keep_segments stipulates how many segments from the current *insert*
location need to be retained, without regard to checkpoints. But we only
attempt to recycle segments at checkpoints.
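To put that example in concrete numbers, here is a back-of-the-envelope
sketch (assuming 16 MB segments; the variable names are only for
illustration, they are not identifiers from the patch):

    /* Back-of-the-envelope for the example above (checkpoints at segments 95,
     * 100, 105; wal_keep_segments = 10; 16 MB segments). Sketch only. */
    #include <stdio.h>

    int main(void)
    {
        int segs_per_cycle    = 5;    /* WAL generated between checkpoints */
        int wal_keep_segments = 10;

        /* At the checkpoint finishing at segment 100 we must keep the last 10
         * segments, so segments 90-95 survive even though they could otherwise
         * be recycled. */
        int kept_at_checkpoint = wal_keep_segments;                    /* 10 */
        int peak_before_next   = kept_at_checkpoint + segs_per_cycle;  /* 15 */

        printf("kept at checkpoint: %d segments (~%d MB)\n",
               kept_at_checkpoint, kept_at_checkpoint * 16);
        printf("peak before next checkpoint: %d segments (~%d MB)\n",
               peak_before_next, peak_before_next * 16);
        return 0;
    }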
I am thinking that it might make sense to set checkpoint_wal_size
equal to wal_keep_segments in case wal_keep_segments is
greater than checkpoint_wal_size. It will not make any difference
in retaining WAL segments, but I think it can make checkpoints trigger
at more appropriate intervals. Won't this help in addressing the above
situation explained by you, to an extent, as it will make a new checkpoint
start a little later, which helps in removing the segments between
90-95 one cycle earlier?
So that could be made more straightforward if we recycled old segments in
the background, between checkpoints. That might allow merging
wal_keep_segments and min_recycle_wal_size settings, too: instead of
renaming all old recycleable segments at a checkpoint, you could keep them
around as old segments until they're actually needed for reuse, so they
could be used for streaming replication up to that point.
Are you imagining some other background process to do this
activity? Does it make sense if we try to do the same in the
foreground (I understand that it can impact the performance of that
session, but such a thing can maintain the WAL size better)?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
I really like the idea of tuning checkpoint segments based on disk space
usage.
I performed a series of tests for this patch and would like to share the
results. My comments are in-line.
To start with, I applied this patch to the master successfully -
But ... do I understand things correctly that checkpoint wouldn't "kick
in" until you hit checkpoint_wal_size? If that's the case, isn't real
disk space usage around 2X checkpoint_wal_size if spread checkpoint is
set to 0.9? Or does checkpoint kick in sometime earlier?
It kicks in earlier, so that the checkpoint *completes* just when
checkpoint_wal_size of WAL is used up. So the real disk usage is
checkpoint_wal_size.
There is still an internal variable called CheckPointSegments that
triggers the checkpoint, but it is now derived from checkpoint_wal_size
(see CalculateCheckpointSegments function):
CheckPointSegments = (checkpoint_wal_size / 16 MB) / (2 + checkpoint_completion_target)
Yes, I see this happening.
This is the same formula we've always had in the manual for calculating the
amount of WAL space used, but in reverse. I.e. we calculate
CheckPointSegments based on the desired disk space usage, not the other way
round.
As a note, pgBench would be a terrible test for this patch; we really
need something which creates uneven traffic. I'll see if I can devise
something.
Attached is a rebased version of this patch. Everyone, please try this out
on whatever workloads you have, and let me know:
a) How does the auto-tuning of the number of recycled segments work? Does
pg_xlog reach a steady-state size, or does it fluctuate a lot?
I performed the tests by executing heavy INSERT operations (INSERTS only)
using benchmarksql. I do see that pg_xlog size is increasing at times.
I have inserted about 6GB of data for testing.
Below are the test results.
*Test 1 :*
In this test, I see removed+recycled segments = 3 (except for the first 3
checkpoint cycles), and it has been steady throughout until the INSERT
operation completed.
Actual calculation of CheckPointSegments = 3.2 (it is getting rounded down to 3).
pg_xlog size is 128M and has increased to 160M max during the INSERT
operation.
shared_buffers = 128M
checkpoint_wal_size = 128M
min_recycle_wal_size = 80M
checkpoint_timeout = 5min
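As a cross-check of the 3.2 figure above, here is the arithmetic in a small
sketch, assuming the default checkpoint_completion_target of 0.5 (not patch
code, just the formula quoted earlier):

    /* Cross-check of the 3.2 figure above, assuming the default
     * checkpoint_completion_target of 0.5. Sketch only. */
    #include <stdio.h>

    int main(void)
    {
        double checkpoint_wal_size_mb = 128.0;   /* Test 1 setting */
        double completion_target      = 0.5;     /* assumed default */

        double segs = (checkpoint_wal_size_mb / 16.0) / (2.0 + completion_target);
        printf("CheckPointSegments = %.1f (used as %d)\n", segs, (int) segs);
        return 0;
    }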
TimeStamp=2015-01-27 09:39:14.325 GMT-10 DB=bsql SID=54c6cfd6.5e4
User=postgres LOG: statement: update order_line set ol_amount = 0.01;
TimeStamp=2015-01-27 09:39:15.407 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 09:39:18.680 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint complete: wrote 5123 buffers (31.3%); 0 transaction log file(s)
added, 1 removed, 0 recycled; write=0.593 s, sync=2.492 s, total=3.273 s;
sync files=26, longest=0.399 s, average=0.095 s; distance=52653 KB,
estimate=52653 KB
TimeStamp=2015-01-27 09:39:18.680 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoints are occurring too frequently (3 seconds apart)
TimeStamp=2015-01-27 09:39:18.680 GMT-10 DB= SID=54bee4a1.3002 User= HINT:
Consider increasing the configuration parameter "checkpoint_wal_size".
TimeStamp=2015-01-27 09:39:18.680 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 09:39:21.211 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint complete: wrote 8145 buffers (49.7%); 0 transaction log file(s)
added, 3 removed, 0 recycled; write=0.913 s, sync=1.476 s, total=2.530 s;
sync files=4, longest=0.534 s, average=0.369 s; distance=87446 KB,
estimate=87446 KB
TimeStamp=2015-01-27 09:39:21.211 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoints are occurring too frequently (3 seconds apart)
TimeStamp=2015-01-27 09:39:21.211 GMT-10 DB= SID=54bee4a1.3002 User= HINT:
Consider increasing the configuration parameter "checkpoint_wal_size".
TimeStamp=2015-01-27 09:39:21.211 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 09:39:23.169 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint complete: wrote 4598 buffers (28.1%); 0 transaction log file(s)
added, 3 removed, 2 recycled; write=0.716 s, sync=1.083 s, total=1.957 s;
sync files=4, longest=0.486 s, average=0.270 s; distance=47964 KB,
estimate=83498 KB
TimeStamp=2015-01-27 09:39:23.235 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoints are occurring too frequently (2 seconds apart)
TimeStamp=2015-01-27 09:39:23.235 GMT-10 DB= SID=54bee4a1.3002 User= HINT:
Consider increasing the configuration parameter "checkpoint_wal_size".
TimeStamp=2015-01-27 09:39:23.235 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 09:39:24.968 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint complete: wrote 3417 buffers (20.9%); 0 transaction log file(s)
added, 1 removed, 2 recycled; write=0.539 s, sync=1.059 s, total=1.732 s;
sync files=4, longest=0.535 s, average=0.264 s; distance=44814 KB,
estimate=79629 KB
TimeStamp=2015-01-27 09:39:25.118 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoints are occurring too frequently (2 seconds apart)
TimeStamp=2015-01-27 09:39:25.118 GMT-10 DB= SID=54bee4a1.3002 User= HINT:
Consider increasing the configuration parameter "checkpoint_wal_size".
TimeStamp=2015-01-27 09:39:25.118 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 09:39:26.879 GMT-10 DB= SID=54bee4a1.3002 User= LOG:
checkpoint complete: wrote 4721 buffers (28.8%); 0 transaction log file(s)
added, 1 removed, 2 recycled; write=0.474 s, sync=1.166 s, total=1.761 s;
sync files=4, longest=0.583 s, average=0.291 s; distance=49145 KB,
estimate=76581 KB
*Test 2 :*
removed+recycled segments remained 3 even after I increased
checkpoint_wal_size to 144M. This is obviously due to the calculation in
the CalculateCheckpointSegments() function.
checkpoint_wal_size = 144M
min_recycle_wal_size = 104M
checkpoint_timeout = 5min
shared_buffers = 1 GB
Actual calculation of CheckPointSegments = 3.6
TimeStamp=2015-01-27 13:54:38.469 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 13:54:42.831 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint complete: wrote 5419 buffers (4.1%); 0 transaction log file(s)
added, 0 removed, 3 recycled; write=2.408 s, sync=1.820 s, total=4.361 s;
sync files=3, longest=1.432 s, average=0.606 s; distance=48175 KB,
estimate=49972 KB
TimeStamp=2015-01-27 13:54:44.824 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 13:54:49.008 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint complete: wrote 5570 buffers (4.2%); 0 transaction log file(s)
added, 0 removed, 3 recycled; write=2.769 s, sync=1.268 s, total=4.184 s;
sync files=3, longest=0.843 s, average=0.422 s; distance=51720 KB,
estimate=51720 KB
TimeStamp=2015-01-27 13:54:50.754 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 13:54:55.127 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint complete: wrote 5155 buffers (3.9%); 0 transaction log file(s)
added, 0 removed, 3 recycled; write=2.977 s, sync=1.273 s, total=4.372 s;
sync files=3, longest=0.848 s, average=0.424 s; distance=46133 KB,
estimate=51161 KB
TimeStamp=2015-01-27 13:54:57.164 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 13:55:01.622 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint complete: wrote 5345 buffers (4.1%); 0 transaction log file(s)
added, 0 removed, 3 recycled; write=2.598 s, sync=1.290 s, total=4.458 s;
sync files=3, longest=0.894 s, average=0.430 s; distance=49604 KB,
estimate=51006 KB
TimeStamp=2015-01-27 13:55:03.501 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 13:55:07.390 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint complete: wrote 5482 buffers (4.2%); 0 transaction log file(s)
added, 0 removed, 3 recycled; write=2.549 s, sync=1.193 s, total=3.889 s;
sync files=3, longest=0.837 s, average=0.397 s; distance=49963 KB,
estimate=50901 KB
TimeStamp=2015-01-27 13:55:09.381 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 13:55:13.626 GMT-10 DB= SID=54c70b57.21a0 User= LOG:
checkpoint complete: wrote 5481 buffers (4.2%); 0 transaction log file(s)
added, 0 removed, 3 recycled; write=2.778 s, sync=1.280 s, total=4.244 s;
sync
*Test 3 :*
checkpoint_wal_size = 244M
min_recycle_wal_size = 204M
checkpoint_timeout = 5min
shared_buffers = 1 GB
removed+recycled segments remained 6.
Actual calculation of CheckPointSegments = 6.1
TimeStamp=2015-01-27 14:02:01.936 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 14:02:10.638 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint complete: wrote 14111 buffers (10.8%); 0 transaction log file(s)
added, 1 removed, 5 recycled; write=5.527 s, sync=2.719 s, total=8.701 s;
sync files=14, longest=1.789 s, average=0.194 s; distance=98617 KB,
estimate=99036 KB
TimeStamp=2015-01-27 14:02:14.243 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 14:02:22.783 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint complete: wrote 16524 buffers (12.6%); 0 transaction log file(s)
added, 1 removed, 5 recycled; write=7.013 s, sync=1.394 s, total=8.540 s;
sync files=3, longest=0.867 s, average=0.464 s; distance=98724 KB,
estimate=99005 KB
TimeStamp=2015-01-27 14:02:28.066 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 14:02:36.946 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint complete: wrote 16541 buffers (12.6%); 0 transaction log file(s)
added, 1 removed, 5 recycled; write=4.899 s, sync=3.801 s, total=8.879 s;
sync files=9, longest=2.800 s, average=0.422 s; distance=98719 KB,
estimate=98976 KB
TimeStamp=2015-01-27 14:02:40.611 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 14:02:48.066 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint complete: wrote 10998 buffers (8.4%); 0 transaction log file(s)
added, 1 removed, 5 recycled; write=4.874 s, sync=1.998 s, total=7.455 s;
sync files=3, longest=1.998 s, average=0.666 s; distance=98771 KB,
estimate=98956 KB
TimeStamp=2015-01-27 14:02:53.327 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-27 14:03:00.872 GMT-10 DB= SID=54c70d58.22f4 User= LOG:
checkpoint complete: wrote 10640 buffers (8.1%); 0 transaction log file(s)
added, 1 removed, 5 recycled; write=5.247 s, sync=2.097 s, total=7.544 s;
sync files=3, longest=1.640 s, average=0.699 s; distance=98624 KB,
estimate=98923 KB
*Test 4 :*
This time I tested with wal_keep_segments = 300 (4.8 GB)
checkpoint_wal_size = 512MB
min_recycle_wal_size = 80M
wal_keep_segments = 300
checkpoint_timeout = 5min
shared_buffers = 1 GB
Actual calculation of CheckPointSegments = 12.8
TimeStamp=2015-01-29 12:51:48.276 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-29 12:52:04.325 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint complete: wrote 20965 buffers (16.0%); 0 transaction log file(s)
added, 0 removed, 0 recycled; write=11.676 s, sync=3.830 s, total=16.049 s;
sync files=18, longest=2.991 s, average=0.212 s; distance=196705 KB,
estimate=196705 KB
TimeStamp=2015-01-29 12:52:16.068 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-29 12:52:33.529 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint complete: wrote 22009 buffers (16.8%); 1 transaction log file(s)
added, 0 removed, 0 recycled; write=12.705 s, sync=3.559 s, total=17.460 s;
sync files=3, longest=3.002 s, average=1.186 s; distance=200401 KB,
estimate=200401 KB
TimeStamp=2015-01-29 12:52:43.321 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
Since wal_keep_segments is 300, recycling or removal of the
transaction logs begins only after the required number of wal_keep_segments
are retained, which is 4.8G in this case.
removed+recycled has always been 12, except for the first 3 checkpoint
cycles after the pg_xlog size reached 4.8G.
TimeStamp=2015-01-29 13:03:29.167 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-29 13:03:58.401 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint complete: wrote 20316 buffers (15.5%); 0 transaction log file(s)
added, 0 removed, 0 recycled; write=11.963 s, sync=16.840 s, total=29.233
s; sync files=16, longest=15.137 s, average=1.052 s; distance=197432 KB,
estimate=197432 KB
TimeStamp=2015-01-29 13:04:05.451 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-29 13:04:52.416 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint complete: wrote 20280 buffers (15.5%); 0 transaction log file(s)
added, 5 removed, 8 recycled; write=10.989 s, sync=35.791 s, total=46.965
s; sync files=10, longest=17.927 s, average=3.579 s; distance=196668 KB,
estimate=197356 KB
TimeStamp=2015-01-29 13:04:52.635 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-29 13:05:15.520 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint complete: wrote 31394 buffers (24.0%); 0 transaction log file(s)
added, 0 removed, 10 recycled; write=10.270 s, sync=12.404 s, total=22.884
s; sync files=17, longest=5.014 s, average=0.729 s; distance=197961 KB,
estimate=197961 KB
TimeStamp=2015-01-29 13:05:20.356 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-29 13:05:35.060 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint complete: wrote 32731 buffers (25.0%); 0 transaction log file(s)
added, 0 removed, 10 recycled; write=11.433 s, sync=3.055 s, total=14.703
s; sync files=13, longest=1.300 s, average=0.235 s; distance=196510 KB,
estimate=197816 KB
TimeStamp=2015-01-29 13:05:43.059 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-29 13:05:59.518 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint complete: wrote 30264 buffers (23.1%); 0 transaction log file(s)
added, 0 removed, 12 recycled; write=10.687 s, sync=5.624 s, total=16.459
s; sync files=12, longest=3.971 s, average=0.468 s; distance=193348 KB,
estimate=197369 KB
TimeStamp=2015-01-29 13:06:07.371 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint starting: xlog
TimeStamp=2015-01-29 13:06:23.870 GMT-10 DB= SID=54c99ff1.5bc9 User= LOG:
checkpoint complete: wrote 30723 buffers (23.4%); 0 transaction log file(s)
added, 0 removed, 12 recycled; write=10.132 s, sync=6.159 s, total=16.498
s; sync file
b) Are the two GUCs, checkpoint_wal_size, and min_recycle_wal_size,
intuitive to set?
During my tests, I did not observe the significance of the min_recycle_wal_size
parameter yet. Of course, I had sufficient disk space for pg_xlog.
I would like to understand more about the "min_recycle_wal_size" parameter. In
theory, I only understand from the note in the patch that if the disk space
usage falls below a certain threshold, min_recycle_wal_size number of WALs
will be removed to accommodate future pg_xlog segments. I will try to test
this out. Please let me know if there is any specific test to understand
the min_recycle_wal_size behaviour.
I will try to perform some more stress testing with a different set of heavy
workloads and will share the results.
I did not review the patch code completely. Will comment once done.
Please share your thoughts on this.
Regards,
Venkata B N
On 01/30/2015 04:48 AM, Venkata Balaji N wrote:
I performed series of tests for this patch and would like to share the
results. My comments are in-line.
Thanks for the testing!
*Test 1 :*
In this test, i see removed+recycled segments = 3 (except for the first 3
checkpoint cycles) and has been steady through out until the INSERT
operation completed.
Actual calculation of CheckPointSegments = 3.2 ( is getting rounded up to 3 )
pg_xlog size is 128M and has increased to 160M max during the INSERT
operation.
shared_buffers = 128M
checkpoint_wal_size = 128M
min_recycle_wal_size = 80M
checkpoint_timeout = 5min
Hmm, did I understand correctly that pg_xlog peaked at 160MB, but most
of the time stayed at 128 MB? That sounds like it's working as designed;
checkpoint_wal_size is not a hard limit after all.
b) Are the two GUCs, checkpoint_wal_size, and min_recycle_wal_size,
intuitive to set?
During my tests, I did not observe the significance of min_recycle_wal_size
parameter yet. Ofcourse, i had sufficient disk space for pg_xlog.
I would like to understand more about "min_recycle_wal_size" parameter. In
theory, i only understand from the note in the patch that if the disk space
usage falls below certain threshold, min_recycle_wal_size number of WALs
will be removed to accommodate future pg_xlog segments. I will try to test
this out. Please let me know if there is any specific test to understand
min_recycle_wal_size behaviour.
min_recycle_wal_size comes into play when you have only light load, so
that checkpoints are triggered by checkpoint_timeout rather than
checkpoint_wal_size. In that scenario, the WAL usage will shrink down to
min_recycle_wal_size, but not below that. Did that explanation help? Can
you suggest changes to the docs to make it more clear?
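Roughly, the behaviour described above amounts to a floor on the amount of
WAL kept for recycling; a small sketch, with illustrative variable names
rather than the patch's actual identifiers:

    /* Sketch of the behaviour described above: under light load the amount of
     * WAL kept for recycling shrinks toward the recent need, but never below
     * min_recycle_wal_size. Names are illustrative, not the patch's. */
    #include <stdio.h>

    int main(void)
    {
        double min_recycle_wal_size_mb = 80.0;   /* GUC value from the tests above */
        double recent_wal_need_mb      = 30.0;   /* light load: little WAL per cycle */

        double keep_mb = recent_wal_need_mb;
        if (keep_mb < min_recycle_wal_size_mb)
            keep_mb = min_recycle_wal_size_mb;   /* shrink down to the floor, not below */

        printf("WAL kept around: %.0f MB (%d segments)\n",
               keep_mb, (int) (keep_mb / 16));
        return 0;
    }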
- Heikki
On Fri, Jan 30, 2015 at 3:58 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
During my tests, I did not observe the significance of
min_recycle_wal_size
parameter yet. Ofcourse, i had sufficient disk space for pg_xlog.
I would like to understand more about "min_recycle_wal_size" parameter. In
theory, i only understand from the note in the patch that if the disk
space
usage falls below certain threshold, min_recycle_wal_size number of WALs
will be removed to accommodate future pg_xlog segments. I will try to test
this out. Please let me know if there is any specific test to understand
min_recycle_wal_size behaviour.
min_recycle_wal_size comes into play when you have only light load, so that
checkpoints are triggered by checkpoint_timeout rather than
checkpoint_wal_size. In that scenario, the WAL usage will shrink down to
min_recycle_wal_size, but not below that. Did that explanation help? Can you
suggest changes to the docs to make it more clear?
First, as a general comment, I think we could do little that would
improve the experience of tuning PostgreSQL as much as getting this
patch committed with some reasonable default values for the settings
in question. Shipping with checkpoint_segments=3 is a huge obstacle
to good performance. It might be a reasonable value for
min_recycle_wal_size, but it's not a remotely reasonable upper bound
on WAL generated between checkpoints. We haven't increased that limit
even once in the 14 years we've had it (cf.
4d14fe0048cf80052a3ba2053560f8aab1bb1b22) and typical disk sizes have
grown by an order of magnitude since then.
Second, I *think* that these settings are symmetric and, if that's
right, then I suggest that they ought to be named symmetrically.
Basically, I think you've got min_checkpoint_segments (the number of
recycled segments we keep around always) and max_checkpoint_segments
(the maximum number of segments we can have between checkpoints),
essentially splitting the current role of checkpoint_segments in half.
I'd go so far as to suggest we use exactly that naming. It would be
reasonable to allow the value to be specified in MB rather than in
16MB units, and to specify it that way by default, but maybe a
unit-less value should have the old interpretation since everybody's
used to it. That would require adding GUC_UNIT_XSEG or similar, but
that seems OK.
Also, I'd like to propose that we set the default value of
max_checkpoint_segments/checkpoint_wal_size to something at least an
order of magnitude larger than the current default setting. I'll open
the bidding at 1600MB (aka 100). I expect some pushback here, but I
don't think this is unreasonable; some people will need to raise it
further. If you're generating 1600MB of WAL in 5 minutes, you're
either making the database bigger very quickly (in which case the
extra disk space that is consumed by the WAL will quickly blend into
the background) or you are updating the data already in the database
at a tremendous rate (in which case you are probably willing to burn
some disk space to have that go fast). Right now, it's impractical to
ship something like checkpoint_segments=100 because we'd eat all that
space even on tiny databases with no activity. But this patch fixes
that, so we might as well try to ship a default that's large enough to
use the database as something other than a toy.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2015-02-02 08:36:41 -0500, Robert Haas wrote:
Also, I'd like to propose that we set the default value of
max_checkpoint_segments/checkpoint_wal_size to something at least an
order of magnitude larger than the current default setting.
+1
I think we need to increase checkpoint_timeout too - that's actually
just as important for the default experience from my pov. 5 minutes
often just unnecessarily generates FPWs en masse.
I'll open the bidding at 1600MB (aka 100).
Fine with me.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/02/2015 04:21 PM, Andres Freund wrote:
Hi,
On 2015-02-02 08:36:41 -0500, Robert Haas wrote:
Also, I'd like to propose that we set the default value of
max_checkpoint_segments/checkpoint_wal_size to something at least an
order of magnitude larger than the current default setting.
+1
I don't agree with that principle. I wouldn't mind increasing it a
little bit, but not by an order of magnitude. For better or worse, *all*
our defaults are tuned toward small systems, and so that PostgreSQL
doesn't hog all the resources. We shouldn't make an exception for this.
I think we need to increase checkpoint_timeout too - that's actually
just as important for the default experience from my pov. 5 minutes
often just unnecessarily generates FPWs en masse.
I'll open the bidding at 1600MB (aka 100).
Fine with me.
I wouldn't object to raising it a little bit, but that's way too high.
It's entirely possible to have a small database that generates a lot of
WAL. A table that has only a few rows, but is updated very very
frequently, for example. And checkpointing such a database is quick too,
so frequent checkpoints are not a problem. You don't want to end up with
1.5 GB of WAL on a 100 MB database.
- Heikki
On 02/02/2015 03:36 PM, Robert Haas wrote:
Second, I *think* that these settings are symmetric and, if that's
right, then I suggest that they ought to be named symmetrically.
Basically, I think you've got min_checkpoint_segments (the number of
recycled segments we keep around always) and max_checkpoint_segments
(the maximum number of segments we can have between checkpoints),
essentially splitting the current role of checkpoint_segments in half.
I'd go so far as to suggest we use exactly that naming. It would be
reasonable to allow the value to be specified in MB rather than in
16MB units, and to specify it that way by default, but maybe a
unit-less value should have the old interpretation since everybody's
used to it. That would require adding GUC_UNIT_XSEG or similar, but
that seems OK.
Works for me. However, note that "max_checkpoint_segments = 10" doesn't
mean the same as current "checkpoint_segments = 10". With
checkpoint_segments = 10 you end up with about 2x-3x as much WAL as with
max_checkpoint_segments = 10. So the "everybody's used to it" argument
doesn't hold much water.
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 02/02/2015 04:21 PM, Andres Freund wrote:
On 2015-02-02 08:36:41 -0500, Robert Haas wrote:
Also, I'd like to propose that we set the default value of
max_checkpoint_segments/checkpoint_wal_size to something at
least an order of magnitude larger than the current default
setting.
+1
I don't agree with that principle. I wouldn't mind increasing it
a little bit, but not by an order of magnitude.
Especially without either confirming that this effect is no longer
present, or having an explanation for it:
/messages/by-id/4A44E58C0200002500027FCF@gw.wicourts.gov
Note that Greg Smith found the same effect on a machine without any
write caching, which shoots down my theory, at least on his
machine:
/messages/by-id/4BCCDAD5.3040101@2ndquadrant.com
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 3, 2015 at 7:31 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 02/02/2015 03:36 PM, Robert Haas wrote:
Second, I *think* that these settings are symmetric and, if that's
right, then I suggest that they ought to be named symmetrically.
Basically, I think you've got min_checkpoint_segments (the number of
recycled segments we keep around always) and max_checkpoint_segments
(the maximum number of segments we can have between checkpoints),
essentially splitting the current role of checkpoint_segments in half.
I'd go so far as to suggest we use exactly that naming. It would be
reasonable to allow the value to be specified in MB rather than in
16MB units, and to specify it that way by default, but maybe a
unit-less value should have the old interpretation since everybody's
used to it. That would require adding GUC_UNIT_XSEG or similar, but
that seems OK.
Works for me. However, note that "max_checkpoint_segments = 10" doesn't mean
the same as current "checkpoint_segments = 10". With checkpoint_segments =
10 you end up with about 2x-3x as much WAL as with max_checkpoint_segments =
10. So the "everybody's used to it" argument doesn't hold much water.
Hmm, that's surprising. Why does that happen?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/03/2015 05:19 PM, Robert Haas wrote:
On Tue, Feb 3, 2015 at 7:31 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 02/02/2015 03:36 PM, Robert Haas wrote:
Second, I *think* that these settings are symmetric and, if that's
right, then I suggest that they ought to be named symmetrically.
Basically, I think you've got min_checkpoint_segments (the number of
recycled segments we keep around always) and max_checkpoint_segments
(the maximum number of segments we can have between checkpoints),
essentially splitting the current role of checkpoint_segments in half.
I'd go so far as to suggest we use exactly that naming. It would be
reasonable to allow the value to be specified in MB rather than in
16MB units, and to specify it that way by default, but maybe a
unit-less value should have the old interpretation since everybody's
used to it. That would require adding GUC_UNIT_XSEG or similar, but
that seems OK.
Works for me. However, note that "max_checkpoint_segments = 10" doesn't mean
the same as current "checkpoint_segments = 10". With checkpoint_segments =
10 you end up with about 2x-3x as much WAL as with max_checkpoint_segments =
10. So the "everybody's used to it" argument doesn't hold much water.Hmm, that's surprising. Why does that happen?
That's the whole point of this patch. "max_checkpoint_segments = 10", or
"max_checkpoint_segments = 160 MB", means that the system will begin a
checkpoint so that when the checkpoint completes, and it truncates away
or recycles old WAL, the total size of pg_xlog is 160 MB.
That's different from our current checkpoint_segments setting. With
checkpoint_segments, the documented formula for calculating the disk
usage is (2 + checkpoint_completion_target) * checkpoint_segments * 16
MB. That's a lot less intuitive to set.
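For anyone comparing the two, a quick sketch of both calculations, assuming
the default checkpoint_completion_target of 0.5 and 16 MB segments
(illustration only, not patch code):

    /* Side-by-side sketch of the two calculations above, assuming the default
     * checkpoint_completion_target of 0.5 and 16 MB segments. */
    #include <stdio.h>

    int main(void)
    {
        double target = 0.5;

        /* Old style: checkpoint_segments = 10 -> documented disk usage */
        double old_usage_mb = (2.0 + target) * 10 * 16.0;               /* 400 MB */

        /* New style: the size is given directly, segments are derived */
        double new_cap_mb = 160.0;
        double derived_segments = (new_cap_mb / 16.0) / (2.0 + target); /* 4 */

        printf("checkpoint_segments = 10 -> ~%.0f MB of WAL\n", old_usage_mb);
        printf("new-style cap of 160 MB  -> CheckPointSegments = %.0f\n",
               derived_segments);
        return 0;
    }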
- Heikki
On Tue, Feb 3, 2015 at 10:44 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Works for me. However, note that "max_checkpoint_segments = 10" doesn't
mean
the same as current "checkpoint_segments = 10". With checkpoint_segments
=
10 you end up with about 2x-3x as much WAL as with
max_checkpoint_segments =
10. So the "everybody's used to it" argument doesn't hold much water.Hmm, that's surprising. Why does that happen?
That's the whole point of this patch. "max_checkpoint_segments = 10", or
"max_checkpoint_segments = 160 MB", means that the system will begin a
checkpoint so that when the checkpoint completes, and it truncates away or
recycles old WAL, the total size of pg_xlog is 160 MB.
That's different from our current checkpoint_segments setting. With
checkpoint_segments, the documented formula for calculating the disk usage
is (2 + checkpoint_completion_target) * checkpoint_segments * 16 MB. That's
a lot less intuitive to set.
Hmm, that's different from what I was thinking. We probably shouldn't
call that max_checkpoint_segments, then. I got confused and thought
you were just trying to decouple the number of segments that it takes
to trigger a checkpoint from the number we keep preallocated.
But I'm confused: how can we know how much new WAL will be written
before the checkpoint completes?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 03/02/15 16:50, Robert Haas wrote:
On Tue, Feb 3, 2015 at 10:44 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
That's the whole point of this patch. "max_checkpoint_segments = 10", or
"max_checkpoint_segments = 160 MB", means that the system will begin a
checkpoint so that when the checkpoint completes, and it truncates away or
recycles old WAL, the total size of pg_xlog is 160 MB.
That's different from our current checkpoint_segments setting. With
checkpoint_segments, the documented formula for calculating the disk usage
is (2 + checkpoint_completion_target) * checkpoint_segments * 16 MB. That's
a lot less intuitive to set.
Hmm, that's different from what I was thinking. We probably shouldn't
call that max_checkpoint_segments, then. I got confused and thought
you were just trying to decouple the number of segments that it takes
to trigger a checkpoint from the number we keep preallocated.
But I'm confused: how can we know how much new WAL will be written
before the checkpoint completes?
The preallocation is based on the estimated size of the next checkpoint, which is
basically a running average of the previous checkpoints with some
additional adjustments for unsteady behavior (the last checkpoint has a higher
weight in the formula).
(we also still internally have the CheckPointSegments which is
calculated the way Heikki described above)
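As an illustration of that estimate, a small sketch of a running average
weighted toward the most recent cycle. The 0.90/0.10 weighting is an
assumption for illustration, though it does reproduce the estimate= figures
in the Test 1 log earlier in the thread:

    /* Sketch of a "running average, weighted toward the most recent cycle"
     * estimate. The 0.90/0.10 weighting is assumed for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double estimate_kb   = 0.0;
        double distance_kb[] = { 52653, 87446, 47964, 44814, 49145 };
        int    ncycles       = sizeof(distance_kb) / sizeof(distance_kb[0]);

        for (int i = 0; i < ncycles; i++)
        {
            double d = distance_kb[i];

            if (d > estimate_kb)
                estimate_kb = d;                              /* react quickly to growth */
            else
                estimate_kb = 0.90 * estimate_kb + 0.10 * d;  /* decay slowly */

            printf("cycle %d: distance=%.0f KB, estimate=%.0f KB\n",
                   i + 1, d, estimate_kb);
        }
        return 0;
    }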
In any case, I don't like the max_checkpoint_segments naming too much,
and I don't even like the number of segments as a limit too much. I think
the ability to set this as an actual size is a quite nice property of this
patch, and as Heikki says the numbers don't map that well to the old ones
in practice.
I did some code reading and I do like the patch. Basically the only negative
thing I can say is that I am not a big fan of the _logSegNo variable name, but
that's not new in this patch; we use it all over the place in xlog.
I would vote for a bigger default for checkpoint_wal_size (or whatever
it will be named) though, since the current one is not much bigger in
practice than what we have now, and that one is way too conservative.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/03/2015 07:50 AM, Robert Haas wrote:
On Tue, Feb 3, 2015 at 10:44 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
That's the whole point of this patch. "max_checkpoint_segments = 10", or
"max_checkpoint_segments = 160 MB", means that the system will begin a
checkpoint so that when the checkpoint completes, and it truncates away or
recycles old WAL, the total size of pg_xlog is 160 MB.
That's different from our current checkpoint_segments setting. With
checkpoint_segments, the documented formula for calculating the disk usage
is (2 + checkpoint_completion_target) * checkpoint_segments * 16 MB. That's
a lot less intuitive to set.
Hmm, that's different from what I was thinking. We probably shouldn't
call that max_checkpoint_segments, then. I got confused and thought
you were just trying to decouple the number of segments that it takes
to trigger a checkpoint from the number we keep preallocated.
Wait, what? Because the new setting is an actual soft maximum, we
*shouldn't* call it a maximum? Or are you saying something else?
On 02/03/2015 04:25 AM, Heikki Linnakangas wrote:
On 02/02/2015 04:21 PM, Andres Freund wrote:
I think we need to increase checkpoint_timeout too - that's actually
just as important for the default experience from my pov. 5 minutes
often just unnecessarily generates FPWs en masse.
I have yet to see any serious benchmarking on checkpoint_timeout. It
does seem that for some workloads on some machines a longer timeout is
better, but I've also seen workloads where a longer timeout decreases
throughput or raises IO. So absent some hard numbers, I'd be opposed to
changing the default.
I'll open the bidding at 1600MB (aka 100).
Fine with me.
I wouldn't object to raising it a little bit, but that's way too high.
It's entirely possible to have a small database that generates a lot of
WAL. A table that has only a few rows, but is updated very very
frequently, for example. And checkpointing such a database is quick too,
so frequent checkpoints are not a problem. You don't want to end up with
1.5 GB of WAL on a 100 MB database.
I suggest 192MB instead (12 segments). That almost doubles our current
real default, without requiring huge disk space which might surprise
some users.
In practice, checkpoint_segments is impossible to automatically tune
correctly. So let's be conservative.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Feb 3, 2015 at 4:18 PM, Josh Berkus <josh@agliodbs.com> wrote:
That's different from our current checkpoint_segments setting. With
checkpoint_segments, the documented formula for calculating the disk usage
is (2 + checkpoint_completion_target) * checkpoint_segments * 16 MB. That's
a lot less intuitive to set.
Hmm, that's different from what I was thinking. We probably shouldn't
call that max_checkpoint_segments, then. I got confused and thought
you were just trying to decouple the number of segments that it takes
to trigger a checkpoint from the number we keep preallocated.
Wait, what? Because the new setting is an actual soft maximum, we
*shouldn't* call it a maximum? Or are you saying something else?
I am saying that I proposed calling it max_checkpoint_segments because
I thought it was the maximum number of segments between checkpoints.
But it's not.
I wouldn't object to raising it a little bit, but that's way too high.
It's entirely possible to have a small database that generates a lot of
WAL. A table that has only a few rows, but is updated very very
frequently, for example. And checkpointing such a database is quick too,
so frequent checkpoints are not a problem. You don't want to end up with
1.5 GB of WAL on a 100 MB database.
I suggest 192MB instead (12 segments). That almost doubles our current
real default, without requiring huge disk space which might surprise
some users.
In practice, checkpoint_segments is impossible to automatically tune
correctly. So let's be conservative.
We are too often far too conservative about these things. If we make
the default 192MB, it will only ever get tuned in one direction: up.
It is not a bad thing for us to set the settings high enough that once
in a great while someone might find them to be too high rather than
too low.
I find it amazing that anyone here thinks that a user would be OK with
using 192MB of space for WAL, but 384MB would break the bank. The
hard drive in my laptop is 456GB. The point is, with Heikki's work
here, you're only going to use the maximum amount of space if you have
massive write activity. And if you have massive write activity, it's
extremely likely that you will be OK with using a very modest amount
of disk space to have that be fast. Right now, we have to be really
conservative because we're going to use the full allocation all the
time, but this fixes that. I think.
If somebody were to propose limiting the size of the database to
192MB, and requiring a configuration setting to make it larger,
everybody would say that's a terrible idea. Heck, if I were to
propose limiting the database to 19.2GB, and require a configuration
setting to make it larger, everybody would say that's a terrible idea.
But what we actually have is not far off from that. Sure, you can
create a 20GB database with an out-of-the-box configuration, but you'd
better get out your pillow before starting the data load, because with
checkpoint_segments=3 that's going to be fantastically slow. And
you'd better hope that the update rate is pretty low, too, because if
it's anything even slightly interesting you're going to be spewing
checkpoint warnings into the log. So our settings need to *support*
creating a 20GB database out of the box, but it's OK if it performs
absolutely terribly.
I really have a hard time believing that there are many people who are
going to complain about WAL utilization peaking at 1.6GB (my initial
proposal). Your database is probably rapidly expanding, and the WAL
utilization will drop when it stops. And if it isn't rapidly
expanding, because you're doing a ton of updates in place, you'll
probably still be happier to spend a little extra disk space than to
have it be cripplingly slow. And if you're not, then, first, what is
wrong with you, and second, well then you can turn down the setting.
That's why we have settings. I enjoy getting paid to tell people to
increase checkpoint_segments by two orders of magnitude as much as the
next PostgreSQL consultant, but I don't enjoy the fact that people
benchmark the default configuration and get terrible results because
we haven't updated the default value for this parameter since it was
added in 2001.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/04/2015 09:28 AM, Robert Haas wrote:
On Tue, Feb 3, 2015 at 4:18 PM, Josh Berkus <josh@agliodbs.com> wrote:
That's different from our current checkpoint_segments setting. With
checkpoint_segments, the documented formula for calculating the disk usage
is (2 + checkpoint_completion_target) * checkpoint_segments * 16 MB. That's
a lot less intuitive to set.
Hmm, that's different from what I was thinking. We probably shouldn't
call that max_checkpoint_segments, then. I got confused and thought
you were just trying to decouple the number of segments that it takes
to trigger a checkpoint from the number we keep preallocated.
Wait, what? Because the new setting is an actual soft maximum, we
*shouldn't* call it a maximum? Or are you saying something else?
I am saying that I proposed calling it max_checkpoint_segments because
I thought it was the maximum number of segments between checkpoints.
But it's not.
That's good, though, isn't it? Knowing the number of segments between
checkpoints is useful only to postgres experts with experience. What
the patch defines is what most users actually want to know: how much
disk space, total, do I need to allocate?
Let me push "max_wal_size" and "min_wal_size" again as our new parameter
names, because:
* does what it says on the tin
* new user friendly
* encourages people to express it in MB, not segments
* very different from the old name, so people will know it works differently
We are too often far too conservative about these things. If we make
the default 192MB, it will only ever get tuned in one direction: up.
It is not a bad thing for us to set the settings high enough that once
in a great while someone might find them to be too high rather than
too low.
I find it amazing that anyone here thinks that a user would be OK with
using 192MB of space for WAL, but 384MB would break the bank. The
hard drive in my laptop is 456GB. The point is, with Heikki's work
here, you're only going to use the maximum amount of space if you have
massive write activity. And if you have massive write activity, it's
extremely likely that you will be OK with using a very modest amount
of disk space to have that be fast. Right now, we have to be really
conservative because we're going to use the full allocation all the
time, but this fixes that. I think.
Hmmm, I see your point. I spend a lot of time on AWS and in
container-world, where disk space is a lot more constrained. However,
it probably makes more sense to recommend non-default settings for that
environment, since it requires non-default settings anyway.
So, 384MB?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Feb 4, 2015 at 1:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
Let me push "max_wal_size" and "min_wal_size" again as our new parameter
names, because:
* does what it says on the tin
* new user friendly
* encourages people to express it in MB, not segments
* very different from the old name, so people will know it works differently
That's not bad. If we added a hard WAL limit in a future release, how
would that fit into this naming scheme?
We are too often far too conservative about these things. If we make
the default 192MB, it will only ever get tuned in one direction: up.
It is not a bad thing for us to set the settings high enough that once
in a great while someone might find them to be too high rather than
too low.
I find it amazing that anyone here thinks that a user would be OK with
using 192MB of space for WAL, but 384MB would break the bank. The
hard drive in my laptop is 456GB. The point is, with Heikki's work
here, you're only going to use the maximum amount of space if you have
massive write activity. And if you have massive write activity, it's
extremely likely that you will be OK with using a very modest amount
of disk space to have that be fast. Right now, we have to be really
conservative because we're going to use the full allocation all the
time, but this fixes that. I think.
Hmmm, I see your point. I spend a lot of time on AWS and in
container-world, where disk space is a lot more constrained. However,
it probably makes more sense to recommend non-default settings for that
environment, since it requires non-default settings anyway.
So, 384MB?
That's certainly better, but I think we should go further. Again,
you're not committed to using this space all the time, and if you are
using it you must have a lot of write activity, which means you are
not running on a tin can and a string. If you have a little tiny
database, say 100MB, running on a little-tiny Amazon instance,
handling a small number of transactions, you're going to stay close to
wal_min_size anyway. Right?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/04/2015 12:06 PM, Robert Haas wrote:
On Wed, Feb 4, 2015 at 1:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
Let me push "max_wal_size" and "min_wal_size" again as our new parameter
names, because:
* does what it says on the tin
* new user friendly
* encourages people to express it in MB, not segments
* very different from the old name, so people will know it works differently
That's not bad. If we added a hard WAL limit in a future release, how
would that fit into this naming scheme?
Well, first, nobody's at present proposing a patch to add a hard limit,
so I'm reluctant to choose non-obvious names to avoid conflict with a
feature nobody may ever write. There are a number of reasons a hard limit
would be difficult and/or undesirable.
If we did add one, I'd suggest calling it "wal_size_limit" or something
similar. However, we're most likely to only implement the limit for
archives, which means that it might actually be called
"archive_buffer_limit" or something more to the point.
That's certainly better, but I think we should go further. Again,
you're not committed to using this space all the time, and if you are
using it you must have a lot of write activity, which means you are
not running on a tin can and a string. If you have a little tiny
database, say 100MB, running on a little-tiny Amazon instance,
handling a small number of transactions, you're going to stay close to
wal_min_size anyway. Right?
Well, we can test that.
So what's your proposed size?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2/4/15 3:06 PM, Robert Haas wrote:
Hmmm, I see your point. I spend a lot of time on AWS and in
container-world, where disk space is a lot more constrained. However,
it probably makes more sense to recommend non-default settings for that
environment, since it requires non-default settings anyway.
So, 384MB?
That's certainly better, but I think we should go further. Again,
you're not committed to using this space all the time, and if you are
using it you must have a lot of write activity, which means you are
not running on a tin can and a string. If you have a little tiny
database, say 100MB, running on a little-tiny Amazon instance,
handling a small number of transactions, you're going to stay close to
wal_min_size anyway. Right?
The main exception I can think of is when using dump/restore to upgrade
instead of pg_upgrade. This would generate a lot of WAL for what could
otherwise be a low-traffic database.
--
- David Steele
david@pgmasters.net
On 2/4/15 6:16 PM, David Steele wrote:
On 2/4/15 3:06 PM, Robert Haas wrote:
Hmmm, I see your point. I spend a lot of time on AWS and in
container-world, where disk space is a lot more constrained. However,
it probably makes more sense to recommend non-default settings for that
environment, since it requires non-default settings anyway.
So, 384MB?
That's certainly better, but I think we should go further. Again,
you're not committed to using this space all the time, and if you are
using it you must have a lot of write activity, which means you are
not running on a tin can and a string. If you have a little tiny
database, say 100MB, running on a little-tiny Amazon instance,
handling a small number of transactions, you're going to stay close to
wal_min_size anyway. Right?
The main exception I can think of is when using dump/restore to upgrade
instead of pg_upgrade. This would generate a lot of WAL for what could
otherwise be a low-traffic database.
But you'd still want to use that extra WAL space so you're not
checkpointing every 3 seconds. Really I can't see this becoming an issue
unless you're about to run out of disk space.
Is there a defined way to find out how much space we have left on the
disk that's hosting WAL? If so we could curtail WAL size if we're close
to running out of room. (But, honestly, I think we should just set this
to 1-2GB and be done with it).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
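For reference, here is a minimal sketch of how one might check the free space on the filesystem holding pg_xlog, using statvfs(). The path is illustrative only, and this is not something the proposed patch does; it merely shows the kind of check Jim is asking about.

/* Sketch: query free space on the filesystem that holds pg_xlog.
 * The path below is illustrative; a real check would use the actual
 * data directory the server was started with. */
#include <stdio.h>
#include <sys/statvfs.h>

int
main(void)
{
    struct statvfs fs;
    const char *wal_dir = "/var/lib/postgresql/data/pg_xlog";

    if (statvfs(wal_dir, &fs) != 0)
    {
        perror("statvfs");
        return 1;
    }

    /* f_bavail: blocks available to unprivileged users; f_frsize: fragment size */
    unsigned long long avail = (unsigned long long) fs.f_bavail * fs.f_frsize;

    printf("space available for WAL on %s: %llu MB\n",
           wal_dir, avail / (1024 * 1024));
    return 0;
}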
On Thu, Feb 5, 2015 at 3:11 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 02/04/2015 12:06 PM, Robert Haas wrote:
On Wed, Feb 4, 2015 at 1:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
Let me push "max_wal_size" and "min_wal_size" again as our new
parameter
names, because:
* does what it says on the tin
* new user friendly
* encourages people to express it in MB, not segments
* very different from the old name, so people will know it works
differently
That's not bad. If we added a hard WAL limit in a future release, how
would that fit into this naming scheme?
Well, first, nobody's at present proposing a patch to add a hard limit,
so I'm reluctant to choose non-obvious names to avoid conflict with a
feature nobody may ever write. There's a number of reasons a hard limit
would be difficult and/or undesirable.
If we did add one, I'd suggest calling it "wal_size_limit" or something
similar.
I think both the names (max_wal_size and wal_size_limit) seem to
indicate the same thing. A few more suggestions:
typical_wal_size, wal_size_soft_limit?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Missed adding "pgsql-hackers" group while replying.
Regards,
Venkata Balaji N
On Thu, Feb 5, 2015 at 12:48 PM, Venkata Balaji N <nag1010@gmail.com> wrote:
On Fri, Jan 30, 2015 at 7:58 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/30/2015 04:48 AM, Venkata Balaji N wrote:
I performed a series of tests for this patch and would like to share the
results. My comments are in-line.
Thanks for the testing!
*Test 1 :*
In this test, i see removed+recycled segments = 3 (except for the first 3
checkpoint cycles) and it has been steady throughout until the INSERT
operation completed.
Actual calculation of CheckPointSegments = 3.2 (it is getting rounded down to 3).
pg_xlog size is 128M and has increased to 160M max during the INSERT
operation.
shared_buffers = 128M
checkpoint_wal_size = 128M
min_recycle_wal_size = 80M
checkpoint_timeout = 5min
Hmm, did I understand correctly that pg_xlog peaked at 160MB, but most of
the time stayed at 128 MB? That sounds like it's working as designed;
checkpoint_wal_size is not a hard limit after all.
Yes, the pg_xlog directory size peaked to 160MB at times and most of the
time stayed at 128MB. I made an observation in another round of the latest
test; my observations are below.
b) Are the two GUCs, checkpoint_wal_size, and min_recycle_wal_size,
intuitive to set?
During my tests, I did not observe the significance of the
min_recycle_wal_size parameter yet. Of course, I had sufficient disk space
for pg_xlog.
I would like to understand more about the "min_recycle_wal_size" parameter.
In theory, I only understand from the note in the patch that if the disk
space usage falls below a certain threshold, min_recycle_wal_size number of
WAL files will be removed to accommodate future pg_xlog segments. I will
try to test this out. Please let me know if there is any specific test to
understand min_recycle_wal_size behaviour.
min_recycle_wal_size comes into play when you have only light load, so
that checkpoints are triggered by checkpoint_timeout rather than
checkpoint_wal_size. In that scenario, the WAL usage will shrink down to
min_recycle_wal_size, but not below that. Did that explanation help? Can
you suggest changes to the docs to make it more clear?
Thanks for the explanation. I see the below note from the patch; I think
it should also say that the minimum WAL size on disk will be
"min_recycle_wal_size" during light-load and idle situations.
I think the name of the parameter "min_recycle_wal_size" implies
something slightly different. It does not give an impression that it is the
minimum WAL size on disk during light loads. I agree with Josh Berkus
that the parameter (min_recycle_wal_size) name must be something like
"min_wal_size", which makes more sense.
+    <varname>*wal_recycle_min_size*</> puts a minimum on the amount of WAL files
+    recycled for future usage; that much WAL is always recycled for future use,
+    even if the system is idle and the WAL usage estimate suggests that little
+    WAL is needed.
+   </para>
Note: in wal.sgml, the parameter name is mentioned as
"wal_recycle_min_size". That must be changed to min_recycle_wal_size.
Another round of test: I raised checkpoint_wal_size to 10512 MB, which is
about 10GB, and kept min_recycle_wal_size at 128 MB (with checkpoint_timeout
= 5min). The checkpoints timed out and started recycling about 2 GB of
segments regularly; below are the checkpoint logs -
I started loading data of size more than 100GB.
TimeStamp=2015-02-05 10:22:40.323 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint complete: wrote 83998 buffers (64.1%); 0 transaction log file(s)
added, 0 removed, 135 recycled; write=95.687 s, sync=25.845 s,
total=121.866 s; sync files=18, longest=10.306 s, average=1.435 s;
distance=2271524 KB, estimate=2300497 KB
TimeStamp=2015-02-05 10:25:38.875 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint starting: time
TimeStamp=2015-02-05 10:27:50.955 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint complete: wrote 83216 buffers (63.5%); 0 transaction log file(s)
added, 0 removed, 146 recycled; write=96.951 s, sync=34.814 s,
total=132.079 s; sync files=18, longest=9.535 s, average=1.934 s;
distance=2229416 KB, estimate=2293388 KB
TimeStamp=2015-02-05 10:30:38.786 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint starting: time
TimeStamp=2015-02-05 10:32:20.332 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint complete: wrote 82409 buffers (62.9%); 0 transaction log file(s)
added, 0 removed, 131 recycled; write=94.712 s, sync=6.516 s, total=101.545
s; sync files=18, longest=2.645 s, average=0.362 s; distance=2131805 KB,
estimate=2277230 KB
TimeStamp=2015-02-05 10:35:38.788 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint starting: time
TimeStamp=2015-02-05 10:37:35.883 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint complete: wrote 87821 buffers (67.0%); 0 transaction log file(s)
added, 0 removed, 134 recycled; write=99.461 s, sync=17.058 s,
total=117.094 s; sync files=19, longest=9.022 s, average=0.897 s;
distance=2339374 KB, estimate=2339374 KB
TimeStamp=2015-02-05 10:40:38.975 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint starting: time
TimeStamp=2015-02-05 10:42:46.789 GMT-10 DB= SID=54d2af22.65b4 User= LOG:
checkpoint complete: wrote 82975 buffers (63.3%); 0 transaction log file(s)
added, 0 removed, 146 recycled; write=94.458 s, sync=33.025 s,
total=127.814 s; sync files=19, longest=5.975 s, average=1.738 s;
distance=2298657 KB, estimate=2335302 KB
My observations are:
1. As per your explanation, I also see pg_xlog size is not getting reduced
to "min_recycle_wal_size" (128M) after the load operation is complete.
I did a manual checkpoint and also restarted the database; still,
pg_xlog size stays at 7 GB. Am I missing something here?
2. Does checkpoint_wal_size have any upper limit?
Please share your thoughts.
Regards,
Venkata Balaji N
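As a sanity check on those logs, the reported checkpoint "distance" divided by the 16 MB segment size lines up with the recycled-segment counts. A small sketch of that arithmetic, with the values copied from the log above:

/* Sketch: relate the "distance" values in the checkpoint log above to the
 * number of 16 MB WAL segments consumed per checkpoint cycle. */
#include <stdio.h>

int
main(void)
{
    const double seg_kb = 16 * 1024;    /* one WAL segment, in KB */
    const double distance_kb[] = {2271524, 2229416, 2131805, 2339374, 2298657};

    for (int i = 0; i < 5; i++)
        printf("distance %.0f KB ~ %.0f segments\n",
               distance_kb[i], distance_kb[i] / seg_kb);

    /* prints roughly 139, 136, 130, 143, 140 segments per cycle, in the same
     * ballpark as the 131-146 "recycled" counts reported in the log */
    return 0;
}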
On Wed, Feb 4, 2015 at 4:41 PM, Josh Berkus <josh@agliodbs.com> wrote:
That's certainly better, but I think we should go further. Again,
you're not committed to using this space all the time, and if you are
using it you must have a lot of write activity, which means you are
not running on a tin can and a string. If you have a little tiny
database, say 100MB, running on a little-tiny Amazon instance,
handling a small number of transactions, you're going to stay close to
wal_min_size anyway. Right?
Well, we can test that.
So what's your proposed size?
I previously proposed 100 segments, or 1.6GB. If that seems too
large, how about 64 segments, or 1.024GB? I think there will be few
people who can't tolerate a gigabyte of xlog under peak load, and an
awful lot who will benefit from it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-02-05 09:42:37 -0500, Robert Haas wrote:
I previously proposed 100 segments, or 1.6GB. If that seems too
large, how about 64 segments, or 1.024GB? I think there will be few
people who can't tolerate a gigabyte of xlog under peak load, and an
awful lot who will benefit from it.
It'd be quite easier to go there if we'd shrink back to the min_size
after a while, after having peaked above it. IIUC the patch doesn't do
that?
Admittedly it's not easy to come up with an algorithm that doesn't cause
superfluous file removals. Initializing WAL files isn't cheap.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/05/2015 04:47 PM, Andres Freund wrote:
On 2015-02-05 09:42:37 -0500, Robert Haas wrote:
I previously proposed 100 segments, or 1.6GB. If that seems too
large, how about 64 segments, or 1.024GB? I think there will be few
people who can't tolerate a gigabyte of xlog under peak load, and an
awful lot who will benefit from it.
It'd be quite easier to go there if we'd shrink back to the min_size
after a while, after having peaked above it. IIUC the patch doesn't do
that?
It doesn't actively go and remove files once they've already been
recycled, but if the system stays relatively idle for several
checkpoints, the WAL usage will shrink down again. That's the core idea
of the patch.
If the system stays completely or almost completely idle, that won't
happen though, because then it will never switch to a new segment so
none of the segments become old so that they could be removed.
- Heikki
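To illustrate the shrinking behaviour Heikki describes, here is a rough sketch of the kind of per-cycle estimate the patch keeps: peaks raise it immediately, quiet cycles let it decay, so the number of recycled segments shrinks again over several checkpoints. The 0.90/0.10 smoothing weights below are made up for illustration, not taken from the patch.

/* Sketch: a moving-average estimate of WAL consumed per checkpoint cycle.
 * Peaks raise the estimate at once; quiet cycles let it decay slowly.
 * The 0.90/0.10 weights are illustrative only. */
#include <stdio.h>

static double distance_estimate = 0;    /* bytes written per checkpoint cycle */

static void
update_estimate(double actual)
{
    if (actual > distance_estimate)
        distance_estimate = actual;     /* react to peaks immediately */
    else
        distance_estimate = 0.90 * distance_estimate + 0.10 * actual;
}

int
main(void)
{
    /* one busy cycle followed by several nearly idle ones */
    const double cycles[] = {2.3e9, 0.1e9, 0.1e9, 0.1e9, 0.1e9, 0.1e9};

    for (int i = 0; i < 6; i++)
    {
        update_estimate(cycles[i]);
        printf("cycle %d: estimate %.2f GB\n", i + 1, distance_estimate / 1e9);
    }
    return 0;
}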
On 02/04/2015 04:16 PM, David Steele wrote:
On 2/4/15 3:06 PM, Robert Haas wrote:
Hmmm, I see your point. I spend a lot of time on AWS and in
container-world, where disk space is a lot more constrained. However,
it probably makes more sense to recommend non-default settings for that
environment, since it requires non-default settings anyway.
So, 384MB?
That's certainly better, but I think we should go further. Again,
you're not committed to using this space all the time, and if you are
using it you must have a lot of write activity, which means you are
not running on a tin can and a string. If you have a little tiny
database, say 100MB, running on a little-tiny Amazon instance,
handling a small number of transactions, you're going to stay close to
wal_min_size anyway. Right?
The main exception I can think of is when using dump/restore to upgrade
instead of pg_upgrade. This would generate a lot of WAL for what could
otherwise be a low-traffic database.
Except that, when setting up servers for customers, one thing I pretty
much always do for them is temporarily increase checkpoint_segments for
the initial data load. So having Postgres do this automatically would
be a feature, not a bug.
I say we go with ~~ 1GB. That's an 8X increase over the current default
size for the maximum.
Default of 4 for min_wal_size?
On 02/04/2015 07:37 PM, Amit Kapila wrote:
On Thu, Feb 5, 2015 at 3:11 AM, Josh Berkus <josh@agliodbs.com> wrote:
If we did add one, I'd suggest calling it "wal_size_limit" or something
similar.
I think both the names (max_wal_size and wal_size_limit) seem to
indicate the same thing. A few more suggestions:
typical_wal_size, wal_size_soft_limit?
Again, you're suggesting more complicated (and difficult to translate,
and for that matter misleading) names in order to work around a future
feature which nobody is currently working on, and we may never have.
Let's keep clear and simple parameter names which most people can
understand, instead of making things complicated for the sake of complexity.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Thu, Feb 5, 2015 at 2:11 PM, Josh Berkus <josh@agliodbs.com> wrote:
Except that, when setting up servers for customers, one thing I pretty
much always do for them is temporarily increase checkpoint_segments for
the initial data load. So having Postgres do this automatically would
be a feature, not a bug.
Right!
I say we go with ~~ 1GB. That's an 8X increase over the current default
size for the maximum.
Sounds great.
Default of 4 for min_wal_size?
I assume you mean 4 segments; why not 3 as currently? As long as the
system has the latitude to ratchet it up when needed, there seems to
be little advantage to raising the minimum. Of course I guess there
must be some advantage or Heikki wouldn't have made it configurable,
but I'd err on the side of keeping this one small. Hopefully the
system that automatically adjusts this is really smart, and a large
min_wal_size is superfluous for most people.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/05/2015 01:28 PM, Robert Haas wrote:
On Thu, Feb 5, 2015 at 2:11 PM, Josh Berkus <josh@agliodbs.com> wrote:
Except that, when setting up servers for customers, one thing I pretty
much always do for them is temporarily increase checkpoint_segments for
the initial data load. So having Postgres do this automatically would
be a feature, not a bug.
Right!
I say we go with ~~ 1GB. That's an 8X increase over the current default
size for the maximum.
Sounds great.
Default of 4 for min_wal_size?
I assume you mean 4 segments; why not 3 as currently? As long as the
system has the latitude to ratchet it up when needed, there seems to
be little advantage to raising the minimum. Of course I guess there
must be some advantage or Heikki wouldn't have made it configurable,
but I'd err on the side of keeping this one small. Hopefully the
system that automatically adjusts this is really smart, and a large
min_wal_size is superfluous for most people.
Keep in mind that the current default is actually 7, not three (3*2+1). So 3
would be a significant decrease. However, I don't feel strongly about
it either way. I think that there is probably a minimum reasonable
value > 1, but I'm not sure what it is.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 02/05/2015 11:28 PM, Robert Haas wrote:
On Thu, Feb 5, 2015 at 2:11 PM, Josh Berkus <josh@agliodbs.com> wrote:
Default of 4 for min_wal_size?
I assume you mean 4 segments; why not 3 as currently? As long as the
system has the latitude to ratchet it up when needed, there seems to
be little advantage to raising the minimum. Of course I guess there
must be some advantage or Heikki wouldn't have made it configurable,
but I'd err on the side of keeping this one small. Hopefully the
system that automatically adjusts this is really smart, and a large
min_wal_size is superfluous for most people.
There are a few reasons for making the minimum configurable:
1. Creating new segments when you need them is not free, so if you have
a workload with occasional very large spikes, you might want to prepare
for them. The auto-tuning will accommodate the peak usage, but it's
a moving average so if the peaks are infrequent enough, it will shrink
the size down between the spikes.
2. To avoid running out of disk space on write to WAL (which leads to a
PANIC). In particular, if you have the WAL on the same filesystem as the
data, pre-reserving all the space required for WAL makes it much more
likely that when you do run out of disk space, you run out while writing
regular data, not WAL.
3. Unforeseen issues with the auto-tuning. It might not suit everyone,
so it's nice that you can still get the old behaviour by setting min=max.
Actually, perhaps we should have a boolean setting that just implies
min=max, instead of having a configurable minimum? That would cover all
of those reasons pretty well. So we would have a "max_wal_size" setting,
and a boolean "preallocate_all_wal = on | off". Would anyone care for
the flexibility of setting a minimum that's different from the maximum?
- Heikki
On Thu, Feb 5, 2015 at 4:42 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Actually, perhaps we should have a boolean setting that just implies
min=max, instead of having a configurable minimum? That would cover all of
those reasons pretty well. So we would have a "max_wal_size" setting, and a
boolean "preallocate_all_wal = on | off". Would anyone care for the
flexibility of setting a minimum that's different from the maximum?
I like the way you have it now better. If we knew for certain that
there were no advantage in configuring a value between 0 and the
maximum, that would be one thing, but we don't and can't know that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Default of 4 for min_wal_size?
I assume you mean 4 segments; why not 3 as currently? As long as the
system has the latitude to ratchet it up when needed, there seems to
be little advantage to raising the minimum. Of course I guess there
must be some advantage or Heikki wouldn't have made it configurable,
but I'd err on the side of keeping this one small. Hopefully the
system that automatically adjusts this is really smart, and a large
min_wal_size is superfluous for most people.
Keep in mind that the current default is actually 7, not three (3*2+1). So 3
would be a significant decrease. However, I don't feel strongly about
it either way. I think that there is probably a minimum reasonable
value > 1, but I'm not sure what it is.
Good point. OK, 4 works for me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/05/2015 01:42 PM, Heikki Linnakangas wrote:
There are a few reasons for making the minimum configurable:
Any thoughts on what the default minimum should be, if the default max
is 1.1GB/64?
1. Creating new segments when you need them is not free, so if you have
a workload with occasional very large spikes, you might want to prepare
for them. The auto-tuning will accommodate the peak usage, but it's
a moving average so if the peaks are infrequent enough, it will shrink
the size down between the spikes.
2. To avoid running out of disk space on write to WAL (which leads to a
PANIC). In particular, if you have the WAL on the same filesystem as the
data, pre-reserving all the space required for WAL makes it much more
likely that when you do run out of disk space, you run out while writing
regular data, not WAL.
3. Unforeseen issues with the auto-tuning. It might not suit everyone,
so it's nice that you can still get the old behaviour by setting min=max.
Actually, perhaps we should have a boolean setting that just implies
min=max, instead of having a configurable minimum? That would cover all
of those reasons pretty well. So we would have a "max_wal_size" setting,
and a boolean "preallocate_all_wal = on | off". Would anyone care for
the flexibility of setting a minimum that's different from the maximum?
I do, actually. Here's the case I want it for:
I have a web application which gets all of its new data as uncoordinated
batch updates from customers. Since it's possible for me to receive
several batch updates at once, I set max_wal_size to 16GB, roughly the
size of 8 batch updates. But I don't want the WAL that big all the time
because it slows down backup snapshots. So I set min_wal_size to 2GB,
roughly the size of one batch update.
That's an idiosyncratic case, but I can imagine more of them out there.
I wouldn't be opposed to min_wal_size = -1 meaning "same as
max_wal_size" though.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2/5/15 4:53 PM, Josh Berkus wrote:
Actually, perhaps we should have a boolean setting that just implies
min=max, instead of having a configurable minimum? That would cover all
of those reasons pretty well. So we would have a "max_wal_size" setting,
and a boolean "preallocate_all_wal = on | off". Would anyone care for
the flexibility of setting a minimum that's different from the maximum?
I do, actually. Here's the case I want it for:
I have a web application which gets all of its new data as uncoordinated
batch updates from customers. Since it's possible for me to receive
several batch updates at once, I set max_wal_size to 16GB, roughly the
size of 8 batch updates. But I don't want the WAL that big all the time
because it slows down backup snapshots. So I set min_wal_size to 2GB,
roughly the size of one batch update.
That's an idiosyncratic case, but I can imagine more of them out there.
I wouldn't be opposed to min_wal_size = -1 meaning "same as
max_wal_size" though.
+1 for min_wal_size. Like Josh, I can think of instances where this
would be good.
--
- David Steele
david@pgmasters.net
On 02/04/2015 11:41 PM, Josh Berkus wrote:
On 02/04/2015 12:06 PM, Robert Haas wrote:
On Wed, Feb 4, 2015 at 1:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
Let me push "max_wal_size" and "min_wal_size" again as our new parameter
names, because:
* does what it says on the tin
* new user friendly
* encourages people to express it in MB, not segments
* very different from the old name, so people will know it works differently
That's not bad. If we added a hard WAL limit in a future release, how
would that fit into this naming scheme?
Well, first, nobody's at present proposing a patch to add a hard limit,
so I'm reluctant to choose non-obvious names to avoid conflict with a
feature nobody may ever write. There's a number of reasons a hard limit
would be difficult and/or undesirable.
If we did add one, I'd suggest calling it "wal_size_limit" or something
similar. However, we're most likely to only implement the limit for
archives, which means that it might actually be called
"archive_buffer_limit" or something more to the point.
Ok, I don't hear any loud objections to min_wal_size and max_wal_size,
so let's go with that then.
Attached is a new version of this. It now comes in four patches. The
first three are just GUC-related preliminary work, the first of which I
posted on a separate thread today.
- Heikki
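As a quick illustration of how the fourth patch below derives the internal checkpoint trigger from max_wal_size (see CalculateCheckpointSegments in patch 4), here is the same arithmetic for the defaults. With max_wal_size = 8 segments (128 MB) and checkpoint_completion_target = 0.5, it yields the CheckPointSegments = 3.2, rounded down to 3, that Venkata observed in his tests.

/* Sketch of the arithmetic in CalculateCheckpointSegments() from patch 4:
 * WAL is assumed to be kept for two checkpoint cycles, plus the fraction
 * written while the checkpoint itself is in progress. */
#include <stdio.h>

int
main(void)
{
    int     max_wal_size = 8;           /* in 16 MB segments; 8 = 128 MB default */
    double  completion_target = 0.5;    /* checkpoint_completion_target default */
    int     checkpoint_segments;

    checkpoint_segments = (int) (max_wal_size / (2.0 + completion_target));
    if (checkpoint_segments < 1)
        checkpoint_segments = 1;

    /* 8 / 2.5 = 3.2, rounded down: a checkpoint is triggered every 3 segments */
    printf("max_wal_size = %d segments -> CheckPointSegments = %d\n",
           max_wal_size, checkpoint_segments);
    return 0;
}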
Attachments:
0001-Refactor-unit-conversions-code-in-guc.c.patch
From a053c61c333687224d33a18a2a299c4dc2eb6bfe Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 13 Feb 2015 15:24:50 +0200
Subject: [PATCH 1/4] Refactor unit conversions code in guc.c.
Replace the if-switch-case constructs with two conversion tables,
containing all the supported conversions between human-readable unit
strings and the base units used in GUC variables. This makes the code
easier to read, and makes adding new units simpler.
---
src/backend/utils/misc/guc.c | 425 +++++++++++++++++++------------------------
src/include/utils/guc.h | 2 +
2 files changed, 188 insertions(+), 239 deletions(-)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9572777..59e25af 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -97,20 +97,6 @@
#define CONFIG_EXEC_PARAMS_NEW "global/config_exec_params.new"
#endif
-#define KB_PER_MB (1024)
-#define KB_PER_GB (1024*1024)
-#define KB_PER_TB (1024*1024*1024)
-
-#define MS_PER_S 1000
-#define S_PER_MIN 60
-#define MS_PER_MIN (1000 * 60)
-#define MIN_PER_H 60
-#define S_PER_H (60 * 60)
-#define MS_PER_H (1000 * 60 * 60)
-#define MIN_PER_D (60 * 24)
-#define S_PER_D (60 * 60 * 24)
-#define MS_PER_D (1000 * 60 * 60 * 24)
-
/*
* Precision with which REAL type guc values are to be printed for GUC
* serialization.
@@ -666,6 +652,88 @@ const char *const config_type_names[] =
/* PGC_ENUM */ "enum"
};
+/*
+ * Unit conversions tables.
+ *
+ * There are two tables, one for memory units, and another for time units.
+ * For each supported conversion from one unit to another, we have an entry
+ * in the conversion table.
+ *
+ * To keep things simple, and to avoid intermediate-value overflows,
+ * conversions are never chained. There needs to be a direct conversion
+ * between all units.
+ *
+ * The conversions from each base unit must be kept in order from greatest
+ * to smallest unit; convert_from_base_unit() relies on that. (The order of
+ * the base units does not matter.)
+ */
+#define MAX_UNIT_LEN 3 /* length of longest recognized unit string */
+
+typedef struct
+{
+ char unit[MAX_UNIT_LEN + 1]; /* unit, as a string, like "kB" or "min" */
+ int base_unit; /* GUC_UNIT_XXX */
+ int multiplier; /* If positive, multiply the value with this for
+ * unit -> base_unit conversion. If negative,
+ * divide (with the absolute value) */
+} unit_conversion;
+
+/* Ensure that the constants in the tables don't overflow or underflow */
+#if BLCKSZ < 1024 || BLCKSZ > (1024*1024)
+#error BLCKSZ must be between 1KB and 1MB
+#endif
+#if XLOG_BLCKSZ < 1024 || XLOG_BLCKSZ > (1024*1024)
+#error XLOG_BLCKSZ must be between 1KB and 1MB
+#endif
+
+static const char *memory_units_hint =
+ gettext_noop("Valid units for this parameter are \"kB\", \"MB\", \"GB\", and \"TB\".");
+
+static const unit_conversion memory_unit_conversion_table[] =
+{
+ { "TB", GUC_UNIT_KB, 1024*1024*1024 },
+ { "GB", GUC_UNIT_KB, 1024*1024 },
+ { "MB", GUC_UNIT_KB, 1024 },
+ { "kB", GUC_UNIT_KB, 1 },
+
+ { "TB", GUC_UNIT_BLOCKS, (1024*1024*1024) / (BLCKSZ / 1024) },
+ { "GB", GUC_UNIT_BLOCKS, (1024*1024) / (BLCKSZ / 1024) },
+ { "MB", GUC_UNIT_BLOCKS, 1024 / (BLCKSZ / 1024) },
+ { "kB", GUC_UNIT_BLOCKS, -(BLCKSZ / 1024) },
+
+ { "TB", GUC_UNIT_XBLOCKS, (1024*1024*1024) / (XLOG_BLCKSZ / 1024) },
+ { "GB", GUC_UNIT_XBLOCKS, (1024*1024) / (XLOG_BLCKSZ / 1024) },
+ { "MB", GUC_UNIT_XBLOCKS, 1024 / (XLOG_BLCKSZ / 1024) },
+ { "kB", GUC_UNIT_XBLOCKS, -(XLOG_BLCKSZ / 1024) },
+
+ { "" } /* end of table marker */
+};
+
+static const char *time_units_hint =
+ gettext_noop("Valid units for this parameter are \"ms\", \"s\", \"min\", \"h\", and \"d\".");
+
+static const unit_conversion time_unit_conversion_table[] =
+{
+ { "d", GUC_UNIT_MS, 1000 * 60 * 60 * 24 },
+ { "h", GUC_UNIT_MS, 1000 * 60 * 60 },
+ { "min", GUC_UNIT_MS, 1000 * 60},
+ { "s", GUC_UNIT_MS, 1000 },
+ { "ms", GUC_UNIT_MS, 1 },
+
+ { "d", GUC_UNIT_S, 60 * 60 * 24 },
+ { "h", GUC_UNIT_S, 60 * 60 },
+ { "min", GUC_UNIT_S, 60 },
+ { "s", GUC_UNIT_S, 1 },
+ { "ms", GUC_UNIT_S, -1000 },
+
+ { "d", GUC_UNIT_MIN, 60 * 24 },
+ { "h", GUC_UNIT_MIN, 60 },
+ { "min", GUC_UNIT_MIN, 1 },
+ { "s", GUC_UNIT_MIN, -60 },
+ { "ms", GUC_UNIT_MIN, -1000 * 60 },
+
+ { "" } /* end of table marker */
+};
/*
* Contents of GUC tables
@@ -5018,6 +5086,85 @@ ReportGUCOption(struct config_generic * record)
}
/*
+ * Convert a value from one of the human-friendly units ("kB", "min" etc.)
+ * to a given base unit. 'value' and 'unit' are the input value and unit to
+ * convert from. The value after conversion to 'base_unit' is stored in
+ * *base_value.
+ *
+ * Returns true on success, or false if the input unit is not recognized.
+ */
+static bool
+convert_to_base_unit(int64 value, const char *unit,
+ int base_unit, int64 *base_value)
+{
+ const unit_conversion *table;
+ int i;
+
+ if (base_unit & GUC_UNIT_MEMORY)
+ table = memory_unit_conversion_table;
+ else
+ table = time_unit_conversion_table;
+
+ for (i = 0; *table[i].unit; i++)
+ {
+ if (base_unit == table[i].base_unit &&
+ strcmp(unit, table[i].unit) == 0)
+ {
+ if (table[i].multiplier < 0)
+ *base_value = value / (-table[i].multiplier);
+ else
+ *base_value = value * table[i].multiplier;
+ return true;
+ }
+ }
+ return false;
+}
+
+/*
+ * Convert a value in some base unit to a human-friendly unit.
+ * The output unit is chosen so that it's the greatest unit that can represent
+ * the value without loss. For example, if the base unit is GUC_UNIT_KB, 1024
+ * converted to 1 MB, but 1025 is represented as 1025 kB.
+ */
+static void
+convert_from_base_unit(int64 base_value, int base_unit,
+ int64 *value, const char **unit)
+{
+ const unit_conversion *table;
+ int i;
+
+ *unit = NULL;
+
+ if (base_unit & GUC_UNIT_MEMORY)
+ table = memory_unit_conversion_table;
+ else
+ table = time_unit_conversion_table;
+
+ for (i = 0; *table[i].unit; i++)
+ {
+ if (base_unit == table[i].base_unit)
+ {
+ /* accept the first conversion that divides the value evenly */
+ if (table[i].multiplier < 0)
+ {
+ *value = base_value * (-table[i].multiplier);
+ *unit = table[i].unit;
+ break;
+ }
+ else if (base_value % table[i].multiplier == 0)
+ {
+ *value = base_value / table[i].multiplier;
+ *unit = table[i].unit;
+ break;
+ }
+ }
+ }
+
+ Assert(*unit != NULL);
+}
+
+
+/*
* Try to parse value as an integer. The accepted formats are the
* usual decimal, octal, or hexadecimal formats, optionally followed by
* a unit name if "flags" indicates a unit is allowed.
@@ -5060,170 +5207,36 @@ parse_int(const char *value, int *result, int flags, const char **hintmsg)
/* Handle possible unit */
if (*endptr != '\0')
{
- /*
- * Note: the multiple-switch coding technique here is a bit tedious,
- * but seems necessary to avoid intermediate-value overflows.
- */
- if (flags & GUC_UNIT_MEMORY)
- {
- /* Set hint for use if no match or trailing garbage */
- if (hintmsg)
- *hintmsg = gettext_noop("Valid units for this parameter are \"kB\", \"MB\", \"GB\", and \"TB\".");
+ char unit[MAX_UNIT_LEN + 1];
+ int unitlen;
+ bool converted = false;
-#if BLCKSZ < 1024 || BLCKSZ > (1024*1024)
-#error BLCKSZ must be between 1KB and 1MB
-#endif
-#if XLOG_BLCKSZ < 1024 || XLOG_BLCKSZ > (1024*1024)
-#error XLOG_BLCKSZ must be between 1KB and 1MB
-#endif
+ if ((flags & GUC_UNIT) == 0)
+ return false; /* this setting does not accept a unit */
- if (strncmp(endptr, "kB", 2) == 0)
- {
- endptr += 2;
- switch (flags & GUC_UNIT_MEMORY)
- {
- case GUC_UNIT_BLOCKS:
- val /= (BLCKSZ / 1024);
- break;
- case GUC_UNIT_XBLOCKS:
- val /= (XLOG_BLCKSZ / 1024);
- break;
- }
- }
- else if (strncmp(endptr, "MB", 2) == 0)
- {
- endptr += 2;
- switch (flags & GUC_UNIT_MEMORY)
- {
- case GUC_UNIT_KB:
- val *= KB_PER_MB;
- break;
- case GUC_UNIT_BLOCKS:
- val *= KB_PER_MB / (BLCKSZ / 1024);
- break;
- case GUC_UNIT_XBLOCKS:
- val *= KB_PER_MB / (XLOG_BLCKSZ / 1024);
- break;
- }
- }
- else if (strncmp(endptr, "GB", 2) == 0)
- {
- endptr += 2;
- switch (flags & GUC_UNIT_MEMORY)
- {
- case GUC_UNIT_KB:
- val *= KB_PER_GB;
- break;
- case GUC_UNIT_BLOCKS:
- val *= KB_PER_GB / (BLCKSZ / 1024);
- break;
- case GUC_UNIT_XBLOCKS:
- val *= KB_PER_GB / (XLOG_BLCKSZ / 1024);
- break;
- }
- }
- else if (strncmp(endptr, "TB", 2) == 0)
- {
- endptr += 2;
- switch (flags & GUC_UNIT_MEMORY)
- {
- case GUC_UNIT_KB:
- val *= KB_PER_TB;
- break;
- case GUC_UNIT_BLOCKS:
- val *= KB_PER_TB / (BLCKSZ / 1024);
- break;
- case GUC_UNIT_XBLOCKS:
- val *= KB_PER_TB / (XLOG_BLCKSZ / 1024);
- break;
- }
- }
- }
- else if (flags & GUC_UNIT_TIME)
- {
- /* Set hint for use if no match or trailing garbage */
- if (hintmsg)
- *hintmsg = gettext_noop("Valid units for this parameter are \"ms\", \"s\", \"min\", \"h\", and \"d\".");
+ unitlen = 0;
+ while (*endptr != '\0' && unitlen < MAX_UNIT_LEN)
+ unit[unitlen++] = *(endptr++);
+ unit[unitlen] = '\0';
- if (strncmp(endptr, "ms", 2) == 0)
- {
- endptr += 2;
- switch (flags & GUC_UNIT_TIME)
- {
- case GUC_UNIT_S:
- val /= MS_PER_S;
- break;
- case GUC_UNIT_MIN:
- val /= MS_PER_MIN;
- break;
- }
- }
- else if (strncmp(endptr, "s", 1) == 0)
- {
- endptr += 1;
- switch (flags & GUC_UNIT_TIME)
- {
- case GUC_UNIT_MS:
- val *= MS_PER_S;
- break;
- case GUC_UNIT_MIN:
- val /= S_PER_MIN;
- break;
- }
- }
- else if (strncmp(endptr, "min", 3) == 0)
- {
- endptr += 3;
- switch (flags & GUC_UNIT_TIME)
- {
- case GUC_UNIT_MS:
- val *= MS_PER_MIN;
- break;
- case GUC_UNIT_S:
- val *= S_PER_MIN;
- break;
- }
- }
- else if (strncmp(endptr, "h", 1) == 0)
- {
- endptr += 1;
- switch (flags & GUC_UNIT_TIME)
- {
- case GUC_UNIT_MS:
- val *= MS_PER_H;
- break;
- case GUC_UNIT_S:
- val *= S_PER_H;
- break;
- case GUC_UNIT_MIN:
- val *= MIN_PER_H;
- break;
- }
- }
- else if (strncmp(endptr, "d", 1) == 0)
- {
- endptr += 1;
- switch (flags & GUC_UNIT_TIME)
- {
- case GUC_UNIT_MS:
- val *= MS_PER_D;
- break;
- case GUC_UNIT_S:
- val *= S_PER_D;
- break;
- case GUC_UNIT_MIN:
- val *= MIN_PER_D;
- break;
- }
- }
- }
+ converted = convert_to_base_unit(val, unit, (flags & GUC_UNIT), &val);
/* allow whitespace after unit */
while (isspace((unsigned char) *endptr))
endptr++;
- if (*endptr != '\0')
- return false; /* appropriate hint, if any, already set */
+ if (!converted || *endptr != '\0')
+ {
+ /* invalid unit, or garbage after the unit; set hint and fail. */
+ if (hintmsg)
+ {
+ if (flags & GUC_UNIT_MEMORY)
+ *hintmsg = memory_units_hint;
+ else
+ *hintmsg = time_units_hint;
+ }
+ return false;
+ }
/* Check for overflow due to units conversion */
if (val != (int64) ((int32) val))
@@ -8096,76 +8109,10 @@ _ShowOption(struct config_generic * record, bool use_units)
int64 result = *conf->variable;
const char *unit;
- if (use_units && result > 0 &&
- (record->flags & GUC_UNIT_MEMORY))
- {
- switch (record->flags & GUC_UNIT_MEMORY)
- {
- case GUC_UNIT_BLOCKS:
- result *= BLCKSZ / 1024;
- break;
- case GUC_UNIT_XBLOCKS:
- result *= XLOG_BLCKSZ / 1024;
- break;
- }
-
- if (result % KB_PER_TB == 0)
- {
- result /= KB_PER_TB;
- unit = "TB";
- }
- else if (result % KB_PER_GB == 0)
- {
- result /= KB_PER_GB;
- unit = "GB";
- }
- else if (result % KB_PER_MB == 0)
- {
- result /= KB_PER_MB;
- unit = "MB";
- }
- else
- {
- unit = "kB";
- }
- }
- else if (use_units && result > 0 &&
- (record->flags & GUC_UNIT_TIME))
+ if (use_units && result > 0 && (record->flags & GUC_UNIT))
{
- switch (record->flags & GUC_UNIT_TIME)
- {
- case GUC_UNIT_S:
- result *= MS_PER_S;
- break;
- case GUC_UNIT_MIN:
- result *= MS_PER_MIN;
- break;
- }
-
- if (result % MS_PER_D == 0)
- {
- result /= MS_PER_D;
- unit = "d";
- }
- else if (result % MS_PER_H == 0)
- {
- result /= MS_PER_H;
- unit = "h";
- }
- else if (result % MS_PER_MIN == 0)
- {
- result /= MS_PER_MIN;
- unit = "min";
- }
- else if (result % MS_PER_S == 0)
- {
- result /= MS_PER_S;
- unit = "s";
- }
- else
- {
- unit = "ms";
- }
+ convert_from_base_unit(result, record->flags & GUC_UNIT,
+ &result, &unit);
}
else
unit = "";
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 717f46b..9a9a7a0 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -212,6 +212,8 @@ typedef enum
#define GUC_UNIT_MIN 0x4000 /* value is in minutes */
#define GUC_UNIT_TIME 0x7000 /* mask for MS, S, MIN */
+#define GUC_UNIT (GUC_UNIT_MEMORY | GUC_UNIT_TIME)
+
#define GUC_NOT_WHILE_SEC_REST 0x8000 /* can't set if security restricted */
#define GUC_DISALLOW_IN_AUTO_FILE 0x00010000 /* can't set in
* PG_AUTOCONF_FILENAME */
--
2.1.4
0002-Renumber-GUC_-constants.patch
From 7b88a19c26b1333cd80c5ba52145aead4ee08859 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 13 Feb 2015 16:13:53 +0200
Subject: [PATCH 2/4] Renumber GUC_* constants.
This moves all the regular flags back together (for aesthetic reasons), and
makes room for more GUC_UNIT_* types.
---
src/include/utils/guc.h | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 9a9a7a0..22d3a6f 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -201,22 +201,21 @@ typedef enum
#define GUC_CUSTOM_PLACEHOLDER 0x0080 /* placeholder for custom variable */
#define GUC_SUPERUSER_ONLY 0x0100 /* show only to superusers */
#define GUC_IS_NAME 0x0200 /* limit string to NAMEDATALEN-1 */
+#define GUC_NOT_WHILE_SEC_REST 0x0400 /* can't set if security restricted */
+#define GUC_DISALLOW_IN_AUTO_FILE 0x0800 /* can't set in PG_AUTOCONF_FILENAME */
-#define GUC_UNIT_KB 0x0400 /* value is in kilobytes */
-#define GUC_UNIT_BLOCKS 0x0800 /* value is in blocks */
-#define GUC_UNIT_XBLOCKS 0x0C00 /* value is in xlog blocks */
-#define GUC_UNIT_MEMORY 0x0C00 /* mask for KB, BLOCKS, XBLOCKS */
+#define GUC_UNIT_KB 0x1000 /* value is in kilobytes */
+#define GUC_UNIT_BLOCKS 0x2000 /* value is in blocks */
+#define GUC_UNIT_XBLOCKS 0x3000 /* value is in xlog blocks */
+#define GUC_UNIT_MEMORY 0xF000 /* mask for KB, BLOCKS, XBLOCKS */
-#define GUC_UNIT_MS 0x1000 /* value is in milliseconds */
-#define GUC_UNIT_S 0x2000 /* value is in seconds */
-#define GUC_UNIT_MIN 0x4000 /* value is in minutes */
-#define GUC_UNIT_TIME 0x7000 /* mask for MS, S, MIN */
+#define GUC_UNIT_MS 0x10000 /* value is in milliseconds */
+#define GUC_UNIT_S 0x20000 /* value is in seconds */
+#define GUC_UNIT_MIN 0x30000 /* value is in minutes */
+#define GUC_UNIT_TIME 0xF0000 /* mask for MS, S, MIN */
#define GUC_UNIT (GUC_UNIT_MEMORY | GUC_UNIT_TIME)
-#define GUC_NOT_WHILE_SEC_REST 0x8000 /* can't set if security restricted */
-#define GUC_DISALLOW_IN_AUTO_FILE 0x00010000 /* can't set in
- * PG_AUTOCONF_FILENAME */
/* GUC vars that are actually declared in guc.c, rather than elsewhere */
extern bool log_duration;
--
2.1.4
0003-Add-support-for-using-WAL-segments-as-GUC-base-unit.patch
From 30013ed3068613079efc5fd0203941b4b7628802 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 13 Feb 2015 16:36:34 +0200
Subject: [PATCH 3/4] Add support for using WAL segments as GUC base unit
---
src/backend/utils/misc/guc.c | 8 ++++++++
src/include/utils/guc.h | 1 +
2 files changed, 9 insertions(+)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 59e25af..20bfbcc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -685,6 +685,9 @@ typedef struct
#if XLOG_BLCKSZ < 1024 || XLOG_BLCKSZ > (1024*1024)
#error XLOG_BLCKSZ must be between 1KB and 1MB
#endif
+#if XLOG_SEG_SIZE < (1024*1024) || XLOG_BLCKSZ > (1024*1024*1024)
+#error XLOG_SEG_SIZE must be between 1MB and 1GB
+#endif
static const char *memory_units_hint =
gettext_noop("Valid units for this parameter are \"kB\", \"MB\", \"GB\", and \"TB\".");
@@ -706,6 +709,11 @@ static const unit_conversion memory_unit_conversion_table[] =
{ "MB", GUC_UNIT_XBLOCKS, 1024 / (XLOG_BLCKSZ / 1024) },
{ "kB", GUC_UNIT_XBLOCKS, -(XLOG_BLCKSZ / 1024) },
+ { "TB", GUC_UNIT_XSEGS, (1024*1024*1024) / (XLOG_SEG_SIZE / 1024) },
+ { "GB", GUC_UNIT_XSEGS, (1024*1024) / (XLOG_SEG_SIZE / 1024) },
+ { "MB", GUC_UNIT_XSEGS, -(XLOG_SEG_SIZE / (1024 * 1024)) },
+ { "kB", GUC_UNIT_XSEGS, -(XLOG_SEG_SIZE / 1024) },
+
{ "" } /* end of table marker */
};
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 22d3a6f..d3100d1 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -207,6 +207,7 @@ typedef enum
#define GUC_UNIT_KB 0x1000 /* value is in kilobytes */
#define GUC_UNIT_BLOCKS 0x2000 /* value is in blocks */
#define GUC_UNIT_XBLOCKS 0x3000 /* value is in xlog blocks */
+#define GUC_UNIT_XSEGS 0x4000 /* value is in xlog segments */
#define GUC_UNIT_MEMORY 0xF000 /* mask for KB, BLOCKS, XBLOCKS */
#define GUC_UNIT_MS 0x10000 /* value is in milliseconds */
--
2.1.4
0004-Replace-checkpoint_segments-with-min_wal_size-and-ma.patch
From 5af7d9720bb498c5051ca3e78f792144a93a9d9f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 13 Feb 2015 18:59:25 +0200
Subject: [PATCH 4/4] Replace checkpoint_segments with min_wal_size and
max_wal_size.
Instead of having a single knob (checkpoint_segments) that both triggers
checkpoints, and determines how many checkpoints to recycle, they are now
separate concerns. There is still an internal variable called
CheckpointSegments, which triggers checkpoints. But it no longer determines
how many segments to recycle at a checkpoint. That is now auto-tuned by
keeping a moving average of the distance between checkpoints (in bytes),
and try to keep that many segments in reserve. The advantage of this is
that you can set max_wal_size very high, but the system won't actually
consume that much space if there isn't any need for it. The min_wal_size
sets a floor for that; you can effectively disable the auto-tuning behavior
by setting min_wal_size equal to max_wal_size.
The max_wal_size setting is now the actual target size of WAL at which a
new checkpoint is triggered, instead of the distance between checkpoints.
Previously, you could calculate the actual WAL usage with the formula
"(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this
patch, you set the desired WAL usage with max_wal_size, and the system
calculates the appropriate CheckpointSegments with the reverse of that
formula. That's a lot more intuitive for administrators to set.
Reviewed by Amit Kapila and Venkata Balaji N.
---
doc/src/sgml/config.sgml | 40 +++-
doc/src/sgml/perform.sgml | 16 +-
doc/src/sgml/wal.sgml | 69 ++++---
src/backend/access/transam/xlog.c | 262 ++++++++++++++++++++------
src/backend/postmaster/checkpointer.c | 6 +-
src/backend/utils/misc/guc.c | 22 ++-
src/backend/utils/misc/postgresql.conf.sample | 3 +-
src/include/access/xlog.h | 8 +-
8 files changed, 318 insertions(+), 108 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bcb106..ac50105 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1325,7 +1325,7 @@ include_dir 'conf.d'
40% of RAM to <varname>shared_buffers</varname> will work better than a
smaller amount. Larger settings for <varname>shared_buffers</varname>
usually require a corresponding increase in
- <varname>checkpoint_segments</varname>, in order to spread out the
+ <varname>max_wal_size</varname>, in order to spread out the
process of writing large quantities of new or changed data over a
longer period of time.
</para>
@@ -2394,18 +2394,20 @@ include_dir 'conf.d'
<title>Checkpoints</title>
<variablelist>
- <varlistentry id="guc-checkpoint-segments" xreflabel="checkpoint_segments">
- <term><varname>checkpoint_segments</varname> (<type>integer</type>)
+ <varlistentry id="guc-max-wal-size" xreflabel="max_wal_size">
+ <term><varname>max_wal_size</varname> (<type>integer</type>)</term>
<indexterm>
- <primary><varname>checkpoint_segments</> configuration parameter</primary>
+ <primary><varname>max_wal_size</> configuration parameter</primary>
</indexterm>
- </term>
<listitem>
<para>
- Maximum number of log file segments between automatic WAL
- checkpoints (each segment is normally 16 megabytes). The default
- is three segments. Increasing this parameter can increase the
- amount of time needed for crash recovery.
+ Maximum size to let the WAL grow to between automatic WAL
+ checkpoints. This is a soft limit; WAL size can exceed
+ <varname>max_wal_size</> under special circumstances, like
+ under heavy load, a failing <varname>archive_command</>, or a high
+ <varname>wal_keep_segments</> setting. The default is 128 MB.
+ Increasing this parameter can increase the amount of time needed for
+ crash recovery.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
@@ -2458,7 +2460,7 @@ include_dir 'conf.d'
Write a message to the server log if checkpoints caused by
the filling of checkpoint segment files happen closer together
than this many seconds (which suggests that
- <varname>checkpoint_segments</> ought to be raised). The default is
+ <varname>max_wal_size</> ought to be raised). The default is
30 seconds (<literal>30s</>). Zero disables the warning.
No warnings will be generated if <varname>checkpoint_timeout</varname>
is less than <varname>checkpoint_warning</varname>.
@@ -2468,6 +2470,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
+ <term><varname>min_wal_size</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>min_wal_size</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ As long as WAL disk usage stays below this setting, old WAL files are
+ always recycled for future use at a checkpoint, rather than removed.
+ This can be used to ensure that enough WAL space is reserved to
+ handle spikes in WAL usage, for example when running large batch
+ jobs. The default is 80 MB.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 5a087fb..c73580e 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1328,19 +1328,19 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
</para>
</sect2>
- <sect2 id="populate-checkpoint-segments">
- <title>Increase <varname>checkpoint_segments</varname></title>
+ <sect2 id="populate-max-wal-size">
+ <title>Increase <varname>max_wal_size</varname></title>
<para>
- Temporarily increasing the <xref
- linkend="guc-checkpoint-segments"> configuration variable can also
+ Temporarily increasing the <xref linkend="guc-max-wal-size">
+ configuration variable can also
make large data loads faster. This is because loading a large
amount of data into <productname>PostgreSQL</productname> will
cause checkpoints to occur more often than the normal checkpoint
frequency (specified by the <varname>checkpoint_timeout</varname>
configuration variable). Whenever a checkpoint occurs, all dirty
pages must be flushed to disk. By increasing
- <varname>checkpoint_segments</varname> temporarily during bulk
+ <varname>max_wal_size</varname> temporarily during bulk
data loads, the number of checkpoints that are required can be
reduced.
</para>
@@ -1445,7 +1445,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
<para>
Set appropriate (i.e., larger than normal) values for
<varname>maintenance_work_mem</varname> and
- <varname>checkpoint_segments</varname>.
+ <varname>max_wal_size</varname>.
</para>
</listitem>
<listitem>
@@ -1512,7 +1512,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
So when loading a data-only dump, it is up to you to drop and recreate
indexes and foreign keys if you wish to use those techniques.
- It's still useful to increase <varname>checkpoint_segments</varname>
+ It's still useful to increase <varname>max_wal_size</varname>
while loading the data, but don't bother increasing
<varname>maintenance_work_mem</varname>; rather, you'd do that while
manually recreating indexes and foreign keys afterwards.
@@ -1577,7 +1577,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
<listitem>
<para>
- Increase <xref linkend="guc-checkpoint-segments"> and <xref
+ Increase <xref linkend="guc-max-wal-size"> and <xref
linkend="guc-checkpoint-timeout"> ; this reduces the frequency
of checkpoints, but increases the storage requirements of
<filename>/pg_xlog</>.
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 1254c03..b57749f 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -472,9 +472,10 @@
<para>
The server's checkpointer process automatically performs
a checkpoint every so often. A checkpoint is begun every <xref
- linkend="guc-checkpoint-segments"> log segments, or every <xref
- linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
- The default settings are 3 segments and 300 seconds (5 minutes), respectively.
+ linkend="guc-checkpoint-timeout"> seconds, or if
+ <xref linkend="guc-max-wal-size"> is about to be exceeded,
+ whichever comes first.
+ The default settings are 5 minutes and 128 MB, respectively.
If no WAL has been written since the previous checkpoint, new checkpoints
will be skipped even if <varname>checkpoint_timeout</> has passed.
(If WAL archiving is being used and you want to put a lower limit on how
@@ -486,8 +487,8 @@
</para>
<para>
- Reducing <varname>checkpoint_segments</varname> and/or
- <varname>checkpoint_timeout</varname> causes checkpoints to occur
+ Reducing <varname>checkpoint_timeout</varname> and/or
+ <varname>max_wal_size</varname> causes checkpoints to occur
more often. This allows faster after-crash recovery, since less work
will need to be redone. However, one must balance this against the
increased cost of flushing dirty data pages more often. If
@@ -510,11 +511,11 @@
parameter. If checkpoints happen closer together than
<varname>checkpoint_warning</> seconds,
a message will be output to the server log recommending increasing
- <varname>checkpoint_segments</varname>. Occasional appearance of such
+ <varname>max_wal_size</varname>. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such
as large <command>COPY</> transfers might cause a number of such warnings
- to appear if you have not set <varname>checkpoint_segments</> high
+ to appear if you have not set <varname>max_wal_size</> high
enough.
</para>
@@ -525,10 +526,10 @@
<xref linkend="guc-checkpoint-completion-target">, which is
given as a fraction of the checkpoint interval.
The I/O rate is adjusted so that the checkpoint finishes when the
- given fraction of <varname>checkpoint_segments</varname> WAL segments
- have been consumed since checkpoint start, or the given fraction of
- <varname>checkpoint_timeout</varname> seconds have elapsed,
- whichever is sooner. With the default value of 0.5,
+ given fraction of
+ <varname>checkpoint_timeout</varname> seconds have elapsed, or before
+ <varname>max_wal_size</varname> is exceeded, whichever is sooner.
+ With the default value of 0.5,
<productname>PostgreSQL</> can be expected to complete each checkpoint
in about half the time before the next checkpoint starts. On a system
that's very close to maximum I/O throughput during normal operation,
@@ -545,18 +546,35 @@
</para>
<para>
- There will always be at least one WAL segment file, and will normally
- not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
- or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
- files. Each segment file is normally 16 MB (though this size can be
- altered when building the server). You can use this to estimate space
- requirements for <acronym>WAL</acronym>.
- Ordinarily, when old log segment files are no longer needed, they
- are recycled (that is, renamed to become future segments in the numbered
- sequence). If, due to a short-term peak of log output rate, there
- are more than 3 * <varname>checkpoint_segments</varname> + 1
- segment files, the unneeded segment files will be deleted instead
- of recycled until the system gets back under this limit.
+ The number of WAL segment files in <filename>pg_xlog</> directory depends on
+ <varname>min_wal_size</>, <varname>max_wal_size</> and
+ the amount of WAL generated in previous checkpoint cycles. When old log
+ segment files are no longer needed, they are removed or recycled (that is,
+ renamed to become future segments in the numbered sequence). If, due to a
+ short-term peak of log output rate, <varname>max_wal_size</> is
+ exceeded, the unneeded segment files will be removed until the system
+ gets back under this limit. Below that limit, the system recycles enough
+ WAL files to cover the estimated need until the next checkpoint, and
+ removes the rest. The estimate is based on a moving average of the number
+ of WAL files used in previous checkpoint cycles. The moving average
+ is increased immediately if the actual usage exceeds the estimate, so it
+ accommodates peak usage rather average usage to some extent.
+ <varname>min_wal_size</> puts a minimum on the amount of WAL files
+ recycled for future usage; that much WAL is always recycled for future use,
+ even if the system is idle and the WAL usage estimate suggests that little
+ WAL is needed.
+ </para>
+
+ <para>
+ Independently of <varname>max_wal_size</varname>,
+ <xref linkend="guc-wal-keep-segments"> + 1 most recent WAL files are
+ kept at all times. Also, if WAL archiving is used, old segments can not be
+ removed or recycled until they are archived. If WAL archiving cannot keep up
+ with the pace that WAL is generated, or if <varname>archive_command</varname>
+ fails repeatedly, old WAL files will accumulate in <filename>pg_xlog</>
+ until the situation is resolved. A slow or failed standby server that
+ uses a replication slot will have the same effect (see
+ <xref linkend="streaming-replication-slots">).
</para>
<para>
@@ -571,9 +589,8 @@
master because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at
least <varname>checkpoint_timeout</> seconds have passed since the last
- restartpoint. In standby mode, a restartpoint is also triggered if at
- least <varname>checkpoint_segments</> log segments have been replayed
- since the last restartpoint.
+ restartpoint, or if WAL size is about to exceed
+ <varname>max_wal_size</>.
</para>
<para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..f0741bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,7 +79,8 @@ extern uint32 bootstrap_data_checksum_version;
/* User-settable parameters */
-int CheckPointSegments = 3;
+int max_wal_size = 8; /* 128 MB */
+int min_wal_size = 5; /* 80 MB */
int wal_keep_segments = 0;
int XLOGbuffers = -1;
int XLogArchiveTimeout = 0;
@@ -106,18 +107,14 @@ bool XLOG_DEBUG = false;
#define NUM_XLOGINSERT_LOCKS 8
/*
- * XLOGfileslop is the maximum number of preallocated future XLOG segments.
- * When we are done with an old XLOG segment file, we will recycle it as a
- * future XLOG segment as long as there aren't already XLOGfileslop future
- * segments; else we'll delete it. This could be made a separate GUC
- * variable, but at present I think it's sufficient to hardwire it as
- * 2*CheckPointSegments+1. Under normal conditions, a checkpoint will free
- * no more than 2*CheckPointSegments log segments, and we want to recycle all
- * of them; the +1 allows boundary cases to happen without wasting a
- * delete/create-segment cycle.
+ * Max distance from last checkpoint, before triggering a new xlog-based
+ * checkpoint.
*/
-#define XLOGfileslop (2*CheckPointSegments + 1)
+int CheckPointSegments;
+/* Estimated distance between checkpoints, in bytes */
+static double CheckPointDistanceEstimate = 0;
+static double PrevCheckPointDistance = 0;
/*
* GUC support
@@ -778,7 +775,7 @@ static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
static bool XLogCheckpointNeeded(XLogSegNo new_segno);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
- bool find_free, int *max_advance,
+ bool find_free, XLogSegNo max_segno,
bool use_lock);
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
int source, bool notexistOk);
@@ -791,7 +788,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
+static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
@@ -1958,6 +1955,104 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
}
/*
+ * Calculate CheckPointSegments based on max_wal_size and
+ * checkpoint_completion_target.
+ */
+static void
+CalculateCheckpointSegments(void)
+{
+ double target;
+
+ /*-------
+ * Calculate the distance at which to trigger a checkpoint, to avoid
+ * exceeding max_wal_size. This is based on two assumptions:
+ *
+ * a) we keep WAL for two checkpoint cycles, back to the "prev" checkpoint.
+ * b) during checkpoint, we consume checkpoint_completion_target *
+ * number of segments consumed between checkpoints.
+ *-------
+ */
+ target = (double ) max_wal_size / (2.0 + CheckPointCompletionTarget);
+
+ /* round down */
+ CheckPointSegments = (int) target;
+
+ if (CheckPointSegments < 1)
+ CheckPointSegments = 1;
+}
+
+void
+assign_max_wal_size(int newval, void *extra)
+{
+ max_wal_size = newval;
+ CalculateCheckpointSegments();
+}
+
+void
+assign_checkpoint_completion_target(double newval, void *extra)
+{
+ CheckPointCompletionTarget = newval;
+ CalculateCheckpointSegments();
+}
+
+/*
+ * At a checkpoint, how many WAL segments to recycle as preallocated future
+ * XLOG segments? Returns the highest segment that should be preallocated.
+ */
+static XLogSegNo
+XLOGfileslop(XLogRecPtr PriorRedoPtr)
+{
+ XLogSegNo minSegNo;
+ XLogSegNo maxSegNo;
+ double distance;
+ XLogSegNo recycleSegNo;
+
+ /*
+ * Calculate the segment numbers that min_wal_size and max_wal_size
+ * correspond to. Always recycle enough segments to meet the minimum, and
+ * remove enough segments to stay below the maximum.
+ */
+ minSegNo = PriorRedoPtr / XLOG_SEG_SIZE + min_wal_size - 1;
+ maxSegNo = PriorRedoPtr / XLOG_SEG_SIZE + max_wal_size - 1;
+
+ /*
+ * Between those limits, recycle enough segments to get us through to the
+ * estimated end of next checkpoint.
+ *
+ * To estimate where the next checkpoint will finish, assume that the
+ * system runs steadily consuming CheckPointDistanceEstimate
+ * bytes between every checkpoint.
+ *
+ * The reason this calculation is done from the prior checkpoint, not the
+ * one that just finished, is that this behaves better if some checkpoint
+ * cycles are abnormally short, like if you perform a manual checkpoint
+ * right after a timed one. The manual checkpoint will make almost a full
+ * cycle's worth of WAL segments available for recycling, because the
+ * segments from the prior's prior, fully-sized checkpoint cycle are no
+ * longer needed. However, the next checkpoint will make only a few segments
+ * available for recycling, the ones generated between the timed
+ * checkpoint and the manual one right after that. If at the manual
+ * checkpoint we only retained enough segments to get us to the next timed
+ * one, and removed the rest, then at the next checkpoint we would not
+ * have enough segments around for recycling, to get us to the checkpoint
+ * after that. Basing the calculations on the distance from the prior redo
+ * pointer largely fixes that problem.
+ */
+ distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
+ /* add 10% for good measure. */
+ distance *= 1.10;
+
+ recycleSegNo = (XLogSegNo) ceil(((double) PriorRedoPtr + distance) / XLOG_SEG_SIZE);
+
+ if (recycleSegNo < minSegNo)
+ recycleSegNo = minSegNo;
+ if (recycleSegNo > maxSegNo)
+ recycleSegNo = maxSegNo;
+
+ return recycleSegNo;
+}
+
+/*
* Check whether we've consumed enough xlog space that a checkpoint is needed.
*
* new_segno indicates a log file that has just been filled up (or read
@@ -2764,7 +2859,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
char zbuffer_raw[XLOG_BLCKSZ + MAXIMUM_ALIGNOF];
char *zbuffer;
XLogSegNo installed_segno;
- int max_advance;
+ XLogSegNo max_segno;
int fd;
int nbytes;
@@ -2867,9 +2962,19 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
* pre-create a future log segment.
*/
installed_segno = logsegno;
- max_advance = XLOGfileslop;
+
+ /*
+ * XXX: What should we use as max_segno? We used to use XLOGfileslop when
+ * that was a constant, but that was always a bit dubious: normally, at a
+ * checkpoint, XLOGfileslop was the offset from the checkpoint record,
+ * but here, it was the offset from the insert location. We can't do the
+ * normal XLOGfileslop calculation here because we don't have access to
+ * the prior checkpoint's redo location. So somewhat arbitrarily, just
+ * use CheckPointSegments.
+ */
+ max_segno = logsegno + CheckPointSegments;
if (!InstallXLogFileSegment(&installed_segno, tmppath,
- *use_existent, &max_advance,
+ *use_existent, max_segno,
use_lock))
{
/*
@@ -3010,7 +3115,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
/*
* Now move the segment into place with its final name.
*/
- if (!InstallXLogFileSegment(&destsegno, tmppath, false, NULL, false))
+ if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
elog(ERROR, "InstallXLogFileSegment should not have failed");
}
@@ -3030,22 +3135,21 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
* number at or after the passed numbers. If FALSE, install the new segment
* exactly where specified, deleting any existing segment file there.
*
- * *max_advance: maximum number of segno slots to advance past the starting
- * point. Fail if no free slot is found in this range. On return, reduced
- * by the number of slots skipped over. (Irrelevant, and may be NULL,
- * when find_free is FALSE.)
+ * max_segno: maximum segment number to install the new file as. Fail if no
+ * free slot is found between *segno and max_segno. (Ignored when find_free
+ * is FALSE.)
*
* use_lock: if TRUE, acquire ControlFileLock while moving file into
* place. This should be TRUE except during bootstrap log creation. The
* caller must *not* hold the lock at call.
*
* Returns TRUE if the file was installed successfully. FALSE indicates that
- * max_advance limit was exceeded, or an error occurred while renaming the
+ * max_segno limit was exceeded, or an error occurred while renaming the
* file into place.
*/
static bool
InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
- bool find_free, int *max_advance,
+ bool find_free, XLogSegNo max_segno,
bool use_lock)
{
char path[MAXPGPATH];
@@ -3069,7 +3173,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
/* Find a free slot to put it in */
while (stat(path, &stat_buf) == 0)
{
- if (*max_advance <= 0)
+ if ((*segno) >= max_segno)
{
/* Failed to find a free slot within specified range */
if (use_lock)
@@ -3077,7 +3181,6 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
return false;
}
(*segno)++;
- (*max_advance)--;
XLogFilePath(path, ThisTimeLineID, *segno);
}
}
@@ -3425,14 +3528,15 @@ UpdateLastRemovedPtr(char *filename)
/*
* Recycle or remove all log files older or equal to passed segno
*
- * endptr is current (or recent) end of xlog; this is used to determine
+ * endptr is current (or recent) end of xlog, and PriorRedoRecPtr is the
+ * redo pointer of the previous checkpoint. These are used to determine
* whether we want to recycle rather than delete no-longer-wanted log files.
*/
static void
-RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
+RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
{
XLogSegNo endlogSegNo;
- int max_advance;
+ XLogSegNo recycleSegNo;
DIR *xldir;
struct dirent *xlde;
char lastoff[MAXFNAMELEN];
@@ -3444,11 +3548,10 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
struct stat statbuf;
/*
- * Initialize info about where to try to recycle to. We allow recycling
- * segments up to XLOGfileslop segments beyond the current XLOG location.
+ * Initialize info about where to try to recycle to.
*/
XLByteToPrevSeg(endptr, endlogSegNo);
- max_advance = XLOGfileslop;
+ recycleSegNo = XLOGfileslop(PriorRedoPtr);
xldir = AllocateDir(XLOGDIR);
if (xldir == NULL)
@@ -3497,20 +3600,17 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
* for example can create symbolic links pointing to a
* separate archive directory.
*/
- if (lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
+ if (endlogSegNo <= recycleSegNo &&
+ lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
InstallXLogFileSegment(&endlogSegNo, path,
- true, &max_advance, true))
+ true, recycleSegNo, true))
{
ereport(DEBUG2,
(errmsg("recycled transaction log file \"%s\"",
xlde->d_name)));
CheckpointStats.ckpt_segs_recycled++;
/* Needn't recheck that slot on future iterations */
- if (max_advance > 0)
- {
- endlogSegNo++;
- max_advance--;
- }
+ endlogSegNo++;
}
else
{
@@ -7593,7 +7693,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
+ "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
@@ -7605,7 +7706,48 @@ LogCheckpointEnd(bool restartpoint)
total_secs, total_usecs / 1000,
CheckpointStats.ckpt_sync_rels,
longest_secs, longest_usecs / 1000,
- average_secs, average_usecs / 1000);
+ average_secs, average_usecs / 1000,
+ (int) (PrevCheckPointDistance / 1024.0),
+ (int) (CheckPointDistanceEstimate / 1024.0));
+}
+
+/*
+ * Update the estimate of distance between checkpoints.
+ *
+ * The estimate is used to calculate the number of WAL segments to keep
+ * preallocated, see XLOGfileslop().
+ */
+static void
+UpdateCheckPointDistanceEstimate(uint64 nbytes)
+{
+ /*
+ * To estimate the number of segments consumed between checkpoints, keep
+ * a moving average of the amount of WAL generated in previous checkpoint
+ * cycles. However, if the load is bursty, with quiet periods and busy
+ * periods, we want to cater for the peak load. So instead of a plain
+ * moving average, let the average decline slowly if the previous cycle
+ * used less WAL than estimated, but bump it up immediately if it used
+ * more.
+ *
+ * When checkpoints are triggered by max_wal_size, this should converge to
+ * CheckPointSegments * XLOG_SEG_SIZE.
+ *
+ * Note: This doesn't pay any attention to what caused the checkpoint.
+ * Checkpoints triggered manually with CHECKPOINT command, or by e.g.
+ * starting a base backup, are counted the same as those created
+ * automatically. The slow-decline will largely mask them out, if they are
+ * not frequent. If they are frequent, it seems reasonable to count them
+ * in as any others; if you issue a manual checkpoint every 5 minutes and
+ * never let a timed checkpoint happen, it makes sense to base the
+ * preallocation on that 5 minute interval rather than whatever
+ * checkpoint_timeout is set to.
+ */
+ PrevCheckPointDistance = nbytes;
+ if (CheckPointDistanceEstimate < nbytes)
+ CheckPointDistanceEstimate = nbytes;
+ else
+ CheckPointDistanceEstimate =
+ (0.90 * CheckPointDistanceEstimate + 0.10 * (double) nbytes);
}
/*
@@ -7645,7 +7787,7 @@ CreateCheckPoint(int flags)
XLogRecPtr recptr;
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint32 freespace;
- XLogSegNo _logSegNo;
+ XLogRecPtr PriorRedoPtr;
XLogRecPtr curInsert;
VirtualTransactionId *vxids;
int nvxids;
@@ -7960,10 +8102,10 @@ CreateCheckPoint(int flags)
(errmsg("concurrent transaction log activity while database system is shutting down")));
/*
- * Select point at which we can truncate the log, which we base on the
- * prior checkpoint's earliest info.
+ * Remember the prior checkpoint's redo pointer, used later to determine
+ * the point where the log can be truncated.
*/
- XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
+ PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update the control file.
@@ -8018,11 +8160,17 @@ CreateCheckPoint(int flags)
* Delete old log files (those no longer needed even for previous
* checkpoint or the standbys in XLOG streaming).
*/
- if (_logSegNo)
+ if (PriorRedoPtr != InvalidXLogRecPtr)
{
+ XLogSegNo _logSegNo;
+
+ /* Update the average distance between checkpoints. */
+ UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
+
+ XLByteToSeg(PriorRedoPtr, _logSegNo);
KeepLogSeg(recptr, &_logSegNo);
_logSegNo--;
- RemoveOldXlogFiles(_logSegNo, recptr);
+ RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, recptr);
}
/*
@@ -8190,7 +8338,7 @@ CreateRestartPoint(int flags)
{
XLogRecPtr lastCheckPointRecPtr;
CheckPoint lastCheckPoint;
- XLogSegNo _logSegNo;
+ XLogRecPtr PriorRedoPtr;
TimestampTz xtime;
/*
@@ -8255,14 +8403,14 @@ CreateRestartPoint(int flags)
/*
* Update the shared RedoRecPtr so that the startup process can calculate
* the number of segments replayed since last restartpoint, and request a
- * restartpoint if it exceeds checkpoint_segments.
+ * restartpoint if it exceeds CheckPointSegments.
*
* Like in CreateCheckPoint(), hold off insertions to update it, although
* during recovery this is just pro forma, because no WAL insertions are
* happening.
*/
WALInsertLockAcquireExclusive();
- XLogCtl->Insert.RedoRecPtr = lastCheckPoint.redo;
+ RedoRecPtr = XLogCtl->Insert.RedoRecPtr = lastCheckPoint.redo;
WALInsertLockRelease();
/* Also update the info_lck-protected copy */
@@ -8286,10 +8434,10 @@ CreateRestartPoint(int flags)
CheckPointGuts(lastCheckPoint.redo, flags);
/*
- * Select point at which we can truncate the xlog, which we base on the
- * prior checkpoint's earliest info.
+ * Remember the prior checkpoint's redo pointer, used later to determine
+ * the point at which we can truncate the log.
*/
- XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
+ PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update pg_control, using current time. Check that it still shows
@@ -8316,12 +8464,18 @@ CreateRestartPoint(int flags)
* checkpoint/restartpoint) to prevent the disk holding the xlog from
* growing full.
*/
- if (_logSegNo)
+ if (PriorRedoPtr != InvalidXLogRecPtr)
{
XLogRecPtr receivePtr;
XLogRecPtr replayPtr;
TimeLineID replayTLI;
XLogRecPtr endptr;
+ XLogSegNo _logSegNo;
+
+ /* Update the average distance between checkpoints/restartpoints. */
+ UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
+
+ XLByteToSeg(PriorRedoPtr, _logSegNo);
/*
* Get the current end of xlog replayed or received, whichever is
@@ -8350,7 +8504,7 @@ CreateRestartPoint(int flags)
if (RecoveryInProgress())
ThisTimeLineID = replayTLI;
- RemoveOldXlogFiles(_logSegNo, endptr);
+ RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, endptr);
/*
* Make more log segments if needed. (Do this after recycling old log
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 237be12..5cd85e7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -471,7 +471,7 @@ CheckpointerMain(void)
"checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs,
elapsed_secs),
- errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
+ errhint("Consider increasing the configuration parameter \"max_wal_size\".")));
/*
* Initialize checkpointer-private variables used during
@@ -749,11 +749,11 @@ IsCheckpointOnSchedule(double progress)
return false;
/*
- * Check progress against WAL segments written and checkpoint_segments.
+ * Check progress against WAL segments written and CheckPointSegments.
*
* We compare the current WAL insert location against the location
* computed before calling CreateCheckPoint. The code in XLogInsert that
- * actually triggers a checkpoint when checkpoint_segments is exceeded
+ * actually triggers a checkpoint when CheckPointSegments is exceeded
* compares against RedoRecptr, so this is not completely accurate.
* However, it's good enough for our purposes, we're only calculating an
* estimate anyway.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 20bfbcc..929c86e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2154,16 +2154,28 @@ static struct config_int ConfigureNamesInt[] =
},
{
- {"checkpoint_segments", PGC_SIGHUP, WAL_CHECKPOINTS,
- gettext_noop("Sets the maximum distance in log segments between automatic WAL checkpoints."),
- NULL
+ {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Sets the minimum size to shrink the WAL to."),
+ NULL,
+ GUC_UNIT_XSEGS
},
- &CheckPointSegments,
- 3, 1, INT_MAX,
+ &min_wal_size,
+ 5, 2, INT_MAX,
NULL, NULL, NULL
},
{
+ {"max_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Sets the WAL size that triggers a checkpoint."),
+ NULL,
+ GUC_UNIT_XSEGS
+ },
+ &max_wal_size,
+ 8, 2, INT_MAX,
+ NULL, assign_max_wal_size, NULL
+ },
+
+ {
{"checkpoint_timeout", PGC_SIGHUP, WAL_CHECKPOINTS,
gettext_noop("Sets the maximum time between automatic WAL checkpoints."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..cb678f9 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -197,8 +197,9 @@
# - Checkpoints -
-#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
+#max_wal_size = 128MB # in logfile segments
+#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 138deaf..beef652 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -89,7 +89,8 @@ extern XLogRecPtr XactLastRecEnd;
extern bool reachedConsistency;
/* these variables are GUC parameters related to XLOG */
-extern int CheckPointSegments;
+extern int min_wal_size;
+extern int max_wal_size;
extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
@@ -100,6 +101,8 @@ extern bool fullPageWrites;
extern bool wal_log_hints;
extern bool log_checkpoints;
+extern int CheckPointSegments;
+
/* WAL levels */
typedef enum WalLevel
{
@@ -245,6 +248,9 @@ extern bool CheckPromoteSignal(void);
extern void WakeupRecovery(void);
extern void SetWalWriterSleeping(bool sleeping);
+extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_checkpoint_completion_target(double newval, void *extra);
+
/*
* Starting/stopping a base backup
*/
--
2.1.4
On 13/02/15 18:43, Heikki Linnakangas wrote:
Ok, I don't hear any loud objections to min_wal_size and max_wal_size,
so let's go with that then.
Attached is a new version of this. It now comes in four patches. The
first three are just GUC-related preliminary work, the first of which I
posted on a separate thread today.
The 0001 patch is very nice, I would go ahead and commit it.
Not really sure I see the need for 0002 but it should not harm anything
so why not.
The 0003 should be part of 0004 IMHO as it does not really do anything
by itself.
I am wondering a bit about interaction with wal_keep_segments.
One thing is that wal_keep_segments is still specified in number of
segments and not size units, maybe it would be worth to change it also?
And the other thing is that, if set, the wal_keep_segments is the real
max_wal_size from the user perspective (not from perspective of the
algorithm in this patch, but user does not really care about that) which
is somewhat weird given the naming.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I am wondering a bit about interaction with wal_keep_segments.
One thing is that wal_keep_segments is still specified in number of
segments and not size units, maybe it would be worth to change it also?
And the other thing is that, if set, the wal_keep_segments is the real
max_wal_size from the user perspective (not from perspective of the
algorithm in this patch, but user does not really care about that) which is
somewhat weird given the naming.
In my opinion -
I think wal_keep_segments being a number of segments would help a lot. In my
experience, while handling production databases, to arrive at an optimal
value for wal_keep_segments we go by calculating the number of segments
generated in the wal archive destination (on an hourly or daily basis); this
further helps us calculate how many segments to keep, considering various
other factors in a replication environment, to ensure the master has enough
WAL in pg_xlog when the standby comes back up after an outage.
Of course, if we can calculate a number of segments, we can calculate the
same in terms of size too, but calculating the number of segments would be
more feasible.
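(Purely as an illustration of that kind of back-of-the-envelope sizing, with
hypothetical numbers rather than figures from any particular system, a
minimal sketch:)

#include <stdio.h>

/* Illustrative only: sizing wal_keep_segments from an observed archive rate.
 * The rate and outage window below are made-up example values. */
int
main(void)
{
	double	segments_per_hour = 120;	/* observed in the WAL archive */
	double	outage_hours = 4;			/* longest standby outage to cover */
	double	segment_mb = 16;			/* default WAL segment size */
	double	keep = segments_per_hour * outage_hours;

	printf("wal_keep_segments >= %.0f (about %.1f GB of pg_xlog)\n",
		   keep, keep * segment_mb / 1024);
	return 0;
}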
Regards,
VBN
On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
I am wondering a bit about interaction with wal_keep_segments.
One thing is that wal_keep_segments is still specified in number of segments
and not size units, maybe it would be worth to change it also?
And the other thing is that, if set, the wal_keep_segments is the real
max_wal_size from the user perspective (not from perspective of the
algorithm in this patch, but user does not really care about that) which is
somewhat weird given the naming.
It seems like wal_keep_segments is more closely related to
wal_*min*_size. The idea of both settings is that each is a minimum
amount of WAL we want to keep around for some purpose. But they're
not quite the same, I guess, because wal_min_size just forces us to
keep that many files around - they can be overwritten whenever.
wal_keep_segments is an amount of actual WAL data we want to keep
around.
Would it make sense to require that wal_keep_segments <= wal_min_size?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 23/02/15 03:24, Robert Haas wrote:
On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
I am wondering a bit about interaction with wal_keep_segments.
One thing is that wal_keep_segments is still specified in number of segments
and not size units, maybe it would be worth to change it also?
And the other thing is that, if set, the wal_keep_segments is the real
max_wal_size from the user perspective (not from perspective of the
algorithm in this patch, but user does not really care about that) which is
somewhat weird given the naming.
It seems like wal_keep_segments is more closely related to
wal_*min*_size. The idea of both settings is that each is a minimum
amount of WAL we want to keep around for some purpose. But they're
not quite the same, I guess, because wal_min_size just forces us to
keep that many files around - they can be overwritten whenever.
wal_keep_segments is an amount of actual WAL data we want to keep
around.
Err yes of course, min not max :)
Would it make sense to require that wal_keep_segments <= wal_min_size?
It would to me; the patch as it stands is confusing in the sense that you
can set min and max but then wal_keep_segments somewhat overrides those.
And BTW this brings another point, I actually don't see a check for
min_wal_size <= max_wal_size anywhere in the patch.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Feb 14, 2015 at 4:43 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 02/04/2015 11:41 PM, Josh Berkus wrote:
On 02/04/2015 12:06 PM, Robert Haas wrote:
On Wed, Feb 4, 2015 at 1:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
Let me push "max_wal_size" and "min_wal_size" again as our new parameter
names, because:
* does what it says on the tin
* new user friendly
* encourages people to express it in MB, not segments
* very different from the old name, so people will know it works
differently
That's not bad. If we added a hard WAL limit in a future release, how
would that fit into this naming scheme?
Well, first, nobody's at present proposing a patch to add a hard limit,
so I'm reluctant to choose non-obvious names to avoid conflict with a
feature nobody may ever write. There's a number of reasons a hard limit
would be difficult and/or undesirable.
If we did add one, I'd suggest calling it "wal_size_limit" or something
similar. However, we're most likely to only implement the limit for
archives, which means that it might acually be called
"archive_buffer_limit" or something more to the point.Ok, I don't hear any loud objections to min_wal_size and max_wal_size, so
let's go with that then.
Attached is a new version of this. It now comes in four patches. The first
three are just GUC-related preliminary work, the first of which I posted on
a separate thread today.
I applied all the 4 patches to the latest master successfully and performed
a test with heavy continuous load. I do not see much difference in the
checkpoint behaviour and all seems to be working as expected.
I did a test with following parameter values -
max_wal_size = 10000MB
min_wal_size = 1000MB
checkpoint_timeout = 5min
Upon performing a heavy load operation, the checkpoints were occurring
based on timeouts.
pg_xlog size fluctuated a bit (not very much). For the first few minutes
pg_xlog size stayed at 3.3G and gradually increased to a max of 5.5G during
the operation. There was continuous fluctuation in the number of segments being
removed+recycled.
A part of the checkpoint logs are as follows -
2015-02-23 15:16:00.318 GMT-10 LOG: checkpoint starting: time
2015-02-23 15:16:53.943 GMT-10 LOG: checkpoint complete: wrote 3010
buffers (18.4%); 0 transaction log file(s) added, 0 removed, 159 recycled;
write=27.171 s, sync=25.945 s, total=53.625 s; sync files=20, longest=5.376
s, average=1.297 s; distance=2748844 kB, estimate=2748844 kB
2015-02-23 15:21:00.438 GMT-10 LOG: checkpoint starting: time
2015-02-23 15:22:01.352 GMT-10 LOG: checkpoint complete: wrote 2812
buffers (17.2%); 0 transaction log file(s) added, 0 removed, 168 recycled;
write=25.351 s, sync=35.346 s, total=60.914 s; sync files=34, longest=9.025
s, average=1.039 s; distance=1983318 kB, estimate=2672291 kB
2015-02-23 15:26:00.314 GMT-10 LOG: checkpoint starting: time
2015-02-23 15:26:25.612 GMT-10 LOG: checkpoint complete: wrote 2510
buffers (15.3%); 0 transaction log file(s) added, 0 removed, 121 recycled;
write=22.623 s, sync=2.477 s, total=25.297 s; sync files=20, longest=1.418
s, average=0.123 s; distance=2537230 kB, estimate=2658785 kB
2015-02-23 15:31:00.477 GMT-10 LOG: checkpoint starting: time
2015-02-23 15:31:25.925 GMT-10 LOG: checkpoint complete: wrote 2625
buffers (16.0%); 0 transaction log file(s) added, 0 removed, 155 recycled;
write=23.657 s, sync=1.592 s, total=25.447 s; sync files=13, longest=0.319
s, average=0.122 s; distance=2797386 kB, estimate=2797386 kB
2015-02-23 15:36:00.607 GMT-10 LOG: checkpoint starting: time
2015-02-23 15:36:52.686 GMT-10 LOG: checkpoint complete: wrote 3473
buffers (21.2%); 0 transaction log file(s) added, 0 removed, 171 recycled;
write=31.257 s, sync=20.446 s, total=52.078 s; sync files=33, longest=4.512
s, average=0.619 s; distance=2153903 kB, estimate=2733038 kB
2015-02-23 15:41:00.675 GMT-10 LOG: checkpoint starting: time
2015-02-23 15:41:25.092 GMT-10 LOG: checkpoint complete: wrote 2456
buffers (15.0%); 0 transaction log file(s) added, 0 removed, 131 recycled;
write=21.974 s, sync=2.282 s, total=24.417 s; sync files=27, longest=1.275
s, average=0.084 s; distance=2258648 kB, estimate=2685599 kB
2015-02-23 15:46:00.671 GMT-10 LOG: checkpoint starting: time
2015-02-23 15:46:26.757 GMT-10 LOG: checkpoint complete: wrote 2644
buffers (16.1%); 0 transaction log file(s) added, 0 removed, 138 recycled;
write=23.619 s, sync=2.181 s, total=26.086 s; sync files=12, longest=0.709
s, average=0.181 s; distance=2787124 kB, estimate=2787124 kB
2015-02-23 15:51:00.509 GMT-10 LOG: checkpoint starting: time
2015-02-23 15:53:30.793 GMT-10 LOG: checkpoint complete: wrote 13408
buffers (81.8%); 0 transaction log file(s) added, 0 removed, 170 recycled;
write=149.432 s, sync=0.664 s, total=150.284 s; sync files=13,
longest=0.286 s, average=0.051 s; distance=1244483 kB, estimate=2632860 kB
The above checkpoint logs were generated at the time when pg_xlog size was
at 5.4G.
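(As a side note, the distance/estimate pairs in the log above line up with
the moving-average rule in the patch's UpdateCheckPointDistanceEstimate():
the estimate jumps up immediately when a cycle uses more WAL than estimated,
and otherwise declines slowly. A minimal stand-alone sketch, illustrative
only and not part of the patch, that replays the logged distances:)

#include <stdio.h>

/* Replays the "distance" values (in kB) from the checkpoint log above
 * through the update rule used in the patch: if the last cycle used more
 * WAL than estimated, jump to it at once; otherwise decay the estimate. */
int
main(void)
{
	double	distances[] = {2748844, 1983318, 2537230, 2797386,
						   2153903, 2258648, 2787124, 1244483};
	double	estimate = 0;
	int		i;

	for (i = 0; i < 8; i++)
	{
		if (estimate < distances[i])
			estimate = distances[i];
		else
			estimate = 0.90 * estimate + 0.10 * distances[i];
		printf("distance=%.0f kB, estimate=%.0f kB\n", distances[i], estimate);
	}
	return 0;
}

Running it reproduces the logged estimate values, which is a nice sanity
check that the estimator behaves as described.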
*Code* *Review*
I had a look at the code and do not have any comments from my end.
Regards,
Venkata Balaji N
On 2015-02-22 21:24:56 -0500, Robert Haas wrote:
On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
I am wondering a bit about interaction with wal_keep_segments.
One thing is that wal_keep_segments is still specified in number of segments
and not size units, maybe it would be worth to change it also?
And the other thing is that, if set, the wal_keep_segments is the real
max_wal_size from the user perspective (not from perspective of the
algorithm in this patch, but user does not really care about that) which is
somewhat weird given the naming.
It seems like wal_keep_segments is more closely related to
wal_*min*_size. The idea of both settings is that each is a minimum
amount of WAL we want to keep around for some purpose. But they're
not quite the same, I guess, because wal_min_size just forces us to
keep that many files around - they can be overwritten whenever.
wal_keep_segments is an amount of actual WAL data we want to keep
around.
Would it make sense to require that wal_keep_segments <= wal_min_size?
I don't think so. Right now checkpoint_segments is a useful tool to
relatively effectively control the amount of WAL that needs to be
replayed in the event of a crash. wal_keep_segments in contrast doesn't
have much to do with the normal working of the system, except that it
delays recycling of WAL segments a bit.
With a condition like above, how would you set up things so that you have
50k segments around for replication (say a good day's worth), but that
you will never have to replay more than ~800 segments (i.e. something
like checkpoint_segments = 800)?
Am I missing something?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/23/2015 01:01 PM, Andres Freund wrote:
On 2015-02-22 21:24:56 -0500, Robert Haas wrote:
On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
I am wondering a bit about interaction with wal_keep_segments.
One thing is that wal_keep_segments is still specified in number of segments
and not size units, maybe it would be worth to change it also?
And the other thing is that, if set, the wal_keep_segments is the real
max_wal_size from the user perspective (not from perspective of the
algorithm in this patch, but user does not really care about that) which is
somewhat weird given the naming.
It seems like wal_keep_segments is more closely related to
wal_*min*_size. The idea of both settings is that each is a minimum
amount of WAL we want to keep around for some purpose. But they're
not quite the same, I guess, because wal_min_size just forces us to
keep that many files around - they can be overwritten whenever.
wal_keep_segments is an amount of actual WAL data we want to keep
around.
Would it make sense to require that wal_keep_segments <= wal_min_size?
I don't think so. Right now checkpoint_segments is a useful tool to
relatively effectively control the amount of WAL that needs to be
replayed in the event of a crash. wal_keep_segments in contrast doesn't
have much to do with the normal working of the system, except that it
delays recycling of WAL segments a bit.
With a condition like above, how would you set up things so that you have
50k segments around for replication (say a good day's worth), but that
you will never have to replay more than ~800 segments (i.e. something
like checkpoint_segments = 800)?
Right. While wal_keep_segments and wal_min_size both set a kind of a
minimum on the amount of WAL that's kept in pg_xlog, they are different
things, and a rule that one must be less than or greater than the other
doesn't make sense.
Everyone seems to be happy with the names and behaviour of the GUCs, so
committed.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/23/2015 08:56 AM, Heikki Linnakangas wrote:
Everyone seems to be happy with the names and behaviour of the GUCs, so
committed.
Yay!
But ... I thought we were going to raise the default for max_wal_size to
something much higher, like 1GB? That's what was discussed on this thread.
When I build, I get this:
#max_wal_size = 128MB # in logfile segments
#min_wal_size = 80MB
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/26/2015 01:32 AM, Josh Berkus wrote:
But ... I thought we were going to raise the default for max_wal_size to
something much higher, like 1GB? That's what was discussed on this thread.
No conclusion was reached on that. Me and some others were against
raising the default, while others were for it.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Mar 2, 2015 at 6:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 02/26/2015 01:32 AM, Josh Berkus wrote:
But ... I thought we were going to raise the default for max_wal_size to
something much higher, like 1GB? That's what was discussed on this
thread.
No conclusion was reached on that. Me and some others were against raising
the default, while others were for it.
I guess that's a fair summary of the discussion, but I still think
it's the wrong conclusion. Right now, you can't get reasonable write
performance with PostgreSQL even on tiny databases (a few GB) without
increasing that setting by an order of magnitude. It seems an awful
shame to go to all the work to mitigate the downsides of setting a
large checkpoint_segments and then still ship a tiny default setting.
I've got to believe that the number of people who think 128MB of WAL
is tolerable but 512MB or 1GB is excessive is almost nobody. Disk
sizes these days are measured in TB.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
* Robert Haas (robertmhaas@gmail.com) wrote:
On Mon, Mar 2, 2015 at 6:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 02/26/2015 01:32 AM, Josh Berkus wrote:
But ... I thought we were going to raise the default for max_wal_size to
something much higher, like 1GB? That's what was discussed on this
thread.
No conclusion was reached on that. Me and some others were against raising
the default, while others were for it.
I guess that's a fair summary of the discussion, but I still think
it's the wrong conclusion. Right now, you can't get reasonable write
performance with PostgreSQL even on tiny databases (a few GB) without
increasing that setting by an order of magnitude. It seems an awful
shame to go to all the work to mitigate the downsides of setting a
large checkpoint_segments and then still ship a tiny default setting.
I've got to believe that the number of people who think 128MB of WAL
is tolerable but 512MB or 1GB is excessive is almost nobody. Disk
sizes these days are measured in TB.
+1. I thought the conclusion had actually been in favor of the change,
though there had been voices for and against.
Thanks,
Stephen
On 03/02/2015 05:38 AM, Stephen Frost wrote:
* Robert Haas (robertmhaas@gmail.com) wrote:
On Mon, Mar 2, 2015 at 6:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 02/26/2015 01:32 AM, Josh Berkus wrote:
But ... I thought we were going to raise the default for max_wal_size to
something much higher, like 1GB? That's what was discussed on this
thread.
No conclusion was reached on that. Me and some others were against raising
the default, while others were for it.
I guess that's a fair summary of the discussion, but I still think
it's the wrong conclusion. Right now, you can't get reasonable write
performance with PostgreSQL even on tiny databases (a few GB) without
increasing that setting by an order of magnitude. It seems an awful
shame to go to all the work to mitigate the downsides of setting a
large checkpoint_segments and then still ship a tiny default setting.
I've got to believe that the number of people who think 128MB of WAL
is tolerable but 512MB or 1GB is excessive is almost nobody. Disk
sizes these days are measured in TB.
+1. I thought the conclusion had actually been in favor of the change,
though there had been voices for and against.
That was the impression I had too, which was why I was surprised. The
last post on the topic was one by Robert Haas, agreeing with me on a
value of 1GB, and there were zero objections after that.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 03/02/2015 08:05 PM, Josh Berkus wrote:
On 03/02/2015 05:38 AM, Stephen Frost wrote:
* Robert Haas (robertmhaas@gmail.com) wrote:
On Mon, Mar 2, 2015 at 6:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 02/26/2015 01:32 AM, Josh Berkus wrote:
But ... I thought we were going to raise the default for max_wal_size to
something much higher, like 1GB? That's what was discussed on this
thread.
No conclusion was reached on that. Me and some others were against raising
the default, while others were for it.
I guess that's a fair summary of the discussion, but I still think
it's the wrong conclusion. Right now, you can't get reasonable write
performance with PostgreSQL even on tiny databases (a few GB) without
increasing that setting by an order of magnitude. It seems an awful
shame to go to all the work to mitigate the downsides of setting a
large checkpoint_segments and then still ship a tiny default setting.
I've got to believe that the number of people who think 128MB of WAL
is tolerable but 512MB or 1GB is excessive is almost nobody. Disk
sizes these days are measured in TB.
+1. I thought the conclusion had actually been in favor of the change,
though there had been voices for and against.
That was the impression I had too, which was why I was surprised. The
last post on the topic was one by Robert Haas, agreeing with me on a
value of 1GB, and there were zero objections after that.
I didn't make any further posts to that thread because I had already
objected earlier and didn't have anything to add.
Now, if someone's going to go and raise the default, I'm not going to
make a fuss about it, but the fact remains that *all* the defaults in
postgresql.conf.sample are geared towards small systems, and not hogging
all resources. The default max_wal_size of 128 MB is well in line with
e.g. shared_buffers=128MB.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki,
On Monday, March 2, 2015, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/02/2015 08:05 PM, Josh Berkus wrote:
On 03/02/2015 05:38 AM, Stephen Frost wrote:
* Robert Haas (robertmhaas@gmail.com) wrote:
On Mon, Mar 2, 2015 at 6:43 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
On 02/26/2015 01:32 AM, Josh Berkus wrote:
But ... I thought we were going to raise the default for max_wal_size
to
something much higher, like 1GB? That's what was discussed on this
thread.
No conclusion was reached on that. Me and some others were against
raising
the default, while others were for it.
I guess that's a fair summary of the discussion, but I still think
it's the wrong conclusion. Right now, you can't get reasonable write
performance with PostgreSQL even on tiny databases (a few GB) without
increasing that setting by an order of magnitude. It seems an awful
shame to go to all the work to mitigate the downsides of setting a
large checkpoint_segments and then still ship a tiny default setting.
I've got to believe that the number of people who think 128MB of WAL
is tolerable but 512MB or 1GB is excessive is almost nobody. Disk
sizes these days are measured in TB.
+1. I thought the conclusion had actually been in favor of the change,
though there had been voices for and against.
That was the impression I had too, which was why I was surprised. The
last post on the topic was one by Robert Haas, agreeing with me on a
value of 1GB, and there were zero objections after that.
I didn't make any further posts to that thread because I had already
objected earlier and didn't have anything to add.
Now, if someone's going to go and raise the default, I'm not going to make
a fuss about it, but the fact remains that *all* the defaults in
postgresql.conf.sample are geared towards small systems, and not hogging
all resources. The default max_wal_size of 128 MB is well in line with e.g.
shared_buffers=128MB.
Not to be too much of a pain, but I've run into very few systems where
memory and disk are less than an order of magnitude different in size. I
definitely feel we need to support users tuning their systems for smaller
sizes but I do think our defaults are too small for the majority.
Thanks!
Stephen
On 03/02/2015 12:23 PM, Heikki Linnakangas wrote:
On 03/02/2015 08:05 PM, Josh Berkus wrote:
That was the impression I had too, which was why I was surprised. The
last post on the topic was one by Robert Haas, agreeing with me on a
value of 1GB, and there were zero objections after that.
I didn't make any further posts to that thread because I had already
objected earlier and didn't have anything to add.
Now, if someone's going to go and raise the default, I'm not going to
make a fuss about it, but the fact remains that *all* the defaults in
postgresql.conf.sample are geared towards small systems, and not hogging
all resources. The default max_wal_size of 128 MB is well in line with
e.g. shared_buffers=128MB.
OK, I don't think Robert or I realized that you were still not agreeing.
I originally thought we should keep it small, but Robert pointed out
that under your code, WAL only grows if you have high traffic.
Patch attached in a new thread.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Feb 23, 2015 at 8:56 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Everyone seems to be happy with the names and behaviour of the GUCs, so
committed.
The docs suggest that max_wal_size will be respected during archive
recovery (causing restartpoints and recycling), but I'm not seeing that
happening. Is this a doc bug or an implementation bug?
Cheers,
Jeff
On Mon, Mar 16, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Feb 23, 2015 at 8:56 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:
Everyone seems to be happy with the names and behaviour of the GUCs, so
committed.
The docs suggest that max_wal_size will be respected during archive
recovery (causing restartpoints and recycling), but I'm not seeing that
happening. Is this a doc bug or an implementation bug?
I think the old behavior, where restartpoints were driven only by time and
not by volume, was a misfeature. But not a bug, because it was documented.
One of the points of max_wal_size and its predecessor is to limit how big
pg_xlog can grow. But running out of disk space on pg_xlog is no more fun
during archive recovery than it is during normal operations. So why
shouldn't max_wal_size be active during recovery?
It seems to be a trivial change to implement that, although I might be
overlooking something subtle (pasted below, also attached)
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10946,7 +10946,7 @@ XLogPageRead(XLogReaderState *xlogreader,
XLogRecPtr targetPagePtr, int reqLen,
* Request a restartpoint if we've replayed too much xlog
since the
* last one.
*/
- if (StandbyModeRequested && bgwriterLaunched)
+ if (bgwriterLaunched)
{
if (XLogCheckpointNeeded(readSegNo))
{
This keeps pg_xlog at about 67% of max_wal_size during archive recovery
(because checkpoint_completion_target is accounted for but goes unused)
Or, if we do not wish to make this change in behavior, then we should fix
the docs to re-instate this distinction between archive recovery and
standby.
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f4083c3..ebc8baa 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -589,7 +589,8 @@
master because restartpoints can only be performed at checkpoint
records.
A restartpoint is triggered when a checkpoint record is reached if at
least <varname>checkpoint_timeout</> seconds have passed since the last
- restartpoint, or if WAL size is about to exceed
+ restartpoint. In standby mode, a restartpoint is also triggered if
+ WAL size is about to exceed
<varname>max_wal_size</>.
</para>
Cheers,
Jeff
Attachments:
recovery_max_wal_size.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index 4af8fdc..5860a0b
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** XLogPageRead(XLogReaderState *xlogreader
*** 10946,10952 ****
* Request a restartpoint if we've replayed too much xlog since the
* last one.
*/
! if (StandbyModeRequested && bgwriterLaunched)
{
if (XLogCheckpointNeeded(readSegNo))
{
--- 10946,10952 ----
* Request a restartpoint if we've replayed too much xlog since the
* last one.
*/
! if (bgwriterLaunched)
{
if (XLogCheckpointNeeded(readSegNo))
{
On Thu, May 21, 2015 at 3:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Mar 16, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Feb 23, 2015 at 8:56 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Everyone seems to be happy with the names and behaviour of the GUCs, so
committed.
The docs suggest that max_wal_size will be respected during archive
recovery (causing restartpoints and recycling), but I'm not seeing that
happening. Is this a doc bug or an implementation bug?
I think the old behavior, where restartpoints were driven only by time and
not by volume, was a misfeature. But not a bug, because it was documented.
One of the points of max_wal_size and its predecessor is to limit how big
pg_xlog can grow. But running out of disk space on pg_xlog is no more fun
during archive recovery than it is during normal operations. So why
shouldn't max_wal_size be active during recovery?
The following message of commit 7181530 explains why.
In standby mode, respect checkpoint_segments in addition to
checkpoint_timeout to trigger restartpoints. We used to deliberately only
do time-based restartpoints, because if checkpoint_segments is small we
would spend time doing restartpoints more often than really necessary.
But now that restartpoints are done in bgwriter, they're not as
disruptive as they used to be. Secondly, because streaming replication
stores the streamed WAL files in pg_xlog, we want to clean it up more
often to avoid running out of disk space when checkpoint_timeout is large
and checkpoint_segments small.
Previously users were more likely to fall into this trouble (i.e., too frequent
occurrence of restartpoints) because the default value of checkpoint_segments
was very small, I guess. But we increased the default of max_wal_size, so now
the risk of that trouble seems to be smaller than before, and maybe we can
allow max_wal_size to trigger restartpoints.
Regards,
--
Fujii Masao
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 21 May 2015 at 02:53, Jeff Janes <jeff.janes@gmail.com> wrote:
I think the old behavior, where restartpoints were driven only by time and
not by volume, was a misfeature.
I have no objection to changing that. The main essence of that was to
ensure that a standby could act differently to a master, given different
settings.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, May 21, 2015 at 8:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, May 21, 2015 at 3:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Mar 16, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com>
wrote:
On Mon, Feb 23, 2015 at 8:56 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Everyone seems to be happy with the names and behaviour of the GUCs, so
committed.
The docs suggest that max_wal_size will be respected during archive
recovery (causing restartpoints and recycling), but I'm not seeing that
happening. Is this a doc bug or an implementation bug?
I think the old behavior, where restartpoints were driven only by time
and
not by volume, was a misfeature. But not a bug, because it was
documented.
One of the points of max_wal_size and its predecessor is to limit how big
pg_xlog can grow. But running out of disk space on pg_xlog is no more fun
during archive recovery than it is during normal operations. So why
shouldn't max_wal_size be active during recovery?
The following message of commit 7181530 explains why.
In standby mode, respect checkpoint_segments in addition to
checkpoint_timeout to trigger restartpoints. We used to deliberately
only
do time-based restartpoints, because if checkpoint_segments is small we
would spend time doing restartpoints more often than really necessary.
But now that restartpoints are done in bgwriter, they're not as
disruptive as they used to be. Secondly, because streaming replication
stores the streamed WAL files in pg_xlog, we want to clean it up more
often to avoid running out of disk space when checkpoint_timeout is
large
and checkpoint_segments small.
Previously users were more likely to fall into this trouble (i.e., too
frequent
occurrence of restartpoints) because the default value of
checkpoint_segments
was very small, I guess. But we increased the default of max_wal_size, so
now
the risk of that trouble seems to be smaller than before, and maybe we can
allow max_wal_size to trigger restartpoints.
I see. The old behavior was present for the same reason we decided to split
checkpoint_segments into max_wal_size and min_wal_size.
That is, the default checkpoint_segments was small, and it had to be small
because increasing it would cause more space to be used even when that
extra space was not helpful.
So perhaps we can consider this change a completion of the max_wal_size
work, rather than a new feature?
Cheers,
Jeff
On 05/27/2015 12:26 AM, Jeff Janes wrote:
On Thu, May 21, 2015 at 8:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, May 21, 2015 at 3:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
One of the points of max_wal_size and its predecessor is to limit how big
pg_xlog can grow. But running out of disk space on pg_xlog is no more fun
during archive recovery than it is during normal operations. So why
shouldn't max_wal_size be active during recovery?
The following message of commit 7181530 explains why.
In standby mode, respect checkpoint_segments in addition to
checkpoint_timeout to trigger restartpoints. We used to deliberately
only
do time-based restartpoints, because if checkpoint_segments is small we
would spend time doing restartpoints more often than really necessary.
But now that restartpoints are done in bgwriter, they're not as
disruptive as they used to be. Secondly, because streaming replication
stores the streamed WAL files in pg_xlog, we want to clean it up more
often to avoid running out of disk space when checkpoint_timeout is
large
and checkpoint_segments small.
Previously users were more likely to fall into this trouble (i.e., too
frequent
occurrence of restartpoints) because the default value of
checkpoint_segments
was very small, I guess. But we increased the default of max_wal_size, so
now
the risk of that trouble seems to be smaller than before, and maybe we can
allow max_wal_size to trigger restartpoints.
I see. The old behavior was present for the same reason we decided to split
checkpoint_segments into max_wal_size and min_wal_size.
That is, the default checkpoint_segments was small, and it had to be small
because increasing it would cause more space to be used even when that
extra space was not helpful.
So perhaps we can consider this change a completion of the max_wal_size
work, rather than a new feature?
Yeah, I'm inclined to change the behaviour. Ignoring checkpoint_segments
made sense when we initially did that, but it has gradually become less
and less sensible after that, as we got streaming replication, and as we
started to keep all restored segments in pg_xlog even in archive recovery.
It seems to be a trivial change to implement that, although I might be
overlooking something subtle (pasted below, also attached)

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10946,7 +10946,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
         * Request a restartpoint if we've replayed too much xlog since the
         * last one.
         */
-       if (StandbyModeRequested && bgwriterLaunched)
+       if (bgwriterLaunched)
        {
                if (XLogCheckpointNeeded(readSegNo))
                {

This keeps pg_xlog at about 67% of max_wal_size during archive recovery
(because checkpoint_completion_target is accounted for but goes unused)
Hmm. checkpoint_completion_target is used when determining progress
against checkpoint_timeout just fine, but the problem is that if you do
just the above, IsCheckpointOnSchedule() still won't consider consumed
WAL when it determines whether the restartpoint is "on time". So the
error is in the other direction: if you set max_wal_size to a small
value, and checkpoint_timeout to a large value, the restartpoint would
think that it has plenty of time to complete, and exceed max_wal_size.
We need to fix IsCheckpointOnSchedule() to also track progress against
max_wal_size during recovery.
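To make that concrete with made-up numbers: say max_wal_size = 256MB (16
segments) and checkpoint_timeout = 30min. Replaying a bulk load can chew
through those 16 segments in a minute or two, but a restartpoint paced
only against checkpoint_timeout would still believe it is only a few
percent into its allotted time and keep throttling its writes, so pg_xlog
sails well past max_wal_size. (Illustrative figures only, not
measurements.)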
I came up with the attached patch as a first attempt. It enables the same
logic that decides whether a checkpoint is on schedule to be used during
recovery as well. But there's a little problem (also explained in a
comment in the patch):
There is a large gap between a checkpoint's redo-pointer and the
checkpoint record itself (determined by checkpoint_completion_target).
When we're not in recovery, we set the redo-pointer for the current
checkpoint first, then start flushing data, and finally write the
checkpoint record. The logic in IsCheckpointOnSchedule() calculates a)
how much WAL has been generated since the beginning of the checkpoint,
i.e. its redo-pointer, and b) what fraction of shared_buffers has been
flushed to disk. But in recovery, we only start the restartpoint after
replaying the checkpoint record, so at the beginning of a restartpoint,
we're actually already behind schedule by the amount of WAL between the
redo-pointer and the record itself.
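To put a rough number on that gap (an illustrative back-of-the-envelope
estimate, assuming CheckPointSegments is derived from max_wal_size roughly
as max_wal_size / (2 + checkpoint_completion_target)): with max_wal_size =
1GB, i.e. 64 segments, and checkpoint_completion_target = 0.9,
CheckPointSegments comes out at about 22 segments, so the redo-pointer can
trail the checkpoint record by up to about 0.9 * 22 ≈ 20 segments, or
roughly 320MB of WAL that the restartpoint only learns about after the
fact.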
I'm not sure what to do about this. With the attached patch, you get the
same leisurely pacing with restartpoints as you get with checkpoints,
but you exceed max_wal_size during recovery, by the amount determined by
checkpoint_completion_target. Alternatively, we could try to perform
restartpoints faster than checkpoints, but then you'll get nasty
checkpoint I/O storms in recovery.
A bigger change would be to write a WAL record at the beginning of a
checkpoint. It wouldn't do anything else, but it would be a hint to
recovery that there's going to be a checkpoint record later whose
redo-pointer will point to that record. We could then start the
restartpoint at that record already, before seeing the checkpoint record
itself.
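For what it's worth, a minimal sketch of what that could look like,
assuming a new informational-only record type (the name
XLOG_CHECKPOINT_START, the helper function, and its call site are
hypothetical, not existing code):

/*
 * Hypothetical sketch: emit an empty WAL record in CreateCheckPoint()
 * right when the redo position is chosen, before any buffers are
 * flushed.  It carries no payload; its only purpose is to mark the
 * position that the eventual checkpoint record's redo-pointer will
 * point back to.
 */
static XLogRecPtr
LogCheckpointStart(void)
{
	XLogBeginInsert();
	return XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_START);	/* hypothetical info code */
}

On replay, the startup process could then request a restartpoint as soon
as it sees this record, instead of waiting for the checkpoint record that
follows it, which would remove the late start described above.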
I think the attached is better than nothing, but I'll take a look at
that beginning-of-checkpoint idea. It might be too big a change to do at
this point, but I'd really like to fix this properly for 9.5, since
we've changed the way checkpoints are scheduled anyway.
- Heikki
Attachments:
wip-obey_max_wal_size-in-recovery.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e37ad3..cc90ae6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10961,7 +10961,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * Request a restartpoint if we've replayed too much xlog since the
 	 * last one.
 	 */
-	if (StandbyModeRequested && bgwriterLaunched)
+	if (bgwriterLaunched)
 	{
 		if (XLogCheckpointNeeded(readSegNo))
 		{
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0dce6a8..bc49ee4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -475,10 +475,12 @@ CheckpointerMain(void)
 
 			/*
 			 * Initialize checkpointer-private variables used during
-			 * checkpoint
+			 * checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
+			if (do_restartpoint)
+				ckpt_start_recptr = GetXLogReplayRecPtr(NULL);
+			else
 				ckpt_start_recptr = GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
@@ -720,7 +722,7 @@ CheckpointWriteDelay(int flags, double progress)
 
 /*
  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
- * in time?
+ * (or restartpoint) in time?
  *
  * Compares the current progress against the time/segments elapsed since last
  * checkpoint, and returns true if the progress we've made this far is greater
@@ -757,17 +759,28 @@ IsCheckpointOnSchedule(double progress)
 	 * compares against RedoRecptr, so this is not completely accurate.
 	 * However, it's good enough for our purposes, we're only calculating an
 	 * estimate anyway.
+	 *
+	 * During recovery, we compare last replayed WAL record's location with
+	 * the location computed before calling CreateRestartPoint. That maintains
+	 * the same pacing as we have during checkpoints in normal operation, but
+	 * we might exceed max_wal_size by a fair amount. That's because there can
+	 * be a large gap between a checkpoint's redo-pointer and the checkpoint
+	 * record itself, and we only start the restartpoint after we've seen the
+	 * checkpoint record. (The gap is typically up to CheckPointSegments *
+	 * checkpoint_completion_target where checkpoint_completion_target is the
+	 * value that was in effect when the WAL was generated).
 	 */
-	if (!RecoveryInProgress())
-	{
+	if (RecoveryInProgress())
+		recptr = GetXLogReplayRecPtr(NULL);
+	else
 		recptr = GetInsertRecPtr();
-		elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
+	elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
 
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
+	if (progress < elapsed_xlogs)
+	{
+		elog(LOG, "not on schedule, progress: %f elapsed_xlogs: %f", progress, elapsed_xlogs);
+		ckpt_cached_elapsed = elapsed_xlogs;
+		return false;
 	}
 
 	/*
@@ -779,11 +792,13 @@ IsCheckpointOnSchedule(double progress)
 
 	if (progress < elapsed_time)
 	{
+		elog(LOG, "not on schedule, progress: %f elapsed_xlogs: %f elapsed_time %f", progress, elapsed_xlogs, elapsed_time);
 		ckpt_cached_elapsed = elapsed_time;
 		return false;
 	}
 
 	/* It looks like we're on schedule. */
+	elog(LOG, "on schedule, progress: %f elapsed_xlogs: %f elapsed_time %f", progress, elapsed_xlogs, elapsed_time);
 	return true;
 }
 
On Fri, Jun 26, 2015 at 7:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I'm not sure what to do about this. With the attached patch, you get the
same leisurely pacing with restartpoints as you get with checkpoints, but
you exceed max_wal_size during recovery, by the amount determined by
checkpoint_completion_target. Alternatively, we could try to perform
restartpoints faster than checkpoints, but then you'll get nasty
checkpoint I/O storms in recovery.

A bigger change would be to write a WAL record at the beginning of a
checkpoint. It wouldn't do anything else, but it would be a hint to
recovery that there's going to be a checkpoint record later whose
redo-pointer will point to that record. We could then start the
restartpoint at that record already, before seeing the checkpoint record
itself.

I think the attached is better than nothing, but I'll take a look at that
beginning-of-checkpoint idea. It might be too big a change to do at this
point, but I'd really like to fix this properly for 9.5, since we've
changed the way checkpoints are scheduled anyway.
I agree. Actually, I've seen a number of presentations indicating
that the pacing of checkpoints is already too aggressive near the
beginning, because as soon as we initiate the checkpoint we have a
storm of full page writes. I'm sure we can come up with arbitrarily
complicated systems to compensate for this, but something simple might
be to calculate progress as (done+adjust)/(total+adjust) rather than
done/total. If you let adjust=total/9, for example, then you
essentially start the progress meter at 10% instead of 0%. Even
something that simple might be an improvement.
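A tiny sketch of that remapping, just to make the arithmetic concrete (the
function and variable names here are made up, not proposed code):

/*
 * Remap raw checkpoint progress so the meter effectively starts above
 * zero, absorbing the initial burst of full-page writes.  With
 * adjust = total / 9, the remapped progress starts at 10% when nothing
 * has been written yet, and still reaches 100% at the end.
 */
static double
adjusted_progress(double done, double total)
{
	double		adjust = total / 9.0;

	return (done + adjust) / (total + adjust);
}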
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 06/26/2015 03:40 PM, Robert Haas wrote:
Actually, I've seen a number of presentations indicating
that the pacing of checkpoints is already too aggressive near the
beginning, because as soon as we initiate the checkpoint we have a
storm of full page writes. I'm sure we can come up with arbitrarily
complicated systems to compensate for this, but something simple might
be to calculate progress as (done+adjust)/(total+adjust) rather than
done/total. If you let adjust=total/9, for example, then you
essentially start the progress meter at 10% instead of 0%. Even
something that simple might be an improvement.
Yeah, but that's an unrelated issue. This was most recently discussed at
/messages/by-id/CAKHd5Ce-bnD=gEEdtXiT2_AY7shquTKd0yHXXk5F4zVEKRPX-w@mail.gmail.com.
I posted a simple patch there - review and testing is welcome ;-).
- Heikki
On Fri, Jun 26, 2015 at 9:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 06/26/2015 03:40 PM, Robert Haas wrote:
Actually, I've seen a number of presentations indicating
that the pacing of checkpoints is already too aggressive near the
beginning, because as soon as we initiate the checkpoint we have a
storm of full page writes. I'm sure we can come up with arbitrarily
complicated systems to compensate for this, but something simple might
be to calculate progress as (done+adjust)/(total+adjust) rather than
done/total. If you let adjust=total/9, for example, then you
essentially start the progress meter at 10% instead of 0%. Even
something that simple might be an improvement.

Yeah, but that's an unrelated issue. This was most recently discussed at
/messages/by-id/CAKHd5Ce-bnD=gEEdtXiT2_AY7shquTKd0yHXXk5F4zVEKRPX-w@mail.gmail.com.
I posted a simple patch there - review and testing is welcome ;-).
Ah, thanks for the pointer - I had forgotten about that thread.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 06/26/2015 02:08 PM, Heikki Linnakangas wrote:
I'm not sure what to do about this. With the attached patch, you get the
same leisurely pacing with restartpoints as you get with checkpoints,
but you exceed max_wal_size during recovery, by the amount determined by
checkpoint_completion_target. Alternatively, we could try to perform
restartpoints faster than checkpoints, but then you'll get nasty
checkpoint I/O storms in recovery.
Ok, committed this patch. IMHO it's definitely better than the old
behaviour.
A bigger change would be to write a WAL record at the beginning of a
checkpoint. It wouldn't do anything else, but it would be a hint to
recovery that there's going to be a checkpoint record later whose
redo-pointer will point to that record. We could then start the
restartpoint at that record already, before seeing the checkpoint record
itself.

I think the attached is better than nothing, but I'll take a look at
that beginning-of-checkpoint idea. It might be too big a change to do at
this point, but I'd really like to fix this properly for 9.5, since
we've changed the way checkpoints are scheduled anyway.
This would've been a much more complicated patch, so I dropped that
idea, for 9.5 anyway. Maybe later, but it's not urgent.
- Heikki