Redesigning checkpoint_segments

Started by Heikki Linnakangas · almost 13 years ago · 121 messages · pgsql-hackers
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

checkpoint_segments is awkward. From an admin's point of view, it
controls two things:

1. it limits the amount of disk space needed for pg_xlog. (it's a soft
limit, but still)
2. it limits the time required to recover after a crash.

For limiting the disk space needed for pg_xlog, checkpoint_segments is
awkward because it's defined in terms of 16MB segments between
checkpoints. It takes a fair amount of arithmetic to calculate the disk
space required to hold the specified number of segments. The manual
gives the formula: (2 + checkpoint_completion_target) *
checkpoint_segments + 1, which amounts to about 1GB per 20 segments as a
rule of thumb. We shouldn't impose that calculation on the user. It
should be possible to just specify "checkpoint_segments=512MB", and the
system would initiate checkpoints so that the total size of WAL in
pg_xlog stays below 512MB.
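The arithmetic being imposed on the user can be made concrete with a quick sketch (illustrative Python, not PostgreSQL code; 16MB is the default segment size):

```python
SEGMENT_MB = 16  # default WAL segment size

def max_pg_xlog_mb(checkpoint_segments, checkpoint_completion_target=0.5):
    # The manual's formula gives the peak number of segments in pg_xlog:
    # (2 + checkpoint_completion_target) * checkpoint_segments + 1
    segments = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    return segments * SEGMENT_MB

# 20 segments with the default completion target of 0.5 works out to
# 51 segments, i.e. 816 MB -- the "about 1GB per 20 segments" rule of thumb.
print(max_pg_xlog_mb(20))  # 816.0
```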

For limiting the time required to recover after crash,
checkpoint_segments is awkward because it's difficult to calculate how
long recovery will take, given checkpoint_segments=X. A bulk load can
use up segments really fast, and recovery will be fast, while segments
full of random deletions can need a lot of random I/O to replay, and
take a long time. IMO checkpoint_timeout is a much better way to control
that, although it's not perfect either.

A third point is that even if you have 10 GB of disk space reserved for
WAL, you don't want to actually consume all that 10 GB, if it's not
required to run the database smoothly. There are several reasons for
that: backups based on a filesystem-level snapshot are larger than
necessary, if there are a lot of preallocated WAL segments and in a
virtualized or shared system, there might be other VMs or applications
that could make use of the disk space. On the other hand, you don't want
to run out of disk space while writing WAL - that can lead to a PANIC in
the worst case.

In VMware's vPostgres fork, we've hacked the way that works, so that
there is a new setting, checkpoint_segments_max that can be set by the
user, but checkpoint_segments is adjusted automatically, on the fly. The
system counts how many segments were consumed during the last checkpoint
cycle, and that becomes the checkpoint_segments setting for the next
cycle. That means that in a system with a steady load, checkpoints are
triggered by checkpoint_timeout, and the effective checkpoint_segments
value converges at the exact number of segments needed for that. That's
simple but very effective. It doesn't behave too well with bursty load,
however; during quiet times, checkpoint_segments is dialed way down, and
when the next burst comes along, you get several checkpoints in quick
succession, until checkpoint_segments is dialed back up again.
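That feedback loop, and its weakness under bursty load, can be sketched in a few lines (illustrative Python with made-up workload numbers, not the vPostgres implementation):

```python
def simulate_feedback(segments_used_per_cycle, checkpoint_segments_max):
    """Each cycle's segment consumption becomes the next cycle's
    effective checkpoint_segments, capped at the user-set maximum."""
    setting = checkpoint_segments_max
    history = []
    for used in segments_used_per_cycle:
        history.append(setting)
        setting = min(used, checkpoint_segments_max)
    return history

# Steady load converges on the segments actually needed; after a quiet
# spell the setting is dialed way down, so the next burst (40 segments)
# arrives with checkpoint_segments at 3, forcing several quick checkpoints.
print(simulate_feedback([30, 30, 30, 3, 3, 40], checkpoint_segments_max=64))
# [64, 30, 30, 30, 3, 3]
```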

I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the max. disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are
still some segments left to consume while the checkpoint runs), a
checkpoint is triggered.

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles.

I'll write up a patch to do that, but before I do, does anyone disagree
on those tuning principles? How do you typically tune
checkpoint_segments on your servers? If the system was to tune it
automatically, what formula should it use?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Redesigning checkpoint_segments

On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

checkpoint_segments is awkward. From an admin's point of view, it controls
two things:

1. it limits the amount of disk space needed for pg_xlog. (it's a soft
limit, but still)
2. it limits the time required to recover after a crash.

For limiting the disk space needed for pg_xlog, checkpoint_segments is
awkward because it's defined in terms of 16MB segments between checkpoints.
It takes a fair amount of arithmetic to calculate the disk space required to
hold the specified number of segments. The manual gives the formula: (2 +
checkpoint_completion_target) * checkpoint_segments + 1, which amounts to
about 1GB per 20 segments as a rule of thumb. We shouldn't impose that
calculation on the user. It should be possible to just specify
"checkpoint_segments=512MB", and the system would initiate checkpoints so
that the total size of WAL in pg_xlog stays below 512MB.

For limiting the time required to recover after crash, checkpoint_segments
is awkward because it's difficult to calculate how long recovery will take,
given checkpoint_segments=X. A bulk load can use up segments really fast,
and recovery will be fast, while segments full of random deletions can need
a lot of random I/O to replay, and take a long time. IMO checkpoint_timeout
is a much better way to control that, although it's not perfect either.

A third point is that even if you have 10 GB of disk space reserved for WAL,
you don't want to actually consume all that 10 GB, if it's not required to
run the database smoothly. There are several reasons for that: backups based
on a filesystem-level snapshot are larger than necessary, if there are a lot
of preallocated WAL segments and in a virtualized or shared system, there
might be other VMs or applications that could make use of the disk space. On
the other hand, you don't want to run out of disk space while writing WAL -
that can lead to a PANIC in the worst case.

In VMware's vPostgres fork, we've hacked the way that works, so that there
is a new setting, checkpoint_segments_max that can be set by the user, but
checkpoint_segments is adjusted automatically, on the fly. The system counts
how many segments were consumed during the last checkpoint cycle, and that
becomes the checkpoint_segments setting for the next cycle. That means that
in a system with a steady load, checkpoints are triggered by
checkpoint_timeout, and the effective checkpoint_segments value converges at
the exact number of segments needed for that. That's simple but very
effective. It doesn't behave too well with bursty load, however; during
quiet times, checkpoint_segments is dialed way down, and when the next burst
comes along, you get several checkpoints in quick succession, until
checkpoint_segments is dialed back up again.

I propose that we do something similar, but not exactly the same. Let's have
a setting, max_wal_size, to control the max. disk space reserved for WAL.
Once that's reached (or you get close enough, so that there are still some
segments left to consume while the checkpoint runs), a checkpoint is
triggered.

What if max_wal_size is reached while the checkpoint is running? Should we
change the checkpoint from spread mode to fast mode? Or, if max_wal_size
is a hard limit, should we make the allocation of new WAL files wait until
the checkpoint has finished and removed some old WAL files?

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high, without
actually consuming that much space in normal operation. It's just a
backstop, to avoid completely filling the disk, if there's a sudden burst of
activity. The number of segments preallocated is auto-tuned, based on the
number of segments used in previous checkpoint cycles.

How is wal_keep_segments handled in your approach?

I'll write up a patch to do that, but before I do, does anyone disagree on
those tuning principles?

No at least from me. I like your idea.

Regards,

--
Fujii Masao


#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#2)
Re: Redesigning checkpoint_segments

On 05.06.2013 21:16, Fujii Masao wrote:

On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I propose that we do something similar, but not exactly the same. Let's have
a setting, max_wal_size, to control the max. disk space reserved for WAL.
Once that's reached (or you get close enough, so that there are still some
segments left to consume while the checkpoint runs), a checkpoint is
triggered.

What if max_wal_size is reached while the checkpoint is running? We should
change the checkpoint from spread mode to fast mode?

The checkpoint spreading code already tracks if the checkpoint is "on
schedule", and it takes into account both checkpoint_timeout and
checkpoint_segments. Ie. if you consume segments faster than expected,
the checkpoint will speed up as well. Once checkpoint_segments is
reached, the checkpoint will complete ASAP, with no delays to spread it out.

This would still work the same with max_wal_size. A new checkpoint would
be started well before reaching max_wal_size, so that it has enough time
to complete. If the checkpoint "falls behind", it will hurry up until
it's back on schedule. If max_wal_size is reached anyway, it will
complete ASAP.
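A rough sketch of that scheduling decision (simplified Python, not the actual checkpoint-spreading code; the names and scaling are approximations):

```python
def on_schedule(progress, elapsed_s, checkpoint_timeout_s,
                segments_used, checkpoint_segments):
    """progress: fraction of the checkpoint's buffer writes done (0..1).
    The checkpoint is 'on schedule' only while progress stays ahead of
    both the elapsed fraction of checkpoint_timeout and the fraction of
    checkpoint_segments consumed since the checkpoint started. Once it
    falls behind on either budget, delays between writes are skipped."""
    time_fraction = elapsed_s / checkpoint_timeout_s
    segment_fraction = segments_used / checkpoint_segments
    return progress >= time_fraction and progress >= segment_fraction
```

With max_wal_size the segment budget would simply be derived from the remaining space instead of a segment count; the hurry-up behavior is the same.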

Or, if max_wal_size
is hard limit, we should keep the allocation of new WAL file waiting until
the checkpoint has finished and removed some old WAL files?

I was not thinking of making it a hard limit. It would be just like
checkpoint_segments from that point of view - if a checkpoint takes a
long time, max_wal_size might still be exceeded.

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high, without
actually consuming that much space in normal operation. It's just a
backstop, to avoid completely filling the disk, if there's a sudden burst of
activity. The number of segments preallocated is auto-tuned, based on the
number of segments used in previous checkpoint cycles.

How is wal_keep_segments handled in your approach?

Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments. Perhaps
max_wal_size should include WAL retained for wal_keep_segments, leaving
less room for checkpoints. Ie. when you set wal_keep_segments
higher, an xlog-based checkpoint would be triggered earlier, because the
old segments kept for replication would leave less room for new
segments. And setting wal_keep_segments higher than max_wal_size would
be an error.
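In rough numbers, counting wal_keep_segments against max_wal_size would shift the size-based trigger like this (hypothetical arithmetic sketching the idea, not committed behavior):

```python
def size_trigger_mb(max_wal_size_mb, wal_keep_mb):
    """New WAL the system can write before a size-based checkpoint must
    fire, if segments retained for replication count against the budget."""
    if wal_keep_mb > max_wal_size_mb:
        raise ValueError("wal_keep_segments must not exceed max_wal_size")
    return max_wal_size_mb - wal_keep_mb

# Raising wal_keep from 256 MB to 512 MB under a 1 GB max_wal_size
# moves the trigger from 768 MB of new WAL down to 512 MB.
print(size_trigger_mb(1024, 256), size_trigger_mb(1024, 512))  # 768 512
```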

- Heikki


#4Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Heikki Linnakangas (#3)
Re: Redesigning checkpoint_segments

Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

I was not thinking of making it a hard limit. It would be just
like checkpoint_segments from that point of view - if a
checkpoint takes a long time, max_wal_size might still be
exceeded.

Then I suggest we not use exactly that name.  I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Redesigning checkpoint_segments

On Thu, Jun 6, 2013 at 3:35 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 05.06.2013 21:16, Fujii Masao wrote:

On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I propose that we do something similar, but not exactly the same. Let's
have

a setting, max_wal_size, to control the max. disk space reserved for WAL.
Once that's reached (or you get close enough, so that there are still
some
segments left to consume while the checkpoint runs), a checkpoint is
triggered.

What if max_wal_size is reached while the checkpoint is running? We should
change the checkpoint from spread mode to fast mode?

The checkpoint spreading code already tracks if the checkpoint is "on
schedule", and it takes into account both checkpoint_timeout and
checkpoint_segments. Ie. if you consume segments faster than expected, the
checkpoint will speed up as well. Once checkpoint_segments is reached, the
checkpoint will complete ASAP, with no delays to spread it out.

Yep, right. One problem is that this mechanism doesn't work in the standby.
So, are you planning to 'fix' that so that max_wal_size works well even in
the standby? Or just leave that as it is? According to the remaining part of
your email, you seem to choose the latter, though.

This would still work the same with max_wal_size. A new checkpoint would be
started well before reaching max_wal_size, so that it has enough time to
complete. If the checkpoint "falls behind", it will hurry up until it's back
on schedule. If max_wal_size is reached anyway, it will complete ASAP.

Or, if max_wal_size
is hard limit, we should keep the allocation of new WAL file waiting until
the checkpoint has finished and removed some old WAL files?

I was not thinking of making it a hard limit. It would be just like
checkpoint_segments from that point of view - if a checkpoint takes a long
time, max_wal_size might still be exceeded.

So, if the archive command keeps failing, or is very slow (e.g., because
it uses a compression tool), max_wal_size can still be exceeded by a wide
margin. Right?

I'm wondering if it's worth exposing an option specifying whether or not
to treat max_wal_size as a hard limit. If it's not a hard limit, the disk
can fill up with WAL files and a PANIC can happen. In that case, in order to
restart the database service, we need to enlarge the disk space or relocate
some WAL files to another disk, and then start up the server. The normal
crash recovery needs to be done. This would lead to a lot of service
downtime.

OTOH, if we use max_wal_size as a hard limit, we can avoid such a PANIC
error and long downtime. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks down from a client's perspective, but the downtime is shorter
than in the PANIC case. So I'm thinking that some users might want a hard
limit on pg_xlog size.

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without
actually consuming that much space in normal operation. It's just a
backstop, to avoid completely filling the disk, if there's a sudden burst
of
activity. The number of segments preallocated is auto-tuned, based on the
number of segments used in previous checkpoint cycles.

How is wal_keep_segments handled in your approach?

Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments.

+1

Regards,

--
Fujii Masao


#6Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
Re: Redesigning checkpoint_segments

Heikki,

We shouldn't impose that calculation on the user. It
should be possible to just specify "checkpoint_segments=512MB", and the
system would initiate checkpoints so that the total size of WAL in
pg_xlog stays below 512MB.

Agreed.

For limiting the time required to recover after crash,
checkpoint_segments is awkward because it's difficult to calculate how
long recovery will take, given checkpoint_segments=X. A bulk load can
use up segments really fast, and recovery will be fast, while segments
full of random deletions can need a lot of random I/O to replay, and
take a long time. IMO checkpoint_timeout is a much better way to control
that, although it's not perfect either.

This is true, but I don't see that your proposal changes this at all
(for the better or for the worse).

A third point is that even if you have 10 GB of disk space reserved for
WAL, you don't want to actually consume all that 10 GB, if it's not
required to run the database smoothly.

Agreed.

I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the max. disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are
still some segments left to consume while the checkpoint runs), a
checkpoint is triggered.

Refinement of the proposal:

1. max_wal_size is a hard limit
2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )
to avoid lockup if checkpoint takes longer than expected.
3. wal_keep_segments is taken out of max_wal_size.
a. it automatically defaults to 20% of max_wal_size if
max_wal_senders > 0
b. for that reason, we don't allow it to be larger
than 80% of max_wal_size
4. preallocated WAL isn't allowed to shrink smaller than
wal_keep_segements + (max_wal_size * 0.1).

This would mean that I could set my server to:

max_wal_size = 2GB

and ...

* by default, 26 segments (416MB) would be kept for wal_keep_segments.
* checkpoint target would be 77 segments (1.2GB)
* preallocated WAL will always be at least 39 segments (624MB),
including keep_segments.

now, if I had a fairly low transaction database, but wanted to make sure
I could recover from an 8-hour break in replication, I might bump up
wal_keep_segments to 1GB. In that case:

* 64 segments (1GB) would be kept.
* checkpoints would target 96 segments (1.5GB)
* preallocated WAL would always be at least 77 segments (1.2GB)
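Those worked figures can be reproduced from the rules above (a Python sketch of my reading of the proposal; note I interpret the checkpoint target as wal_keep plus 50% of the remaining headroom, which is what matches the example numbers):

```python
import math

SEGMENT_MB = 16

def segs(mb):
    # round up to whole 16MB segments
    return math.ceil(mb / SEGMENT_MB)

def proposal_numbers(max_wal_mb, keep_mb=None, max_wal_senders=1):
    if keep_mb is None:
        keep_mb = 0.2 * max_wal_mb if max_wal_senders > 0 else 0   # rule 3a
    keep_mb = min(keep_mb, 0.8 * max_wal_mb)                       # rule 3b
    keep_mb = segs(keep_mb) * SEGMENT_MB       # snap to whole segments
    target = segs(keep_mb + 0.5 * (max_wal_mb - keep_mb))          # rule 2
    prealloc_floor = segs(keep_mb + 0.1 * max_wal_mb)              # rule 4
    return segs(keep_mb), target, prealloc_floor

print(proposal_numbers(2048))        # default keep: (26, 77, 39) segments
print(proposal_numbers(2048, 1024))  # keep raised to 1GB: (64, 96, 77)
```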

Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments.

Well, the ideal unit from the user's point of view is *time*, not space.
That is, the user wants the master to keep, say, "8 hours of
transaction logs", not any amount of MB. I don't want to complicate
this proposal by trying to deliver that, though.

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles.

"based on"; can you give me your algorithmic thinking here? I'm
thinking we should have some calculation of last cycle size and peak
cycle size so that bursty workloads aren't compromised.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#7Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#5)
Re: Redesigning checkpoint_segments

On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

OTOH, if we use max_wal_size as a hard limit, we can avoid such PANIC
error and long down time. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks like down from a client, but its down time is shorter than the
PANIC error case. So I'm thinking that some users might want the hard
limit of pg_xlog size.

I wonder if we could tie this in with the recent proposal from the
Heroku guys to have a way to slow down WAL writing. Maybe we have
several limits:

- When limit #1 is passed (or checkpoint_timeout elapses), we start a
spread checkpoint.

- If it looks like we're going to exceed limit #2 before the
checkpoint completes, we attempt to perform the checkpoint more
quickly, by reducing the delay between buffer writes. If we actually
exceed limit #2, we try to complete the checkpoint as fast as
possible.

- If it looks like we're going to exceed limit #3 before the
checkpoint completes, we start exerting back-pressure on writers by
making them wait every time they write WAL, probably in proportion to
the number of bytes written. We keep ratcheting up the wait until
we've slowed down writers enough that the checkpoint will finish within limit #3. As
we reach limit #3, the wait goes to infinity; only read-only
operations can proceed until the checkpoint finishes.
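One way to picture the graduated back-pressure (a hypothetical delay curve in Python; the limit values and the shape of the curve are illustrations, not a worked-out design):

```python
def wal_write_delay_ms(wal_mb, limit2_mb, limit3_mb, base_ms=1.0):
    """No delay until limit #2 is passed; the delay grows without bound
    as WAL size approaches limit #3, where writers stall entirely."""
    if wal_mb <= limit2_mb:
        return 0.0
    if wal_mb >= limit3_mb:
        return float("inf")
    # Ratchet: delay is proportional to how far past limit #2 we are,
    # relative to the room remaining before limit #3.
    return base_ms * (wal_mb - limit2_mb) / (limit3_mb - wal_mb)

print(wal_write_delay_ms(500, 600, 800))   # below limit #2: 0.0
print(wal_write_delay_ms(700, 600, 800))   # halfway: 1.0
print(wal_write_delay_ms(800, 600, 800))   # at limit #3: inf
```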

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8Joshua D. Drake
jd@commandprompt.com
In reply to: Robert Haas (#7)
Re: Redesigning checkpoint_segments

On 06/05/2013 05:37 PM, Robert Haas wrote:

On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

OTOH, if we use max_wal_size as a hard limit, we can avoid such PANIC
error and long down time. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks like down from a client, but its down time is shorter than the
PANIC error case. So I'm thinking that some users might want the hard
limit of pg_xlog size.

I wonder if we could tie this in with the recent proposal from the
Heroku guys to have a way to slow down WAL writing. Maybe we have
several limits:

I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


#9Michael Paquier
michael@paquier.xyz
In reply to: Joshua D. Drake (#8)
Re: Redesigning checkpoint_segments

On Thu, Jun 6, 2013 at 10:00 AM, Joshua D. Drake <jd@commandprompt.com> wrote:

On 06/05/2013 05:37 PM, Robert Haas wrote:

On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

OTOH, if we use max_wal_size as a hard limit, we can avoid such PANIC
error and long down time. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks like down from a client, but its down time is shorter than
the
PANIC error case. So I'm thinking that some users might want the hard
limit of pg_xlog size.

I wonder if we could tie this in with the recent proposal from the
Heroku guys to have a way to slow down WAL writing. Maybe we have
several limits:

I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.

Here it is:
/messages/by-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com
--
Michael

#10Daniel Farina
daniel@heroku.com
In reply to: Joshua D. Drake (#8)
Re: Redesigning checkpoint_segments

On Wed, Jun 5, 2013 at 6:00 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.

It's not as insane as introducing an archiving gap, PANICing and
crashing, or running this hunk o junk I wrote
http://github.com/fdr/ratchet


#11Joshua D. Drake
jd@commandprompt.com
In reply to: Robert Haas (#7)
Re: Redesigning checkpoint_segments

On 06/05/2013 05:37 PM, Robert Haas wrote:

- If it looks like we're going to exceed limit #3 before the
checkpoint completes, we start exerting back-pressure on writers by
making them wait every time they write WAL, probably in proportion to
the number of bytes written. We keep ratcheting up the wait until
we've slowed down writers enough that the checkpoint will finish within limit #3. As
we reach limit #3, the wait goes to infinity; only read-only
operations can proceed until the checkpoint finishes.

Alright, perhaps I am dense. I have read both this thread and the other
one on better handling of archive command
(/messages/by-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com).
I recognize there are brighter minds than mine on this thread but I just
honestly don't get it.

1. WAL writes are already fast. They are the fastest write we have
because it is sequential.

2. We don't want them to be slow. We want data written to disk as
quickly as possible without adversely affecting production. That's the
point.

3. The spread checkpoints have always confused me. If anything we want a
checkpoint to be fast and short because:

4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to be in the range of
never noticed.

Now perhaps my customers workloads are different but for us:

1. Checkpoint timeout is set as high as reasonable, usually 30 minutes
to an hour. I wish I could set them even further out.

2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
based on an actual amount of IO bandwidth it may take per second based
on their IO constraints. (Note I know that wal_writer comes into play
here but I honestly don't remember where and am reading up on it to
refresh my memory).

3. The biggest issue we see with checkpoint segments is not running out
of space, because really... how many checkpoint segments is 10GB? It is
with wal_keep_segments. If we don't want to fill up the pg_xlog
directory, put the WAL segments kept for keep_segments elsewhere.

Other oddities:

Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone. Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.

Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter we should be able to rather indefinitely not worry
about checkpoints (with a few exceptions such as pg_start_backup()).
Perhaps a setting that causes a checkpoint to happen based on some
non-artificial threshold (timeout) such as amount of data currently in
need of a checkpoint?

Heikki said, "I propose that we do something similar, but not exactly
the same. Let's have a setting, max_wal_size, to control the max. disk
space reserved for WAL. Once that's reached (or you get close enough, so
that there are still some segments left to consume while the checkpoint
runs), a checkpoint is triggered.

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles. "

This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC? Perhaps at any given point we save 10% of available
space (within a 16MB calculation) for pg_xlog, you hit it, we checkpoint
and LOG EXACTLY WHY.

Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place
we immediately bounce into read-only mode and log even more loudly that
the SA should be fired. I would think read-only mode is safer and more
polite than a PANIC crash.

I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to
assume that a user will pay attention to disk space. Really?

Open to people telling me I am off in left field. Sorry if it is noise.

Sincerely,

JD



#12Joshua D. Drake
jd@commandprompt.com
In reply to: Daniel Farina (#10)
Re: Redesigning checkpoint_segments

On 06/05/2013 06:23 PM, Daniel Farina wrote:

On Wed, Jun 5, 2013 at 6:00 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.

It's not as insane as introducing an archiving gap, PANICing and
crashing, or running this hunk o junk I wrote
http://github.com/fdr/ratchet

Well certainly we shouldn't PANIC and crash but that is a simple fix.
You have a backup write location and start logging really loudly that
you are using it.

Sincerely,

JD



#13Daniel Farina
daniel@heroku.com
In reply to: Joshua D. Drake (#12)
Re: Redesigning checkpoint_segments

On Wed, Jun 5, 2013 at 8:23 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

It's not as insane as introducing an archiving gap, PANICing and
crashing, or running this hunk o junk I wrote
http://github.com/fdr/ratchet

Well certainly we shouldn't PANIC and crash but that is a simple fix. You
have a backup write location and start logging really loudly that you are
using it.

If I told you there were some of us who would prefer to attenuate the
rate that things get written rather than cancel or delay archiving for
a long period of time, would that explain the framing of the problem?

Or, is it that you understand that's what I want, but find the notion
of such an operation hard to relate to?

Or, am I misunderstanding your confusion?

Or, none of the above?


#14Joshua D. Drake
jd@commandprompt.com
In reply to: Daniel Farina (#13)
Re: Redesigning checkpoint_segments

On 6/5/2013 10:07 PM, Daniel Farina wrote:

If I told you there were some of us who would prefer to attenuate the
rate that things get written rather than cancel or delay archiving for
a long period of time, would that explain the framing of the problem?

I understand that based on what you said above.

Or, is it that you understand that's what I want, but find the notion
of such an operation hard to relate to?

I think this is where I am at. To me, you don't attenuate the rate that
things get written; you fix the problem that creates the need to do so.
The problem is one of provisioning. Please note that I am not suggesting
there aren't improvements to be made, there absolutely are. I just wonder
if we are looking in the right place (outside of some obvious badness
like the PANIC on running out of disk space).

Or, am I misunderstanding your confusion?

To be honest part of my confusion was just trying to parse all the bits
that people were talking about into a cohesive, "this is the actual
problem".

Sincerely,

JD


#15Peter Geoghegan
In reply to: Joshua D. Drake (#14)
Re: Redesigning checkpoint_segments

On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

I just wonder if we are looking in the right place (outside of some obvious
badness like the PANIC running out of disk space).

So you don't think we should PANIC on running out of disk space? If
you don't think we should do that, and you don't think that WAL
writing should be throttled, what's the alternative?

--
Peter Geoghegan


#16Daniel Farina
daniel@heroku.com
In reply to: Joshua D. Drake (#14)
Re: Redesigning checkpoint_segments

On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

On 6/5/2013 10:07 PM, Daniel Farina wrote:

If I told you there were some of us who would prefer to attenuate the
rate that things get written rather than cancel or delay archiving for
a long period of time, would that explain the framing of the problem?

I understand that based on what you said above.

Or, is it that you understand that's what I want, but find the notion
of such an operation hard to relate to?

I think this is where I am at. To me, you don't attenuate the rate that
things get written; you fix the problem that creates the need to do so.
The problem is one of provisioning. Please note that I am not suggesting
there aren't improvements to be made, there absolutely are. I just wonder
if we are looking in the right place (outside of some obvious badness
like the PANIC on running out of disk space).

Okay, well, I don't see the fact that the block device is faster than
the archive command as a "problem," it's just an artifact of the
ratios of performance of stuff in the system. If one views archives
as a must-have, there's not much other choice than to attenuate.

An alternative is to buy a slower block device. That'd accomplish the
same effect, but it's a pretty bizarre and heavy-handed way to go about
it, and not easily adaptive to, say, if I made the archive command
faster (in my case, I well could, with some work).

So, I don't think it's all that unnatural to allow for the flexibility
of a neat attenuation technique, and it's pretty important too.
Methinks. Disagree?

Final thought: I can't really tell users to knock off what they're
doing on a large scale. It's better to not provide abrupt changes in
service (like crashing or turning off everything for extended periods
while the archive uploads). So, smoothness and predictability are
desirable.
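The attenuation Daniel describes could, in principle, take the form of a token bucket whose refill rate tracks measured archive throughput: writers take tokens before emitting WAL, and when the archiver falls behind, the refill rate drops and writers block briefly. A minimal illustrative sketch in Python (the class, names, and rates are hypothetical; nothing here corresponds to actual PostgreSQL internals, which are in C):

```python
import time

class TokenBucket:
    """Toy rate limiter: refills at `rate` bytes/sec, bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # bytes/sec, e.g. measured archive rate
        self.capacity = float(capacity)  # burst allowance in bytes
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def consume(self, nbytes):
        """Block until `nbytes` may be written; chunks must fit the burst."""
        assert nbytes <= self.capacity, "write in chunks no larger than the burst"
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at the burst allowance.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((nbytes - self.tokens) / self.rate)
```

Re-estimating `rate` from recent archive_command completions would make such a limiter adaptive, which speaks to Daniel's point about the archive command being made faster over time.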


#17Joshua D. Drake
jd@commandprompt.com
In reply to: Peter Geoghegan (#15)
Re: Redesigning checkpoint_segments

On 6/5/2013 10:54 PM, Peter Geoghegan wrote:

On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

I just wonder if we are looking in the right place (outside of some obvious
badness like the PANIC running out of disk space).

So you don't think we should PANIC on running out of disk space? If
you don't think we should do that, and you don't think that WAL
writing should be throttled, what's the alternative?

As I mentioned in my previous email:

Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place
we immediately bounce into read-only mode and log even more loudly that
the SA should be fired. I would think read-only mode is safer and more
polite than a PANIC crash.

I do not think we should worry about filling up the hard disk, except to
protect against data loss when it happens. It is not user unfriendly to
assume that a user will pay attention to disk space. Really?
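JD's position assumes the SA is actually watching free space on the WAL volume. A hedged sketch of what that monitoring might look like (illustrative Python; the path, thresholds, and function names are assumptions, not anything PostgreSQL ships):

```python
import shutil

def classify_free_space(free_bytes, total_bytes, warn_frac=0.20, crit_frac=0.05):
    """Map a free-space fraction to a severity level (thresholds are arbitrary)."""
    frac = free_bytes / total_bytes
    if frac < crit_frac:
        return "critical"   # e.g. the point at which to stop writes / go read-only
    if frac < warn_frac:
        return "warning"    # log very loudly, page the SA
    return "ok"

def xlog_space_status(path):
    """Check the volume holding the WAL directory; `path` is site-specific."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.free, usage.total)
```

A cron job or monitoring agent calling `xlog_space_status("/var/lib/postgresql/data/pg_xlog")` (path hypothetical) would give the SA the early warning this argument depends on.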

JD


#18Daniel Farina
daniel@heroku.com
In reply to: Joshua D. Drake (#17)
Re: Redesigning checkpoint_segments

On Wed, Jun 5, 2013 at 11:05 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

On 6/5/2013 10:54 PM, Peter Geoghegan wrote:

On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com>
wrote:

I just wonder if we are looking in the right place (outside of some
obvious
badness like the PANIC running out of disk space).

So you don't think we should PANIC on running out of disk space? If
you don't think we should do that, and you don't think that WAL
writing should be throttled, what's the alternative?

As I mentioned in my previous email:

Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place we
immediately bounce into read-only mode and log even more loudly that the SA
should be fired. I would think read-only mode is safer and more polite than
a PANIC crash.

I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to assume
that a user will pay attention to disk space. Really?

Okay, then I will say it's user unfriendly, especially for a transient
use of space, and particularly if there's no knob for said SA to
attenuate what's going on. You appear to assume the SA can lean on
the application to knock off whatever is going on or provision more
disk in time, or that disk is reliable enough to meet one's goals. In
my case, none of these precepts are true or desirable.


#19Harold Giménez
harold.gimenez@gmail.com
In reply to: Joshua D. Drake (#17)
Re: Redesigning checkpoint_segments

Hi,

On Wed, Jun 5, 2013 at 11:05 PM, Joshua D. Drake <jd@commandprompt.com>wrote:

On 6/5/2013 10:54 PM, Peter Geoghegan wrote:

On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake <jd@commandprompt.com>
wrote:

Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA

This merely buys you some time, but with aggressive and sustained write
throughput you are left in the same spot. Practically speaking it's the
same situation as increasing the pg_xlog disk space.
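Harold's point can be put in numbers: an emergency write location only buys time proportional to the gap between the WAL generation rate and the archive rate. A toy calculation (all figures are illustrative):

```python
def seconds_until_full(extra_bytes, wal_rate, archive_rate):
    """How long an emergency reserve lasts when WAL generation outpaces archiving."""
    backlog_rate = wal_rate - archive_rate  # bytes/sec of net growth
    if backlog_rate <= 0:
        return float("inf")  # the archiver keeps up; the reserve never fills
    return extra_bytes / backlog_rate

# Example: a 10 GB reserve, generating WAL at 50 MB/s, archiving at 30 MB/s.
# The backlog grows at 20 MB/s, so the reserve lasts ~512 seconds.
```

Under sustained load the reserve delays the problem by minutes, not hours, which is why it amounts to the same thing as simply provisioning more pg_xlog space.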

and log very loudly that the SA isn't paying attention. Perhaps if that
area starts to get to an unhappy place we immediately bounce into read-only
mode and log even more loudly that the SA should be fired. I would think
read-only mode is safer and more polite than a PANIC crash.

I agree it is better than PANIC, but read-only mode is definitely also a
form of throttling; a much more abrupt and unfriendly one if I may add.

Regards,

-Harold

#20Joshua D. Drake
jd@commandprompt.com
In reply to: Daniel Farina (#18)
Re: Redesigning checkpoint_segments

On 6/5/2013 11:09 PM, Daniel Farina wrote:

Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place we
immediately bounce into read-only mode and log even more loudly that the SA
should be fired. I would think read-only mode is safer and more polite than
a PANIC crash.

I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to assume
that a user will pay attention to disk space. Really?
Okay, then I will say it's user unfriendly, especially for a transient
use of space, and particularly if there's no knob for said SA to
attenuate what's going on. You appear to assume the SA can lean on
the application to knock off whatever is going on or provision more
disk in time, or that disk is reliable enough to meet one's goals. In
my case, none of these precepts are true or desirable.

I have zero doubt that in your case it is true and desirable. I just
don't know that it is a positive solution to the problem as a whole.
Your case is rather limited to your environment, which is rather limited
to the type of user that your environment has, which lends itself to the
idea that this should be a Heroku Postgres thing, not a .Org-wide thing.

Sincerely,

JD


#21Peter Geoghegan
In reply to: Joshua D. Drake (#20)
#22Joshua D. Drake
jd@commandprompt.com
In reply to: Harold Giménez (#19)
#23Joshua D. Drake
jd@commandprompt.com
In reply to: Peter Geoghegan (#21)
#24Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joshua D. Drake (#11)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#6)
#26Joshua D. Drake
jd@commandprompt.com
In reply to: Heikki Linnakangas (#24)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joshua D. Drake (#26)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kevin Grittner (#4)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#5)
#30Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Heikki Linnakangas (#28)
#31Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kevin Grittner (#30)
#32Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#33Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#34Jeff Janes
jeff.janes@gmail.com
In reply to: Joshua D. Drake (#11)
#35Jeff Janes
jeff.janes@gmail.com
In reply to: Joshua D. Drake (#26)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#33)
#37Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#38Greg Smith
gsmith@gregsmith.com
In reply to: Joshua D. Drake (#26)
#39Greg Smith
gsmith@gregsmith.com
In reply to: Heikki Linnakangas (#25)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#39)
#41Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#41)
#43Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#42)
#44Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#40)
#45Craig Ringer
craig@2ndquadrant.com
In reply to: Josh Berkus (#32)
#46Craig Ringer
craig@2ndquadrant.com
In reply to: Joshua D. Drake (#23)
#47Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#36)
#48Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#49Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Eisentraut (#47)
#50Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#49)
#52Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#49)
#53Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#50)
#54Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Kapila (#51)
#55Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#56Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#55)
#57Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#58Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#57)
#59Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#59)
#61Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#60)
#62Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#61)
#63Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#53)
#64Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#56)
#65Venkata B Nagothi
nag1010@gmail.com
In reply to: Heikki Linnakangas (#53)
#66Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Venkata B Nagothi (#65)
#67Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#66)
#68Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#67)
#69Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#68)
#70Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#67)
#71Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Heikki Linnakangas (#69)
#72Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#70)
#73Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#72)
#74Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#73)
#75Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#74)
#76Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#77Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#76)
#78Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#79Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#78)
#80Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#81David Steele
david@pgmasters.net
In reply to: Robert Haas (#79)
#82Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: David Steele (#81)
#83Amit Kapila
amit.kapila16@gmail.com
In reply to: Josh Berkus (#80)
#84Venkata B Nagothi
nag1010@gmail.com
In reply to: Heikki Linnakangas (#1)
#85Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#80)
#86Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#85)
#87Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#86)
#88Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#89Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#88)
#90Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#91Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#89)
#92Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#91)
#93Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#90)
#94Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
#95David Steele
david@pgmasters.net
In reply to: Josh Berkus (#94)
#96Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#80)
#97Petr Jelinek
petr@2ndquadrant.com
In reply to: Heikki Linnakangas (#96)
#98Venkata B Nagothi
nag1010@gmail.com
In reply to: Petr Jelinek (#97)
#99Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#97)
#100Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#99)
#101Venkata B Nagothi
nag1010@gmail.com
In reply to: Heikki Linnakangas (#96)
#102Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#99)
#103Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#102)
#104Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#76)
#105Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#104)
#106Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#105)
#107Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#106)
#108Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#80)
#109Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#108)
#110Stephen Frost
sfrost@snowman.net
In reply to: Heikki Linnakangas (#109)
#111Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#80)
#112Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#103)
#113Jeff Janes
jeff.janes@gmail.com
In reply to: Jeff Janes (#112)
#114Fujii Masao
masao.fujii@gmail.com
In reply to: Jeff Janes (#113)
#115Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#113)
#116Jeff Janes
jeff.janes@gmail.com
In reply to: Fujii Masao (#114)
#117Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Janes (#116)
#118Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#117)
#119Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#118)
#120Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#119)
#121Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#117)