Hard limit on WAL space used (because PANIC sucks)
In the "Redesigning checkpoint_segments" thread, many people opined that
there should be a hard limit on the amount of disk space used for WAL:
/messages/by-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6S4-PPd+6jiy4OQ78ihUA@mail.gmail.com.
I'm starting a new thread on that, because that's mostly orthogonal to
redesigning checkpoint_segments.
The current situation is that if you run out of disk space while writing
WAL, you get a PANIC, and the server shuts down. That's awful. We can
try to avoid that by checkpointing early enough, so that we can remove
old WAL segments to make room for new ones before you run out, but
unless we somehow throttle or stop new WAL insertions, it's always going
to be possible to use up all disk space. A typical scenario where that
happens is when archive_command fails for some reason; even a checkpoint
can't remove old, unarchived segments in that case. But it can happen
even without WAL archiving.
I've seen a case where it was even worse than a PANIC and shutdown.
pg_xlog was on a separate partition that had nothing else on it. The
partition filled up, and the system shut down with a PANIC. Because
there was no space left, it could not even write the checkpoint after
recovery, and thus refused to start up again. There was nothing else on
the partition that you could delete to make space. The only recourse
would've been to add more disk space to the partition (impossible), or
manually delete an old WAL file that was not needed to recover from the
latest checkpoint (scary). Fortunately this was a test system, so we
just deleted everything.
So we need to somehow stop new WAL insertions from happening, before
it's too late.
Peter Geoghegan suggested one method here:
/messages/by-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com.
I don't think that exact proposal is going to work very well; throttling
WAL flushing by holding WALWriteLock in WAL writer can have knock-on
effects on the whole system, as Robert Haas mentioned. Also, it'd still
be possible to run out of space, just more difficult.
To make sure there is enough room for the checkpoint to finish, other
WAL insertions have to stop some time before you completely run out of
disk space. The question is how to do that.
A naive idea is to check if there's enough preallocated WAL space, just
before inserting the WAL record. However, it's too late to check that in
XLogInsert; once you get there, you're already holding exclusive locks
on data pages, and you are in a critical section so you can't back out.
At that point, you have to write the WAL record quickly, or the whole
system will suffer. So we need to act earlier.
A more workable idea is to sprinkle checks in higher-level code, before
you hold any critical locks, to check that there is enough preallocated
WAL. Like, at the beginning of heap_insert, heap_update, etc., and all
similar indexam entry points. I propose that we maintain a WAL
reservation system in shared memory. First of all, keep track of how
much preallocated WAL there is left (and try to create more if needed).
Also keep track of a different number: the amount of WAL pre-reserved
for future insertions. Before entering the critical section, increase
the reserved number with a conservative estimate (ie. high enough) of
how much WAL space you need, and check that there is still enough
preallocated WAL to satisfy all the reservations. If not, throw an error
or sleep until there is. After you're done with the insertion, release
the reservation by decreasing the number again.
A shared reservation counter like that could become a point of
contention. One optimization is to keep a constant reservation of, say, 32
KB for each backend. That's enough for most operations. Change the logic
so that you check if you've exceeded the reserved amount of space
*after* writing the WAL record, while you're holding WALInsertLock
anyway. If you do go over the limit, set a flag in backend-private
memory indicating that the *next* time you're about to enter a critical
section where you will write a WAL record, you check again if more space
has been made available.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote:
A more workable idea is to sprinkle checks in higher-level code, before you
hold any critical locks, to check that there is enough preallocated WAL.
Like, at the beginning of heap_insert, heap_update, etc., and all similar
indexam entry points. I propose that we maintain a WAL reservation system in
shared memory.
I am rather doubtful that this won't end up with a bunch of complex code
that won't prevent the situation in all circumstances but which will
provide bugs/performance problems for some time.
Obviously that's just gut feeling since I haven't seen the code...
I am much more excited about getting the soft limit case right and then
seeing how many problems remain in reality.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06.06.2013 17:17, Andres Freund wrote:
On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote:
A more workable idea is to sprinkle checks in higher-level code, before you
hold any critical locks, to check that there is enough preallocated WAL.
Like, at the beginning of heap_insert, heap_update, etc., and all similar
indexam entry points. I propose that we maintain a WAL reservation system in
shared memory.
I am rather doubtful that this won't end up with a bunch of complex code
that won't prevent the situation in all circumstances but which will
provide bugs/performance problems for some time.
Obviously that's just gut feeling since I haven't seen the code...
I also have a feeling that we'll likely miss some corner cases in the
first cut, so that you can still run out of disk space if you try hard
enough / are unlucky. But I think it would still be a big improvement if
it only catches, say, 90% of the cases.
I think it can be made fairly robust otherwise, and the performance
impact should be pretty easy to measure with e.g. pgbench.
- Heikki
* Heikki Linnakangas wrote:
The current situation is that if you run out of disk space while writing
WAL, you get a PANIC, and the server shuts down. That's awful. We can
So we need to somehow stop new WAL insertions from happening, before
it's too late.
A naive idea is to check if there's enough preallocated WAL space, just
before inserting the WAL record. However, it's too late to check that in
There is a database engine, Microsoft's "Jet Blue" aka the Extensible
Storage Engine, that just keeps some preallocated log files around,
specifically so it can get consistent and halt cleanly if it runs out of
disk space.
In other words, the idea is not to check over and over again that there
is enough already-reserved WAL space, but to make sure there always is
by having a preallocated segment that is never used outside a disk space
emergency.
--
Christian
On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote:
* Heikki Linnakangas wrote:
The current situation is that if you run out of disk space while writing
WAL, you get a PANIC, and the server shuts down. That's awful. We can
So we need to somehow stop new WAL insertions from happening, before
it's too late.
A naive idea is to check if there's enough preallocated WAL space, just
before inserting the WAL record. However, it's too late to check that in
There is a database engine, Microsoft's "Jet Blue" aka the Extensible
Storage Engine, that just keeps some preallocated log files around,
specifically so it can get consistent and halt cleanly if it runs out of
disk space.
In other words, the idea is not to check over and over again that there is
enough already-reserved WAL space, but to make sure there always is by
having a preallocated segment that is never used outside a disk space
emergency.
That's not a bad technique. I wonder how reliable it would be in
postgres. Do all filesystems allow a rename() to succeed if there isn't
actually any space left? E.g. on btrfs I wouldn't be sure. We need to
rename because WAL files need to be named after the LSN and timeline id...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 6, 2013 at 10:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
That's not a bad technique. I wonder how reliable it would be in
postgres. Do all filesystems allow a rename() to succeed if there isn't
actually any space left? E.g. on btrfs I wouldn't be sure. We need to
rename because WAL files need to be named after the LSN and timeline id...
I suppose we could just always do the rename at the same time as
setting up the current log file. That is, when we start wal log x also
set up wal file x+1 at that time.
This isn't actually guaranteed to be enough btw. It's possible that
the record we're actively about to write will require all of both
those files... But that should be very unlikely.
--
greg
Let's talk failure cases.
There's actually three potential failure cases here:
- One Volume: WAL is on the same volume as PGDATA, and that volume is
completely out of space.
- XLog Partition: WAL is on its own partition/volume, and fills it up.
- Archiving: archiving is failing or too slow, causing the disk to fill
up with waiting log segments.
I'll argue that these three cases need to be dealt with in three
different ways, and no single solution is going to work for all three.
Archiving
---------
In some ways, this is the simplest case. Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage. Once we stop attempting to archive, we can
clean up the unneeded log segments.
What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind. Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.
XLog Partition
--------------
As Heikki pointed out, a full dedicated WAL drive is hard to fix once
it gets full, since there's nothing you can safely delete to clear
space, even enough for a checkpoint record.
On the other hand, it should be easy to prevent full status; we could
simply force a non-spread checkpoint whenever the available WAL space
gets 90% full. We'd also probably want to be prepared to switch to a
read-only mode if we get full enough that there's only room for the
checkpoint records.
One Volume
----------
This is the most complicated case, because we wouldn't necessarily run
out of space because of WAL using it up. Anything could cause us to run
out of disk space, including activity logs, swapping, pgsql_tmp files,
database growth, or some other process which writes files.
This means that the DBA getting out of disk-full manually is in some
ways easier; there's usually stuff she can delete. However, it's much
harder -- maybe impossible -- for PostgreSQL to prevent this kind of
space outage. There should be things we can do to make it easier for
the DBA to troubleshoot this, but I'm not sure what.
We could use a hard limit for WAL to prevent WAL from contributing to
out-of-space, but that'll only prevent a minority of cases.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Thu, Jun 6, 2013 at 4:28 PM, Christian Ullrich <chris@chrullrich.net> wrote:
* Heikki Linnakangas wrote:
The current situation is that if you run out of disk space while writing
WAL, you get a PANIC, and the server shuts down. That's awful. We can
So we need to somehow stop new WAL insertions from happening, before
it's too late.
A naive idea is to check if there's enough preallocated WAL space, just
before inserting the WAL record. However, it's too late to check that in
There is a database engine, Microsoft's "Jet Blue" aka the Extensible
Storage Engine, that just keeps some preallocated log files around,
specifically so it can get consistent and halt cleanly if it runs out of
disk space.
fwiw, informix (at least until IDS 2000, not sure after that) had the
same thing. only this was a parameter to set, and bad things happened
if you forgot about it :D
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training
Phone: +593 4 5107566 Cell: +593 987171157
On Thursday, June 6, 2013, Josh Berkus wrote:
Let's talk failure cases.
There's actually three potential failure cases here:
- One Volume: WAL is on the same volume as PGDATA, and that volume is
completely out of space.
- XLog Partition: WAL is on its own partition/volume, and fills it up.
- Archiving: archiving is failing or too slow, causing the disk to fill
up with waiting log segments.
I'll argue that these three cases need to be dealt with in three
different ways, and no single solution is going to work for all three.
Archiving
---------
In some ways, this is the simplest case. Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage. Once we stop attempting to archive, we can
clean up the unneeded log segments.
I would oppose that as the solution, either an unconditional one, or
configurable with it as the default. Those segments are not unneeded.
I need them. That is why I set up archiving in the first place. If you
need to shut down the database rather than violate my established retention
policy, then shut down the database.
What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind. Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.
The archive command can be made a shell script (or for that matter a compiled
program) which can do anything it wants upon failure, including emailing
people. Of course maybe whatever causes the archive to fail will also
cause the delivery of the message to fail, but I don't see a real solution
to this that doesn't start down an infinite regress. If it is not failing
outright, but merely falling behind, then I don't really know how to go
about detecting that, either in archive_command, or through tailing the
PostgreSQL log. I guess archive_command, each time it is invoked, could
count the files in the pg_xlog directory and warn if it thinks the number
is unreasonable.
XLog Partition
--------------
As Heikki pointed out, a full dedicated WAL drive is hard to fix once
it gets full, since there's nothing you can safely delete to clear
space, even enough for a checkpoint record.
Although the DBA probably wouldn't know it from reading the manual, it is
almost always safe to delete the oldest WAL file (after copying it to a
different partition just in case something goes wrong--it should be
possible to do that, since if WAL is on its own partition, it is hard to
imagine you can't scrounge up 16MB on a different one), as PostgreSQL keeps
two complete checkpoints worth of WAL around. I think the only reason you
would not be able to recover after removing the oldest file is if the
controldata file is damaged such that the most recent checkpoint record
cannot be found and so it has to fall back to the previous one. Or at
least, this is my understanding.
On the other hand, it should be easy to prevent full status; we could
simply force a non-spread checkpoint whenever the available WAL space
gets 90% full. We'd also probably want to be prepared to switch to a
read-only mode if we get full enough that there's only room for the
checkpoint records.
I think that that last sentence could also be applied without modification
to the "one volume" case as well.
So what would that look like? Before accepting a (non-checkpoint) WAL
Insert that fills up the current segment to a high enough level that a
checkpoint record will no longer fit, it must first verify that a recycled
file exists, or if not it must successfully init a new file.
If that init fails, then it must do what? Signal for a checkpoint, release
its locks, and then ERROR out? That would be better than a PANIC, but can
it do better? Enter a retry loop so that once the checkpoint has finished,
and assuming it has freed up enough WAL files for recycling/removal, it
can try the original WAL insert again?
Cheers,
Jeff
On 06/06/2013 09:30 PM, Jeff Janes wrote:
Archiving
---------
In some ways, this is the simplest case. Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage. Once we stop attempting to archive, we can
clean up the unneeded log segments.
I would oppose that as the solution, either an unconditional one, or
configurable with it as the default. Those segments are not
unneeded. I need them. That is why I set up archiving in the first
place. If you need to shut down the database rather than violate my
established retention policy, then shut down the database.
Agreed and I would oppose it even as configurable. We set up the
archiving for a reason. I do think it might be useful to be able to
store archiving logs as well as wal_keep_segments logs in a different
location than pg_xlog.
What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind. Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.
The archive command can be made a shell script (or for that matter a
compiled program) which can do anything it wants upon failure, including
emailing people.
Yep, that is what PITRTools does. You can make it do whatever you want.
JD
On Thu, Jun 6, 2013 at 9:30 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
I would oppose that as the solution, either an unconditional one, or
configurable with is it as the default. Those segments are not unneeded. I
need them. That is why I set up archiving in the first place. If you need
to shut down the database rather than violate my established retention
policy, then shut down the database.
Same boat. My archives are the real storage. The disks are
write-back caching. That's because the storage of my archives is
probably three to five orders of magnitude more reliable.
On 07.06.2013 00:38, Andres Freund wrote:
On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote:
* Heikki Linnakangas wrote:
The current situation is that if you run out of disk space while writing
WAL, you get a PANIC, and the server shuts down. That's awful. We can
So we need to somehow stop new WAL insertions from happening, before
it's too late.
A naive idea is to check if there's enough preallocated WAL space, just
before inserting the WAL record. However, it's too late to check that in
There is a database engine, Microsoft's "Jet Blue" aka the Extensible
Storage Engine, that just keeps some preallocated log files around,
specifically so it can get consistent and halt cleanly if it runs out of
disk space.
In other words, the idea is not to check over and over again that there is
enough already-reserved WAL space, but to make sure there always is by
having a preallocated segment that is never used outside a disk space
emergency.
That's not a bad technique. I wonder how reliable it would be in
postgres.
That's no different from just having a bit more WAL space in the first
place. We need a mechanism to stop backends from writing WAL, before you
run out of it completely. It doesn't matter if the reservation is done
by stashing away a WAL segment for emergency use, or by a variable in
shared memory. Either way, backends need to stop using it up, by
blocking or throwing an error before they enter the critical section.
I guess you could use the stashed away segment to ensure that you can
recover after PANIC. At recovery, there are no other backends that could
use up the emergency segment. But that's not much better than what we
have now.
- Heikki
On 6 June 2013 16:25:29 -0700, Josh Berkus <josh@agliodbs.com> wrote:
Archiving
---------
In some ways, this is the simplest case. Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage. Once we stop attempting to archive, we can
clean up the unneeded log segments.
What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind. Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.
Slightly OT, but I always wondered whether we could create a function, say
pg_last_xlog_removed()
for example, returning a value suitable to be used to calculate the
distance to the current position. An increasing value could be used to
instruct monitoring to throw a warning if a certain threshold is exceeded.
I've also seen people creating monitoring scripts by looking into
archive_status and doing simple counts on the .ready files, giving a
warning if that exceeds an expected maximum value.
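For illustration, such a .ready-file check might look like the following sketch (the directory path, function names, and threshold are placeholders, not anything the server provides):

```shell
# Sketch of a monitoring check: count .ready files waiting in
# pg_xlog/archive_status and warn past a threshold.  Directory
# and threshold are hypothetical examples.
count_ready() {
    ls "$1" 2>/dev/null | grep -c '\.ready$'
}

check_archive_backlog() {
    statusdir="$1"
    max_ready="$2"
    n=$(count_ready "$statusdir")
    if [ "$n" -gt "$max_ready" ]; then
        echo "WARNING: $n WAL segments waiting to be archived in $statusdir" >&2
        return 1
    fi
    return 0
}

# Example: check_archive_backlog /var/lib/pgsql/data/pg_xlog/archive_status 20
```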
I haven't looked at the code very deeply, but I think we already store
the position of the last removed xlog in shared memory; maybe this can
be used somehow. Afaik, we do cleanup only during checkpoints, so this
all has too much delay...
--
Thanks
Bernd
On 06.06.2013 17:00, Heikki Linnakangas wrote:
A more workable idea is to sprinkle checks in higher-level code, before
you hold any critical locks, to check that there is enough preallocated
WAL. Like, at the beginning of heap_insert, heap_update, etc., and all
similar indexam entry points.
Actually, there's one place that catches most of these: LockBuffer(...,
BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always
grab an exclusive lock on a page first, before entering the critical
section where you call XLogInsert.
That leaves a few miscellaneous XLogInsert calls that need to be
guarded, but it leaves a lot less room for bugs of omission, and keeps
the code cleaner.
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 06.06.2013 17:00, Heikki Linnakangas wrote:
A more workable idea is to sprinkle checks in higher-level code, before
you hold any critical locks, to check that there is enough preallocated
WAL. Like, at the beginning of heap_insert, heap_update, etc., and all
similar indexam entry points.
Actually, there's one place that catches most of these: LockBuffer(...,
BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always
grab an exclusive lock on a page first, before entering the critical
section where you call XLogInsert.
Not only is that a horrible layering/modularity violation, but surely
LockBuffer can have no idea how much WAL space will be needed.
regards, tom lane
On 07.06.2013 19:33, Tom Lane wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> writes:
On 06.06.2013 17:00, Heikki Linnakangas wrote:
A more workable idea is to sprinkle checks in higher-level code, before
you hold any critical locks, to check that there is enough preallocated
WAL. Like, at the beginning of heap_insert, heap_update, etc., and all
similar indexam entry points.
Actually, there's one place that catches most of these: LockBuffer(...,
BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always
grab an exclusive lock on a page first, before entering the critical
section where you call XLogInsert.
Not only is that a horrible layering/modularity violation, but surely
LockBuffer can have no idea how much WAL space will be needed.
It can be just a conservative guess, like, 32KB. That should be enough
for almost all WAL-logged operations. The only exception that comes to
mind is a commit record, which can be arbitrarily large, when you have a
lot of subtransactions or dropped/created relations.
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 07.06.2013 19:33, Tom Lane wrote:
Not only is that a horrible layering/modularity violation, but surely
LockBuffer can have no idea how much WAL space will be needed.
It can be just a conservative guess, like, 32KB. That should be enough
for almost all WAL-logged operations. The only exception that comes to
mind is a commit record, which can be arbitrarily large, when you have a
lot of subtransactions or dropped/created relations.
What happens when several updates are occurring concurrently?
regards, tom lane
I would oppose that as the solution, either an unconditional one, or
configurable with is it as the default. Those segments are not
unneeded. I need them. That is why I set up archiving in the first
place. If you need to shut down the database rather than violate my
established retention policy, then shut down the database.
Agreed and I would oppose it even as configurable. We set up the
archiving for a reason. I do think it might be useful to be able to
store archiving logs as well as wal_keep_segments logs in a different
location than pg_xlog.
People have different configurations. Most of my clients use archiving
for backup or replication; they would rather have archiving cease (and
send a CRITICAL alert) than have the master go offline. That's pretty
common, probably more common than the "if I don't have redundancy shut
down" case.
Certainly anyone making the decision that their master database should
shut down rather than cease archiving should make it *consciously*,
instead of finding out the hard way.
The archive command can be made a shell script (or that matter a
compiled program) which can do anything it wants upon failure, including
emailing people.
You're talking about using external tools -- frequently hackish,
workaround ones -- to handle something which PostgreSQL should be doing
itself, and which only the database engine has full knowledge of. While
that's the only solution we have for now, it's hardly a worthy design goal.
Right now, what we're telling users is "You can have continuous backup
with Postgres, but you'd better hire an expensive consultant to set it
up for you, or use this external tool of dubious provenance which
there's no packages for, or you might accidentally cause your database
to shut down in the middle of the night."
At which point most sensible users say "no thanks, I'll use something else".
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus <josh@agliodbs.com> wrote:
Right now, what we're telling users is "You can have continuous backup
with Postgres, but you'd better hire an expensive consultant to set it
up for you, or use this external tool of dubious provenance which
there's no packages for, or you might accidentally cause your database
to shut down in the middle of the night."
Inverted and just as well supported: "if you want to not accidentally
lose data, you better hire an expensive consultant to check your
systems for all sorts of default 'safety = off' features." This
being but the hypothetical first one.
Furthermore, I see no reason why high quality external archiving
software cannot exist. Maybe some even exists already, and no doubt
they can be improved and the contract with Postgres enriched to that
purpose.
Contrast: JSON, where the stable OID in the core distribution helps
pragmatically punt on a particularly sticky problem (extension
dependencies and non-system OIDs), I can't think of a reason an
external archiver couldn't do its job well right now.
At which point most sensible users say "no thanks, I'll use something else".
Oh, I lost some disks, well, no big deal, I'll use the archives. Surprise!
<forensic analysis ensues>
So, as it turns out, it has been dropping segments at times because of
systemic backlog for months/years.
Alternative ending:
Hey, I restored the database.
<later> Why is the state so old? Why are customers getting warnings
that their (thought paid) invoices are overdue? Oh crap, the restore
was cut short by this stupid option and this database lives in the
past!
Fin.
I have a clear bias in experience here, but I can't relate to someone
who sets up archives but is totally okay losing a segment unceremoniously,
because it only takes one of those once in a while to make a really,
really bad day. Who is this person that lackadaisically archives, and
are they just fooling themselves? And where are these archivers that
enjoy even a modicum of long-term success that are not reliable? If
one wants to casually drop archives, how is someone going to find out
and freak out a bit? Per experience, logs are pretty clearly
hazardous to the purpose.
Basically, I think the default that opts one into danger is not good,
especially since the system is starting from a position of "do too
much stuff and you'll crash."
Finally, it's not that hard to teach any archiver how to no-op at
user-peril; or perhaps Postgres could learn a way to do this expressly,
to standardize the procedure a bit and ease publicly shared recipes.
From: "Daniel Farina" <daniel@heroku.com>
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus <josh@agliodbs.com> wrote:
Right now, what we're telling users is "You can have continuous backup
with Postgres, but you'd better hire an expensive consultant to set it
up for you, or use this external tool of dubious provenance which
there's no packages for, or you might accidentally cause your database
to shut down in the middle of the night."
At which point most sensible users say "no thanks, I'll use something
else".
Inverted and just as well supported: "if you want to not accidentally
lose data, you better hire an expensive consultant to check your
systems for all sorts of default 'safety = off' features." This
being but the hypothetical first one.Furthermore, I see no reason why high quality external archiving
software cannot exist. Maybe some even exists already, and no doubt
they can be improved and the contract with Postgres enriched to that
purpose.Finally, it's not that hard to teach any archiver how to no-op at
user-peril, or perhaps Postgres can learn a way to do this expressly
to standardize the procedure a bit to ease publicly shared recipes,
perhaps.
Yes, I feel designing reliable archiving, even for the simplest case
(copying WAL to disk), is very difficult. I know there are the following
three problems if you just follow the PostgreSQL manual. Average users
won't notice them. I guess even professional DBAs migrating from other
DBMSs won't, either.
1. If the machine or postgres crashes while archive_command is copying a WAL
file, later archive recovery fails.
This is because cp leaves a file of less than 16MB in the archive area, and
postgres refuses to start when it finds such a small archived WAL file.
The solution, which IIRC Tomas-san told me here, is to do something like
"cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f".
2. archive_command dumps core when you run pg_ctl stop -mi.
This is because the postmaster sends SIGQUIT to all its descendants. The
core files accumulate in the data directory, which will be backed up with
the database. Of course those core files are garbage.
The archive_command script needs to catch SIGQUIT and exit.
3. You cannot know the reason for an archive_command failure (e.g. archive
area full) if you don't use PostgreSQL's server logging.
This is because archive_command failure is not logged in syslog/eventlog.
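Putting problems 1 and 2 together, an archive_command wrapper might look like the sketch below (the archive directory, the ARCHIVEDIR variable, and the function name are all placeholders; an untested illustration, not a recommended recipe):

```shell
#!/bin/sh
# Hypothetical archive_command wrapper addressing problems 1 and 2 above.
# A real installation would invoke it as:
#   archive_command = '/path/to/archive_wal.sh %p %f'

# Problem 2: exit quietly on SIGQUIT (sent by pg_ctl stop -mi) instead
# of dumping core into the data directory.
trap 'exit 1' QUIT

archive_one() {
    src="$1"                            # %p: path to the WAL segment
    fname="$2"                          # %f: file name only
    dest="${ARCHIVEDIR:-/archive/dir}"  # placeholder archive location

    # Problem 1: copy under a temporary name, then rename, so a crash
    # mid-copy never leaves a truncated segment under the final name.
    cp "$src" "$dest/$fname.tmp" &&
        mv "$dest/$fname.tmp" "$dest/$fname"
}

# A real script would end with:  archive_one "$1" "$2"
```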
I hope PostgreSQL will provide a reliable archiving facility that is ready
to use.
Regards
MauMau