BUG #5946: Long exclusive lock taken by vacuum (not full)
The following bug has been logged online:
Bug reference: 5946
Logged by: Maxim Boguk
Email address: Maxim.Boguk@gmail.com
PostgreSQL version: 8.4
Operating system: Linux
Description: Long exclusive lock taken by vacuum (not full)
Details:
From the documentation I know that vacuum (without full) can truncate empty
pages from the end of a relation if they are free and vacuum successfully
grabs an exclusive lock for a short time.
However, I wasn't ready to learn that this 'short exclusive lock' can be
10-20 minutes in some cases.
In my case vacuum tried to truncate the last 10-15GB of a 100GB relation,
and each of the three times it cost 10+ minutes of service downtime (because
that table was completely locked).
Is that correct behaviour? Is there any way to speed up that process, or at
least allow read-only queries during that time?
PS: no excessive disk IO was observed during those 10+ minute locks.
Excerpts from Maxim Boguk's message of vie mar 25 05:56:41 -0300 2011:
From the documentation I know that vacuum (without full) can truncate empty
pages from the end of a relation if they are free and vacuum successfully
grabs an exclusive lock for a short time.
However, I wasn't ready to learn that this 'short exclusive lock' can be
10-20 minutes in some cases.
In my case vacuum tried to truncate the last 10-15GB of a 100GB relation,
and each of the three times it cost 10+ minutes of service downtime (because
that table was completely locked).
Is that correct behaviour? Is there any way to speed up that process, or at
least allow read-only queries during that time?
PS: no excessive disk IO was observed during those 10+ minute locks.
I think you may be using a version prior to a fix we made to that code to
have it avoid sleeping due to vacuum_cost_delay. That shouldn't happen in
8.4, though, because the fix went in before that release.
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master Release: REL8_3_0 [21c27af65] 2007-09-10 17:58:45 +0000
Branch: REL8_2_STABLE Release: REL8_2_5 [053731ab0] 2007-09-10 17:58:50 +0000
Branch: REL8_1_STABLE Release: REL8_1_10 [e52f4ec32] 2007-09-10 17:58:56 +0000
Branch: REL8_0_STABLE Release: REL8_0_14 [a44103519] 2007-09-10 17:59:03 +0000
Remove the vacuum_delay_point call in count_nondeletable_pages, because we hold
an exclusive lock on the table at this point, which we want to release as soon
as possible. This is called in the phase of lazy vacuum where we truncate the
empty pages at the end of the table.
An alternative solution would be to lower the vacuum delay settings before
starting the truncating phase, but this doesn't work very well in autovacuum
due to the autobalancing code (which can cause other processes to change our
cost delay settings). This case could be considered in the balancing code, but
it is simpler this way.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"Maxim Boguk" <Maxim.Boguk@gmail.com> writes:
In my case vacuum tried to truncate the last 10-15GB of a 100GB relation,
and each of the three times it cost 10+ minutes of service downtime (because
that table was completely locked).
Is that correct behaviour? Is there any way to speed up that process, or at
least allow read-only queries during that time?
Use autovacuum --- if there's something that wants to access the table,
autovac will get kicked off the lock. (Of course, the table may never
get truncated then, but maybe you don't care.)
regards, tom lane
On Fri, Mar 25, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Maxim Boguk" <Maxim.Boguk@gmail.com> writes:
In my case vacuum tried to truncate the last 10-15GB of a 100GB relation,
and each of the three times it cost 10+ minutes of service downtime (because
that table was completely locked).
Is that correct behaviour? Is there any way to speed up that process, or at
least allow read-only queries during that time?
Use autovacuum --- if there's something that wants to access the table,
autovac will get kicked off the lock. (Of course, the table may never
get truncated then, but maybe you don't care.)
I could see this turning out somewhat unhappily...
If there's almost always something that wants to access the table,
which would be hardly surprising, in view that the table is being
updated with sufficient regularity that it's got 10GB+ of dead space
at the end, then you're eventually going to *HAVE* to vacuum it.
Transaction rollover, and what not.
The piper will need to get paid, eventually :-(.
So I don't think leaving it to autovacuum quite suffices, because
eventually there's a 10 minute "glut", or more, that needs to get
paid.
--
http://linuxfinances.info/info/linuxdistributions.html
Christopher Browne <cbbrowne@gmail.com> writes:
On Fri, Mar 25, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Use autovacuum --- if there's something that wants to access the table,
autovac will get kicked off the lock. (Of course, the table may never
get truncated then, but maybe you don't care.)
I could see this turning out somewhat unhappily...
If there's almost always something that wants to access the table,
which would be hardly surprising, in view that the table is being
updated with sufficient regularity that it's got 10GB+ of dead space
at the end, then you're eventually going to *HAVE* to vacuum it.
Transaction rollover, and what not.
It is already vacuumed. The only remaining step is to release free
space back to the OS; and if he's got a lot of churn in the table,
it's not entirely clear that that's a productive activity anyway.
regards, tom lane
Excerpts from Tom Lane's message of vie mar 25 13:10:21 -0300 2011:
Christopher Browne <cbbrowne@gmail.com> writes:
If there's almost always something that wants to access the table,
which would be hardly surprising, in view that the table is being
updated with sufficient regularity that it's got 10GB+ of dead space
at the end, then you're eventually going to *HAVE* to vacuum it.
Transaction rollover, and what not.
It is already vacuumed. The only remaining step is to release free
space back to the OS; and if he's got a lot of churn in the table,
it's not entirely clear that that's a productive activity anyway.
Right: future new tuples could require extending the table, which would
turn out not to be necessary because these pages at the end of the table
are there and registered in the FSM. Truncating the table could turn
out to be counterproductive because you would allocate a new block
shortly afterward, causing filesystem fragmentation.
What happened to the idea of preallocating tables to a size defined by
the user at CREATE TABLE time, and extending it in chunks instead of a
block at a time, with the idea of reducing fragmentation? Was it
rejected, or just not implemented?
The other possibly useful idea was to let this truncation step be run
as a separate activity, without having to complete a vacuum cycle on the
table.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
What happened to the idea of preallocating tables to a size defined by
the user at CREATE TABLE time, and extending it in chunks instead of a
block at a time, with the idea of reducing fragmentation? Was it
rejected, or just not implemented?
I don't recall any particular discussion of making the user contend with
that. My thought would be to do something like enlarging the table by
10% anytime we need to extend it. The main issue here is where to cause
that to happen. An individual backend that just wants to store one more
tuple probably shouldn't have to do that much work.
The other possibly useful idea was to let this truncation step be run
as a separate activity, without having to complete a vacuum cycle on the
table.
Yeah. I did think of a small problem with just letting autovac get
blown off the lock: it will fail to send the stats messages about what
it's done, which will likely result in a near-immediate launch of a new
autovacuum against the table. It'd be better to complete the autovacuum
and then try to do the truncation in an independent transaction.
regards, tom lane
On Fri, Mar 25, 2011 at 7:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't recall any particular discussion of making the user contend with
that. My thought would be to do something like enlarging the table by
10% anytime we need to extend it. The main issue here is where to cause
that to happen. An individual backend that just wants to store one more
tuple probably shouldn't have to do that much work.
Just for reference this is how Oracle *used* to behave. It was widely
hated and led to all sorts of problems. Best practice was to pick a
reasonable size for your tablespace and pre-allocate that size and set
future increments to be that size with 0% growth.
Otherwise, the problem with growing by 10% is that it's hard for a DBA to
know how much space headroom he needs on the drive. The database might
grow at any time by a hard-to-predict amount of space which isn't
proportional or connected in any way to the usage. If your database
starts out small and you load a terabyte into it, then by the time it's
full that 10% exponential growth is 90GB, and adding one more row might
trigger that allocation at any time.
And at the same time the first few hundred gigabytes are still
fragmented and allocated in small chunks, because the initial load
started slowly.
It was considered much better to get predictable behaviour by setting
the initial extent size to something like 1GB and then setting it to
grow in 1GB increments with 0% percentage growth.
--
greg
Greg Stark <gsstark@mit.edu> writes:
On Fri, Mar 25, 2011 at 7:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't recall any particular discussion of making the user contend with
that. My thought would be to do something like enlarging the table by
10% anytime we need to extend it.
Just for reference this is how Oracle *used* to behave. It was widely
hated and led to all sorts of problems. Best practice was to pick a
reasonable size for your tablespace and pre-allocate that size and set
future increments to be that size with 0% growth.
Interesting, but I don't understand/believe your argument as to why this
is a bad idea or fixed-size extents are better. It sounds to me just
like the typical Oracle DBA compulsion to have a knob to twiddle. A
self-adjusting enlargement behavior seems smarter all round.
regards, tom lane
On Fri, Mar 25, 2011 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Interesting, but I don't understand/believe your argument as to why this
is a bad idea or fixed-size extents are better. It sounds to me just
like the typical Oracle DBA compulsion to have a knob to twiddle. A
self-adjusting enlargement behavior seems smarter all round.
So is it ok for inserting one row to cause my table to grow by 90GB?
Or should there be some maximum size increment at which it stops
growing? What should that maximum be? What if I'm on a big raid system
where that size doesn't even add a block to every stripe element?
Say you start with 64k (8 pg blocks). That means your growth
increments will be 64k, 70k, 77k, 85k, 94k, 103k, 113k, 125k, 137k,
...
I'm having trouble imagining a set of hardware and filesystem where
growing a table by 125k will be optimal. The next allocation will have
to do some or all of a) go back and edit the previous one to round it
up, then b) add 128k more, then c) still have 6k more to allocate in a
new allocation.
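That sequence is just compound 10% growth rounded to whole kilobytes; a quick sketch (Python, purely illustrative — this is arithmetic for the discussion, not anything from the PostgreSQL source) reproduces it:

```python
def growth_sizes(start_kb=64.0, steps=9, rate=0.10):
    """Successive table sizes (in kB) under a grow-by-10% policy,
    starting from 64kB (8 PostgreSQL blocks), rounded for display."""
    sizes = []
    size = start_kb
    for _ in range(steps):
        sizes.append(round(size))
        size *= 1 + rate  # enlarge by 10% each time the table fills up
    return sizes

print(growth_sizes())  # the 64k, 70k, 77k, ... sequence above
```

The same curve at the terabyte end is where the earlier 90GB figure comes from: 10% of a ~900GB table is a single 90GB allocation.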
--
greg
Greg Stark <gsstark@mit.edu> writes:
On Fri, Mar 25, 2011 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Interesting, but I don't understand/believe your argument as to why this
is a bad idea or fixed-size extents are better. It sounds to me just
like the typical Oracle DBA compulsion to have a knob to twiddle. A
self-adjusting enlargement behavior seems smarter all round.
So is it ok for inserting one row to cause my table to grow by 90GB?
If the table is already several TB, why not? The whole point here is
that it's very unlikely that you're not going to be inserting more rows
pretty soon.
Or should there be some maximum size increment at which it stops
growing? What should that maximum be? What if I'm on a big raid system
where that size doesn't even add a block to every stripe element?
Say you start with 64k (8 pg blocks). That means your growth
increments will be 64k, 70k, 77k, 85k, 94k, 103k, 113k, 125k, 137k,
...
I have no problem with trying to be smart about allocating in powers of
2, not allocating more than X at a time, etc etc. I'm just questioning
the idea that the user should be bothered with this, or is likely to be
smarter than the system about such things. Particularly if you believe
that this problem actually justifies attention to such details. I think
you've already demonstrated that a simplistic fixed-size allocation
parameter probably *isn't* good enough.
regards, tom lane
On Sat, Mar 26, 2011 at 4:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Maxim Boguk" <Maxim.Boguk@gmail.com> writes:
In my case vacuum tried to truncate the last 10-15GB of a 100GB relation,
and each of the three times it cost 10+ minutes of service downtime (because
that table was completely locked).
Is that correct behaviour? Is there any way to speed up that process, or at
least allow read-only queries during that time?
Use autovacuum --- if there's something that wants to access the table,
autovac will get kicked off the lock. (Of course, the table may never
get truncated then, but maybe you don't care.)
regards, tom lane
Thank you for the idea.
Can having lots of empty pages at the end of the table have any negative
impact on database performance (assuming I have plenty of free disk space)?
In my case this 100GB table is going to be reduced to 20GB of actual data
located at the start of the table, so I worry about the possible negative
impact of having an extra 80GB of free space at the end of the table.
Regards,
Maxim
On Fri, Mar 25, 2011 at 5:09 PM, Greg Stark <gsstark@mit.edu> wrote:
On Fri, Mar 25, 2011 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Interesting, but I don't understand/believe your argument as to why this
is a bad idea or fixed-size extents are better. It sounds to me just
like the typical Oracle DBA compulsion to have a knob to twiddle. A
self-adjusting enlargement behavior seems smarter all round.
So is it ok for inserting one row to cause my table to grow by 90GB?
Or should there be some maximum size increment at which it stops
growing? What should that maximum be? What if I'm on a big raid system
where that size doesn't even add a block to every stripe element?
Say you start with 64k (8 pg blocks). That means your growth
increments will be 64k, 70k, 77k, 85k, 94k, 103k, 113k, 125k, 137k,
...
I'm having trouble imagining a set of hardware and filesystem where
growing a table by 125k will be optimal. The next allocation will have
to do some or all of a) go back and edit the previous one to round it
up, then b) add 128k more, then c) still have 6k more to allocate in a
new allocation.
This is certainly a converse of the problem actually being pointed at
by the bug.
The bug indicates a situation where the table already has an enormous
pile of free space, all of it already sitting at the very end.
There's *at least* 1GB of space free, and in the case spoken of, there
was 10GB free.
The point of the exercise isn't to allocate new space - it is to
*deallocate* the huge quantity of dead space at the end of the table,
without blocking anybody unnecessarily. I foresee that being somewhat
troublesome, mostly in view that stuff is still going on concurrently,
though it seems pretty plausible that one might *somewhat* safely
"fast-track" removal of all but the first of those empty extensions.
What seems natural-ish to me might include:
- Stomping a bit on the FSM replacement to make sure nobody's going to
be writing to the later extensions;
- Watching free space during the process so the "first" extension gets
re-opened up if the free space in the much earlier parts of the table
(e.g. - that are not planned to be dropped off) is running out.
--
http://linuxfinances.info/info/linuxdistributions.html
Christopher Browne <cbbrowne@gmail.com> writes:
What seems natural-ish to me might include:
- Stomping a bit on the FSM replacement to make sure nobody's going to
be writing to the later extensions;
- Watching free space during the process so the "first" extension gets
re-opened up if the free space in the much earlier parts of the table
(e.g. - that are not planned to be dropped off) is running out.
You seem to be thinking only about the possibility that somebody would
try to write a new tuple into the space-to-be-freed. The problem that
necessitates use of AccessExclusiveLock is that somebody could be doing
a seqscan that tries to *read* the blocks that are about to be truncated
away. We can't really improve matters much here unless we think of a
way to fix that. It would be okay if the scan just ignored blocks it
failed to read, but how do you distinguish the case from a filesystem
error that really should be reported?
regards, tom lane
On Fri, Mar 25, 2011 at 4:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <gsstark@mit.edu> writes:
On Fri, Mar 25, 2011 at 7:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't recall any particular discussion of making the user contend with
that. My thought would be to do something like enlarging the table by
10% anytime we need to extend it.
Just for reference this is how Oracle *used* to behave. It was widely
hated and led to all sorts of problems. Best practice was to pick a
reasonable size for your tablespace and pre-allocate that size and set
future increments to be that size with 0% growth.
Interesting, but I don't understand/believe your argument as to why this
is a bad idea or fixed-size extents are better. It sounds to me just
like the typical Oracle DBA compulsion to have a knob to twiddle. A
self-adjusting enlargement behavior seems smarter all round.
I think we've had a number of pieces of evidence that suggest that
extending 8kB at a time is too costly, but I agree with Greg that the
idea of extending an arbitrarily large table by 10% at a time is
pretty frightening - that could involve allocating a gigantic amount
of space on a big table. I would be inclined to do something like
extend by 10% of table or 1MB, whichever is smaller. If the main goal
is to avoid extending the file many times, this will reduce the number
of physical file extensions on a table > 10MB in size by more than 99%
compared to the current code, which ought to be enough to eliminate
this as an issue. And a 1MB extension is probably also small enough
that we can do it in the foreground without too much of a hiccup.
Now, there's a second effect here that's worth thinking about:
allocating in really big chunks might help the filesystem lay out the
space more contiguously on disk. But I'd want to see some careful
testing that reveals exactly what is needed to get that benefit before
we go too crazy with it.
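To get a feel for the numbers, here is a rough sketch (Python, illustrative only — the capped policy below is just the figures from this message, not PostgreSQL code) counting extension calls under the current one-block-at-a-time behavior versus a min(10% of table, 1MB) policy:

```python
BLOCK = 8 * 1024        # one 8kB PostgreSQL block
CAP = 1024 * 1024       # the 1MB ceiling suggested above

def extensions(target_bytes, policy):
    """Count how many extension calls it takes to grow a file to target."""
    size = count = 0
    while size < target_bytes:
        size += policy(size)
        count += 1
    return count

def one_block(size):
    return BLOCK                                 # current behavior

def capped_ten_percent(size):
    return max(BLOCK, min(size // 10, CAP))      # 10% of table, or 1MB

target = 1024 ** 3                               # a 1GB table
a = extensions(target, one_block)
b = extensions(target, capped_ten_percent)
print(a, b)   # the capped policy needs under 1% as many extensions
```

For a 1GB table the one-block policy makes 131072 extension calls, while the capped policy makes on the order of a thousand — consistent with the >99% reduction claimed above.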
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 25, 2011 at 5:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Christopher Browne <cbbrowne@gmail.com> writes:
What seems natural-ish to me might include:
- Stomping a bit on the FSM replacement to make sure nobody's going to
be writing to the later extensions;
- Watching free space during the process so the "first" extension gets
re-opened up if the free space in the much earlier parts of the table
(e.g. - that are not planned to be dropped off) is running out.
You seem to be thinking only about the possibility that somebody would
try to write a new tuple into the space-to-be-freed. The problem that
necessitates use of AccessExclusiveLock is that somebody could be doing
a seqscan that tries to *read* the blocks that are about to be truncated
away. We can't really improve matters much here unless we think of a
way to fix that. It would be okay if the scan just ignored blocks it
failed to read, but how do you distinguish the case from a filesystem
error that really should be reported?
It's struck me a number of times that it would make some things
simpler if we were able to maintain some state in shared memory about
the tables people were using - for example, in this case, we could
cache the table size, or the fact that vacuum has just truncated away
N blocks, or, uh, something. *waves hands* But it's hard to know how
such an area could reasonably be sized or managed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I think we've had a number of pieces of evidence that suggest that
extending 8kB at a time is too costly, but I agree with Greg that the
idea of extending an arbitrarily large table by 10% at a time is
pretty frightening - that could involve allocating a gigantic amount
of space on a big table. I would be inclined to do something like
extend by 10% of table or 1MB, whichever is smaller.
Sure, something like that sounds sane, though the precise numbers
need some validation.
... And a 1MB extension is probably also small enough
that we can do it in the foreground without too much of a hiccup.
Less than convinced about this.
regards, tom lane
"Too large" is likely more an issue of available resources than of
absolute figures.
On a petabyte of free storage I would not mind allocating some terabytes
when extending a (large) table.
If I'm left with only some MB, I'd be concerned for sure.
I still prefer an approach that will "just work", without much fiddling
of all kind of knobs.
I'd see the following points:
- There is a minimum allocation size below which it is unreasonable
/inefficient to do allocations
- doing allocations sized by the current table size honors the
assumption that a large table will grow further
(and thus somewhat improves this growth pattern)
- large growth is "frightening" - largely (my assumption) due to
unwanted behavior toward the end of available space
So what seems to help is twofold:
- support readjusting the allocation size to smaller units in case an
intended allocation could not be satisfied, while still allowing the
minimum required space to be claimed
- allow allocated but unused space to be reclaimed
(It is perfectly OK to have all of my "unused" disk space allocated to
a large table that just happens not to be using it,
if it can still be used later for some smaller table as soon as
that table is in need of space.)
Allocation should also take into account the amount of space left.
This likely is something to be determined per tablespace.
From that, allocation might work like:
a) try to get x% of the currently allocated amount for the object
b) but not more than y% of the free space in the related tablespace
c) and never less than a necessary minimum (to limit overhead costs)
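One naive reading of rules a)-c) as code (a sketch only; x, y, and the minimum are made-up placeholder values, and this sketch lets rule c) override rule b) when free space runs very low):

```python
def next_allocation(current_size, free_space, x=0.10, y=0.50, minimum=1 << 20):
    """Pick the next extension size (bytes) for an object in a tablespace."""
    want = current_size * x            # a) x% of the current allocation
    want = min(want, free_space * y)   # b) capped by y% of tablespace free space
    return int(max(want, minimum))     # c) but never below a fixed minimum
```

For a 100MB table in a tablespace with 10GB free this yields a 10MB extension; for a 1MB table, rule c) lifts the answer up to the 1MB minimum.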
Rainer
Excerpts from Alvaro Herrera's message of lun mar 28 07:26:49 -0300 2011:
"Too large" is likely more an issue of available resources than of
absolute figures.
On a petabyte of free storage I would not mind allocating some terabytes
when extending a (large) table.
If I'm left with only some MB, I'd be concerned for sure.
...
Does anybody have an idea just W-T-F happened here? I did NOT send the
above email (as evidenced by it being signed by "Rainer"). I notice it
even has a "@commandprompt.com" message-id. Should I start signing my
email?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Rainer, any idea? Please see
http://archives.postgresql.org/message-id/4D906269.6060109@commandprompt.com