Slow Vacuum (was: vacuum output question)
To follow up on an old thread that I started - I had a customer whose
system was taking a very long time to run manual vacuums. I was seeing
output like this:
INFO: "cpe": found 95498 removable, 18757 nonremovable row versions
in 11117 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 280173 unused item pointers.
0 pages are entirely empty.
CPU 5.35s/0.99u sec elapsed 724.38 sec.
Then, running vacuum again immediately afterward, on a system that was
basically idle, would result in nearly the same amount of time to
vacuum the table.
Getting a copy of the database from the customer, loading it onto my
own Postgres system, and running the vacuum resulted in runs that
took less than a second (as expected).
General opinion was that it was a disk-io problem. We rebooted the
system, and magically, the problem went away.
As tends to be the case with "magically" fixed problems, this one is
back. Now I have a different customer running Postgres 8.1 on Fedora
Core 6, and their system produced the log snippet above.
hdparm shows disk I/O throughput as perfectly normal on their system.
Everything else on the system seems to be working correctly. The
reboot solution doesn't help. Vacuum still runs painfully slow.
I'm still waiting for a copy of their postgresql config file, but I'm
guessing that almost everything was left on the default settings.
I don't suppose anyone knows of any bugs between Postgres
8.1 and older Linux kernels that led to behavior like this? I can't
really just ask them to upgrade their OS and/or Postgres on the hunch
that the problem should go away.... And I haven't yet been able to
reproduce anything like this on a system that I have control over.
Thanks for any ideas.
Dan
On Tue, Dec 30, 2008 at 9:10 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> To follow up on an old thread that I started - I had a customer whose
> system was taking a very long time to run manual vacuums. I was seeing
> output like this:
>
> INFO: "cpe": found 95498 removable, 18757 nonremovable row versions
> in 11117 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 280173 unused item pointers.
> 0 pages are entirely empty.
> CPU 5.35s/0.99u sec elapsed 724.38 sec.
>
> Then, running vacuum again immediately afterward, on a system that was
> basically idle, would result in nearly the same amount of time to
> vacuum the table.
You do realize that except for the end of a table, vacuum recovers no
actual space, just makes it available for new tuples to move in there.
So it makes sense that the second vacuum would take just as long, or
nearly so.
CLUSTER, or VACUUM FULL followed by REINDEX, will regain the lost space.
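For example, a minimal sketch of both approaches against the "cpe" table
from the log above (the table and index names are taken from this thread;
note that on 8.1 the CLUSTER syntax is "CLUSTER index ON table", and both
approaches take exclusive locks, so run them in a maintenance window):

    -- Option 1: rewrite the table in index order; also rebuilds its indexes
    CLUSTER pk_cpe ON cpe;

    -- Option 2: compact the table in place, then rebuild its bloated indexes
    VACUUM FULL cpe;
    REINDEX TABLE cpe;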
> Getting a copy of the database from the customer, loading it onto my
> own Postgres system, and running the vacuum resulted in runs that
> took less than a second (as expected).
>
> General opinion was that it was a disk-io problem. We rebooted the
> system, and magically, the problem went away.
Hard to say. Have them run
vmstat 1
while vacuuming and see what bi/bo look like.
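For example (a sketch; the exact column layout varies a bit between
procps versions):

    # sample system-wide I/O once per second while the slow vacuum runs
    vmstat 1
    # bi/bo are blocks read from / written to disk per second; wa is the
    # percentage of CPU time spent stalled waiting on I/O

If bi/bo stay tiny while wa is pegged, the drive is struggling to service
requests rather than being asked to do too much.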
INFO: "cpe": found 95498 removable, 18757 nonremovable row versions
in 11117 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 280173 unused item pointers.
0 pages are entirely empty.
CPU 5.35s/0.99u sec elapsed 724.38 sec.Then, running vacuum again immediately afterword, on a system that was
basically idle, would result in nearly the same amount of time to
vacuum the table.You do realize that except for the end of a table, vacuum recovers no
actual space, just makes it available for new tuples to move in there.
So it makes sense that the second vacuum would take just as long, or
nearly so.
Yep. The real issue is that it took 724 seconds instead of, say, half
a second - like it does on my system. I wasn't sure if I should
expect vacuum to take longer to run when it finds a large number of
tuples that it needs to make available, so I just have them run it
twice so I can easily compare the time that it takes with essentially
no work to do.
> Hard to say. Have them run
> vmstat 1
> while vacuuming and see what bi/bo look like.
Haven't looked at that yet on this particular system. Last time, on
different hardware, when this occurred the vmstat 'wa' column showed
very large values while vacuum was running. I don't recall what the
bi/bo columns indicated.
top also showed very high load averages while vacuum was running - but
basically no CPU use.
Are there any common tools that could do a better disk benchmark than
hdparm -Tt?
Thanks,
Dan
INFO: "cpe": found 95498 removable, 18757 nonremovable row versions
in 11117 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 280173 unused item pointers.
0 pages are entirely empty.
CPU 5.35s/0.99u sec elapsed 724.38 sec.How many idle transactions are there?
Not sure. I don't expect that there were many; their system isn't
very highly loaded compared to most of our customers' systems.
The way that they reported the problem to us was that if they enable
autovacuum, whenever it runs (about four times an hour) their system
stops processing the things it needs to process for several minutes,
due to table lock contention.
Their workaround had been to run a daily autovacuum at the lowest load
time of day, to cause the least disruption.
On Tue, Dec 30, 2008 at 9:32 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> Haven't looked at that yet on this particular system. Last time, on
> different hardware, when this occurred the vmstat 'wa' column showed
> very large values while vacuum was running. I don't recall what the
> bi/bo columns indicated.
Definitely sounds like poor I/O.
> top also showed very high load averages while vacuum was running - but
> basically no CPU use.
Yeah, load is the number of things running or waiting to run. If
vacuum is sucking up all the I/O, and this machine doesn't have much I/O
capability, then it's quite possible for other processes stuck behind
it to crank up the load factor.
Also, were there any vacuum cost delay settings over 0 on this machine
when the test was run?
> Are there any common tools that could do a better disk benchmark than
> hdparm -Tt?
Keep in mind, hdparm hits the drive directly, not through the
filesystem. I use bonnie++ or iozone to test I/O.
On Tue, Dec 30, 2008 at 9:47 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Keep in mind, hdparm hits the drive directly, not through the
> filesystem. I use bonnie++ or iozone to test I/O.
Also dd and vmstat together.
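For example, a crude sequential-throughput check (a sketch; the path is
a placeholder, and the test file should be at least twice RAM so the read
pass isn't served from the page cache):

    # sequential write test on the same filesystem as the Postgres data
    dd if=/dev/zero of=/tmp/ddtest bs=8k count=1000000
    # sequential read test
    dd if=/tmp/ddtest of=/dev/null bs=8k
    rm /tmp/ddtest

Run "vmstat 1" in another terminal at the same time and compare its bi/bo
numbers to what you see while a slow vacuum is running.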
On Tue, Dec 30, 2008 at 10:37:04AM -0600, Dan Armbrust wrote:
> The way that they reported the problem to us was that if they enable
> autovacuum, whenever it runs (about four times an hour) their system
> stops processing the things it needs to process for several minutes,
> due to table lock contention.
If that's true, there are a couple of possibilities for why. One of them
is that autovacuum needs tuning. Another is that certain tables
aren't really good candidates for autovacuum (<8.3, this is pretty
common). Another is that they're starved for I/O; you need to check
that when processing and vacuum is happening, not the rest of the
time. Another is that they're doing something like explicit locking
or very large numbers of UPDATEs in one transaction, and vacuum is
just highlighting a problem with what they're doing.
> Their workaround had been to run a daily autovacuum at the lowest load
> time of day, to cause the least disruption.
What is a "daily autovacuum"? It sounds like some tables just need
vacuuming more often. If they find that the system is not responsive
during that, it tells us that they need more disk bandwidth or that
they need to integrate vacuuming some tables with their program.
A
--
Andrew Sullivan
ajs@crankycanuck.ca
>> Their workaround had been to run a daily autovacuum at the lowest load
>> time of day, to cause the least disruption.
>
> What is a "daily autovacuum"? It sounds like some tables just need
> vacuuming more often. If they find that the system is not responsive
> during that, it tells us that they need more disk bandwidth or that
> they need to integrate vacuuming some tables with their program.
Sorry, I meant a daily manual vacuum.
On paper, their hardware is plenty fast for their workload. Out of
hundreds of sites, all running the same software putting load on the
database, this is only the second time we have seen this odd
behaviour of very slow vacuums.
I guess I was hoping that someone would be able to chime in and say -
yes, in such-and-such version, we fixed an obscure bug that sometimes
caused huge slowdowns, perhaps when combined with certain Linux
kernels. It was a nice dream anyway :)
iozone looks useful. I'll see if I can get on their system and do
some proper benchmarks.
Thanks,
Dan
On Tue, Dec 30, 2008 at 10:14 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> On paper, their hardware is plenty fast for their workload. Out of
> hundreds of sites, all running the same software putting load on the
> database, this is only the second time we have seen this odd
> behaviour of very slow vacuums.
>
> I guess I was hoping that someone would be able to chime in and say -
> yes, in such-and-such version, we fixed an obscure bug that sometimes
> caused huge slowdowns, perhaps when combined with certain Linux
> kernels. It was a nice dream anyway :)
This actually kinda sounds like one of those RAID cards that every so
often decides to check the consistency of the RAID array, and when
it's doing that, the whole system slows to a crawl.
Here is an interesting new datapoint.
Modern Ubuntu distro - PostgreSQL 8.1. SATA drive. No RAID. We cannot
reproduce the slow vacuum performance - vacuums take less than a second
for the whole database.
Reinstall the OS - Fedora Core 6 - PostgreSQL 8.1. Push data through
PostgreSQL for a couple of hours (same as above) and now vacuum reports
this:
INFO: vacuuming "public.cpe"
INFO: index "pk_cpe" now contains 50048 row versions in 2328 pages
DETAIL: 415925 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.05s/0.33u sec elapsed 0.94 sec.
INFO: index "ix_cpe_ispid" now contains 50090 row versions in 1338 pages
DETAIL: 415925 index row versions were removed.
953 index pages have been deleted, 0 are currently reusable.
CPU 0.27s/0.22u sec elapsed 8.93 sec.
INFO: index "ix_cpe_enable" now contains 50676 row versions in 1637 pages
DETAIL: 415925 index row versions were removed.
1161 index pages have been deleted, 0 are currently reusable.
CPU 0.45s/0.31u sec elapsed 14.01 sec.
INFO: "cpe": removed 415925 row versions in 10844 pages
DETAIL: CPU 1.48s/0.25u sec elapsed 35.86 sec.
INFO: "cpe": found 415925 removable, 50003 nonremovable row versions
in 10849 pages
DETAIL: 6 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 2.40s/1.18u sec elapsed 61.13 sec.
It took 61 seconds to vacuum, and the number of index row versions
removed was huge.
We then issued a reindex command for the entire database - and now the
vacuum times are back down under a second.
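(For reference, a sketch of the commands involved - "mydb" is a
placeholder for the actual database name, and REINDEX takes exclusive
locks while it rebuilds, so expect brief stalls:)

    -- rebuild every index in the current database
    REINDEX DATABASE mydb;
    -- or target just the bloated table from the log above
    REINDEX TABLE cpe;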
What on earth could be going on between PostgreSQL 8.1 and Fedora 6
that is bloating and/or corrupting the indexes like this?
Thanks,
Dan
Dan Armbrust wrote:
> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
> that is bloating and/or corrupting the indexes like this?
Postgres 8.1 was slow to vacuum btree indexes. My guess is that your
indexes are so bloated that it takes a lot of time to scan them.
I think the solution here is to vacuum this table far more often.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
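One way to act on that advice without touching autovacuum (a sketch; the
user, database name, and ten-minute schedule are illustrative - only the
"cpe" table name comes from this thread):

    # crontab entry: vacuum the hot table every 10 minutes, so dead rows
    # never accumulate enough to make any single vacuum pass painful
    */10 * * * * psql -U postgres -d mydb -c "VACUUM ANALYZE cpe;"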
On Tue, Jan 6, 2009 at 1:39 PM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> Here is an interesting new datapoint.
>
> Modern Ubuntu distro - PostgreSQL 8.1. SATA drive. No RAID. We cannot
> reproduce the slow vacuum performance - vacuums take less than a second
> for the whole database.
>
> Reinstall the OS - Fedora Core 6 - PostgreSQL 8.1. Push data through
> PostgreSQL for a couple of hours (same as above) and now vacuum reports
> this:
Are you pushing the same amount of data through the Ubuntu server? If
not, then the comparison is invalid; if so, then yeah, there's some
kind of difference between the platforms.
Note that Fedora Core 6 is quite old compared to Ubuntu 8.04 or 8.10.
It's also more likely to be installed on older and/or slower
equipment.
On Tue, Jan 6, 2009 at 3:01 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Dan Armbrust wrote:
>> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
>> that is bloating and/or corrupting the indexes like this?
>
> Postgres 8.1 was slow to vacuum btree indexes. My guess is that your
> indexes are so bloated that it takes a lot of time to scan them.
>
> I think the solution here is to vacuum this table far more often.
Actually, the customer-reported problem is that when they enable
autovacuum, the performance basically tanks because vacuum runs so
slowly that they can't bear to have it run frequently.
Though, perhaps they had bloated indexes before they started
autovacuum, and it never fixed them. Perhaps it will behave properly
if we do a reindex, and then enable autovacuum.
> On Tue, Jan 6, 2009 at 1:39 PM, Dan Armbrust
> <daniel.armbrust.list@gmail.com> wrote:
>> Here is an interesting new datapoint.
>>
>> Modern Ubuntu distro - PostgreSQL 8.1. SATA drive. No RAID. We cannot
>> reproduce the slow vacuum performance - vacuums take less than a second
>> for the whole database.
>>
>> Reinstall the OS - Fedora Core 6 - PostgreSQL 8.1. Push data through
>> PostgreSQL for a couple of hours (same as above) and now vacuum reports
>> this:
>
> Are you pushing the same amount of data through the Ubuntu server? If
> not, then the comparison is invalid; if so, then yeah, there's some
> kind of difference between the platforms.
>
> Note that Fedora Core 6 is quite old compared to Ubuntu 8.04 or 8.10.
> It's also more likely to be installed on older and/or slower
> equipment.
Yep - actually, we pushed much more data through the Ubuntu system and
could never reproduce the problem. On the Fedora Core 6 system, the
problem happened very quickly.
In our testing here, the Ubuntu test was on the same hardware as the
Fedora Core 6 system (not just identical hardware, but the same actual
box).
It seems that there is some sort of bad interaction between some part
of the older OS and PostgreSQL. We have also seen what appears to be
the same issue on a CentOS 4.4 system, which is at a rather similar
package level to Fedora Core 6.
On Tuesday 06 January 2009, "Dan Armbrust" <daniel.armbrust.list@gmail.com>
wrote:
> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
> that is bloating and/or corrupting the indexes like this?
Obviously the choice of operating system has no impact on the contents of
your index.
A better question might be, what did your application or maintenance
procedures do differently in the different tests?
--
Alan
> Obviously the choice of operating system has no impact on the contents of
> your index.
>
> A better question might be, what did your application or maintenance
> procedures do differently in the different tests?
Our problem for a long time has been assuming the "obvious". But we
now have tests that show otherwise.
I'm now thinking something along the lines of an obscure filesystem
or kernel interaction bug that was perhaps corrected in newer
releases of the OS.
Now that we can finally reproduce the problem in house, we are still
doing more tests to figure out specifics - does the problem go away
with Postgres 8.3, ext2/ext3/ReiserFS, etc.?
On Tue, Jan 6, 2009 at 2:07 PM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> Actually, the customer-reported problem is that when they enable
> autovacuum, the performance basically tanks because vacuum runs so
> slowly that they can't bear to have it run frequently.
Actually this is kinda backwards. What's happening is that the vacuum
uses up so much I/O that nothing else can get through. The answer is
to make it run slower, by setting autovacuum_vacuum_cost_delay to 10
or 20 and seeing if they can then run autovacuum during the day
without these issues.
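In postgresql.conf terms, that suggestion looks something like this (a
sketch; the delay value is just the one floated above, not a tested
recommendation):

    # throttle autovacuum so it can't saturate the disk
    autovacuum = on
    autovacuum_vacuum_cost_delay = 20   # ms to sleep each time the cost
                                        # limit is hit; -1 falls back to
                                        # the plain vacuum_cost_delay
    # on 8.1, autovacuum also needs the row-level stats collector:
    stats_start_collector = on
    stats_row_level = on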
Note that vacuum was improved a fair bit from 8.1 to 8.2 and even
more so from 8.2 to 8.3.
> Though, perhaps they had bloated indexes before they started
> autovacuum, and it never fixed them. Perhaps it will behave properly
> if we do a reindex, and then enable autovacuum.
Definitely look at the cost_delay setting. Makes a huge difference.
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes:
INFO: "cpe": found 415925 removable, 50003 nonremovable row versions
in 10849 pages
What on earth could be going on between PostgreSQL 8.1 and Fedora 6
that is bloating and/or corrupting the indexes like this?
You're focusing on the indexes when the problem is dead table rows.
It's very hard to believe that there's any OS dependence as such
involved in that. I wonder whether (a) the Ubuntu and Fedora packages
you're using are the same 8.1.x point release; (b) if there's any
interesting non-default behavior built into the Ubuntu packaging
... like running autovacuum automatically, for instance.
regards, tom lane
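Both of Tom's questions can be checked with a quick psql session on each
box (a sketch; SHOW autovacuum exists in 8.1 and later):

    SELECT version();  -- confirm both boxes run the same 8.1.x point release
    SHOW autovacuum;   -- see whether the packaging turned autovacuum on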
On Tue, Jan 6, 2009 at 3:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes:
INFO: "cpe": found 415925 removable, 50003 nonremovable row versions
in 10849 pagesWhat on earth could be going on between PostgreSQL 8.1 and Fedora 6
that is bloating and/or corrupting the indexes like this?You're focusing on the indexes when the problem is dead table rows.
It's very hard to believe that there's any OS dependence as such
involved in that. I wonder whether (a) the Ubuntu and Fedora packages
you're using are the same 8.1.x point release; (b) if there's any
interesting non-default behavior built into the Ubuntu packaging
... like running autovacuum automatically, for instance.regards, tom lane
In our testing, Postgres 8.1 was built from source (PostgreSQL website
source) on both systems. No distro packages involved.
Believe me, we are as baffled as you. We have been chasing this bug
off and on for months on a couple of different customer sites now.
Thanks,
Dan
I'm no closer to a solution, but here are some additional data points
- all taken on Fedora Core 6.
Postgres 8.1 built from source. Autovacuum disabled.
Create Empty Database.
Run our load on the system for 2 hours to populate and exercise the database.
Run Vacuum. Takes more than a minute.
Run Vacuum immediately again. Takes more than a minute.
Reindex the database. Takes 10 seconds.
Run Vacuum again. Takes 2 seconds.
Allow load to run on the system for 30 minutes.
Run Vacuum again. Takes more than a minute.
Postgres 8.3 built from source. Autovacuum disabled.
Create Empty Database.
Run our load on the system for 2 hours to populate and exercise the database.
Run Vacuum. Takes more than a minute.
Run Vacuum immediately again. Takes 15 seconds.
Run Vacuum immediately again. Takes 15 seconds.
Reindex the database. Takes 10 seconds.
Run Vacuum again. Takes 2 seconds.
So, PostgreSQL 8.3 behaves better, but it still shows some sort of
performance issue which a reindex fixes.
And then of course, the kicker is that we can't recreate any of these
issues when running the same exact test, on the same exact hardware -
but using a different underlying OS. When we were running under a
modern Ubuntu, the vacuum never took more than 2 seconds. We will be
checking other OSs soon. I guess if we can't figure out what is
causing it, we can at least isolate the distros that we need to tell
our customers not to use (or to schedule a reindex if they insist on
not upgrading their OS).
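In the meantime, a quick way to watch whether the indexes are bloating
again between runs (a sketch; the table and index names are the ones from
the logs above, and relpages/reltuples are only refreshed by VACUUM or
ANALYZE, so query right after a vacuum):

    -- relpages climbing while reltuples stays flat suggests bloat
    SELECT relname, relkind, relpages, reltuples
    FROM pg_class
    WHERE relname IN ('cpe', 'pk_cpe', 'ix_cpe_ispid', 'ix_cpe_enable')
    ORDER BY relname;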